A Fast Introduction to Computer Vision

Computer vision applications have become quite ubiquitous in our lives. The applications are varied, ranging from apps that play Virtual Reality (VR) or Augmented Reality (AR) games to applications for scanning documents using smartphone cameras. On our smartphones, we have QR code scanning and face detection, and now we even have facial recognition techniques. Online, we can now search using images and find similar looking images. Photo sharing applications can identify people and make an album based on the friends or family found in the photos. Due to improvements in image stabilization techniques, even with shaky hands, we can create stable videos.

With the recent advancements in deep learning techniques, applications like image classification, object detection, tracking, and so on have become more accurate and this has led to the development of more complex autonomous systems, such as drones, self-driving cars, humanoids, and so on. Using deep learning, images can be transformed into more complex details; for example, images can be converted into Van Gogh style paintings.

Such progress across several domains makes a non-expert wonder how computer vision is capable of inferring this information from images. The motivation lies in human perception and the way we can perform complex analyses of the environment around us. We can estimate the closeness, structure, and shape of objects, and the texture of a surface too. Even under different lighting, we can identify objects and even recognize something if we have seen it before.

Considering these advancements and motivations, one of the basic questions that arises is: what is computer vision? In this chapter, we will begin by answering this question and then provide a broader overview of the various sub-domains and applications within computer vision. Later in the chapter, we will start with basic image operations.

What constitutes computer vision?

In order to begin the discussion on computer vision, observe the following image:

Even if we have never done this activity before, we can clearly tell that the image is of people skiing in the snowy mountains on a cloudy day. This information that we perceive is quite complex and can be subdivided into more basic inferences for a computer vision system.

The most basic observation that we can get from an image is of the things or objects in it. In the previous image, the various things that we can see are trees, mountains, snow, sky, people, and so on. Extracting this information is often referred to as image classification, where we would like to label an image with a predefined set of categories. In this case, the labels are the things that we see in the image.

A wider observation that we can get from the previous image is the landscape. We can tell that the image consists of Snow, Mountain, and Sky, as shown in the following image:

Although it is difficult to create exact boundaries for where the Snow, Mountain, and Sky are in the image, we can still identify approximate regions of the image for each of them. This is often termed as segmentation of an image, where we break it up into regions according to object occupancy.

Making our observation more concrete, we can further identify the exact boundaries of objects in the image, as shown in the following figure:

In the image, we see that people are doing different activities and as such have different shapes; some are sitting, some are standing, some are skiing. Even with these many variations, we can detect objects and create bounding boxes around them. Only a few bounding boxes are shown in the image for illustration; we can observe many more than these.

While, in the image, we show rectangular bounding boxes around some objects, we are not categorizing what object is in the box. The next step would be to say the box contains a person. This combined observation of detecting and categorizing the box is often referred to as object detection.

Extending our observation of people and surroundings, we can say that different people in the image have different heights, even though some are nearer and others are farther from the camera. This is due to our intuitive understanding of image formation and the relations of objects. We know that a tree is usually much taller than a person, even if the trees in the image are shorter than the people nearer to the camera. Extracting the information about geometry in the image is another sub-field of computer vision, often referred to as image reconstruction.

Computer vision is everywhere

In the previous section, we developed an initial understanding of computer vision. With this understanding, there are several algorithms that have been developed and are used in industrial applications. Studying these not only improves our understanding of the system but can also seed newer ideas to improve overall systems.

In this section, we will extend our understanding of computer vision by looking at various applications and their problem formulations:

Image classification: In the past few years, categorizing images based on the objects within has gained popularity. This is due to advances in algorithms as well as the availability of large datasets. Deep learning algorithms for image classification have significantly improved the accuracy while being trained on datasets like ImageNet. We will study this dataset further in the next chapter. The trained model is often further used to improve other recognition algorithms like object detection, as well as image categorization in online applications. In this book, we will see how to create a simple algorithm to classify images using deep learning models.
Object detection: Not just self-driving cars, but robotics, automated retail stores, traffic detection, smartphone camera apps, image filters, and many more applications use object detection. These also benefit from deep learning and vision techniques as well as the availability of large, annotated datasets. We saw an introduction to object detection in the previous section: it produces bounding boxes around objects and also categorizes what object is inside each box.
Object tracking: Robots that follow a target, surveillance cameras, and human interaction are a few of the several applications of object tracking. This consists of determining the location of objects and keeping track of the corresponding objects across a sequence of images.
Image geometry: This is often referred to as computing the depth of objects from the camera. There are several applications in this domain too. Smartphone apps are now capable of computing three-dimensional structures from video captured on the device. Using the reconstructed three-dimensional digital models, further extensions such as AR or VR applications are developed to interface the image world with the real world.
Image segmentation: This is creating cluster regions in images, such that one cluster has similar properties. The usual approach is to cluster image pixels belonging to the same object. Recent applications have grown in self-driving cars and healthcare analysis using image regions.
Image generation: These have a greater impact in the artistic domain, merging different image styles or generating completely new ones. Now, we can mix and merge Van Gogh's painting style with smartphone camera images to create images that appear as if they were painted in a similar style to Van Gogh's.

The field is quickly evolving, not only through making newer methods of image analysis but also finding newer applications where computer vision can be used. Therefore, applications are not just limited to those explained previously.

Developing vision applications requires significant knowledge of tools and techniques. In Chapter 2, Libraries, Development Platform, and Datasets, we will see a list of tools that help in implementing vision techniques. One of the popular tools for this is OpenCV, which includes most of the common computer vision algorithms. For more recent techniques such as deep learning, Keras and TensorFlow can be used to create applications.

Though we will see introductory image operations in the next section, Chapter 3, Image Filtering and Transformations in OpenCV, covers more elaborate image operations of filtering and transformation. These act as initial operations in many applications to remove unwanted information.

In Chapter 4, What is a Feature?, we will be introduced to the features of an image. There are several properties in an image such as corners, edges, and so on that can act as key points. These properties are used to find similarities between images. We will implement and understand common features and feature extractors.

The recent advances in vision techniques for image classification or object detection use advanced features that utilize deep-learning-based approaches. In Chapter 5, Convolutional Neural Networks, we will begin with understanding various components of a convolutional neural network and how it can be used to classify images.

Object detection, as explained before, is a more complex problem of both localizing the position of an object in an image as well as saying what type of object it is. This, therefore, requires more complex techniques, which we will see in Chapter 6, Feature-Based Object Detection, using TensorFlow.

If we would like to know the region of an object in an image, we need to perform image segmentation. In Chapter 7, Segmentation and Tracking, we will see some techniques for image segmentation using convolutional neural networks and also techniques for tracking multiple objects in a sequence of images or video.

Finally, in Chapter 8, 3D Computer Vision, there is an introduction to image reconstruction and to applications of image geometry, such as visual odometry and visual SLAM.

Though we will introduce setting up OpenCV in the next chapter in detail, in the next section we will use OpenCV to perform basic image operations of reading and converting images. These operations will show how an image is represented in the digital world and what needs to be changed to improve image quality. More detailed image operations are covered in Chapter 3, Image Filtering and Transformations in OpenCV.

Getting started

In this section, we will see basic image operations for reading and writing images. We will also see how images are represented digitally.

Before we proceed further with image IO, let's see what an image is made up of in the digital world. An image is simply a two-dimensional array, with each cell of the array containing an intensity value. The simplest image is a black-and-white image with 0s representing black and 1s representing white, which is consistent with the grayscale convention where larger values are brighter. This is also referred to as a binary image. A further extension of this is dividing black and white into a broader grayscale with a range of 0 to 255. An image of this type, in the three-dimensional view, is as follows, where x and y are pixel locations and z is the intensity value:

This is a top view, but on viewing sideways we can see the variation in the intensities that make up the image:

We can see that there are several peaks and that the image intensities are not smooth. Let's apply a smoothing algorithm, the details of which can be found in Chapter 3, Image Filtering and Transformations in OpenCV:

As we can see, pixel intensities form more continuous formations, even though there is no significant change in the object representation. The code to visualize this is as follows (the libraries required to visualize images are described in detail in Chapter 2, Libraries, Development Platform, and Datasets):

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import cv2


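# load an image from the given file path
img = cv2.imread('../figures/building_sm.png')
# cv2.imread returns None instead of raising an error if the file cannot be read
if img is None:
    raise FileNotFoundError('could not read ../figures/building_sm.png')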

# convert the color to grayscale 
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
# resize the image(optional)
gray = cv2.resize(gray, (160, 120))

# apply smoothing operation
gray = cv2.blur(gray,(3,3))

# create grid to plot using numpy
xx, yy = np.mgrid[0:gray.shape[0], 0:gray.shape[1]]

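# create the figure
fig = plt.figure()
# newer Matplotlib releases no longer accept keyword arguments in fig.gca(),
# so the 3D axes are created with add_subplot instead
ax = fig.add_subplot(111, projection='3d')
ax.plot_surface(xx, yy, gray, rstride=1, cstride=1, cmap=plt.cm.gray,
        linewidth=1)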
# show it
plt.show()

This code uses the following libraries: NumPy, OpenCV, and matplotlib.

In later sections of this chapter, we will see operations on images using their color properties. Please download the relevant images from the website to view them clearly.

Reading an image

An image, stored in digital format, consists of a grid structure with each cell containing a value that represents part of the image. In later sections, we will see different formats for images. For each format, the values stored in the grid cells will have a different range.

To manipulate an image or use it for further processing, we need to load the image and use it as a grid-like structure. This is referred to as an image input-output operation, and we can use the OpenCV library to read an image, as follows. Here, change the path to the image file as required:
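As a rough illustration of this grid structure, the sketch below builds three tiny images by hand as NumPy arrays, which is also the type that OpenCV's Python bindings return when an image is read. The array contents are made up for the example; only the shapes, data types, and value ranges matter.

```python
import numpy as np

# binary image: each cell is 0 or 1
binary = np.array([[0, 1, 0],
                   [1, 1, 1],
                   [0, 1, 0]], dtype=np.uint8)

# grayscale image: one 8-bit intensity per cell, 0 (darkest) to 255 (brightest)
gray = np.array([[0, 128, 255],
                 [64, 128, 192]], dtype=np.uint8)

# color image: a third axis holds one value per channel for every cell
color = np.zeros((2, 3, 3), dtype=np.uint8)

for name, image in [("binary", binary), ("gray", gray), ("color", color)]:
    print(name, image.shape, image.dtype, image.min(), image.max())
```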

import cv2 

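# load an image from the given file path
img = cv2.imread('../figures/flower.png')
# cv2.imread returns None if the file is missing or unreadable
if img is None:
    raise FileNotFoundError('could not read ../figures/flower.png')

# display the loaded image in a window titled "Image"
cv2.imshow("Image", img)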

# keeps the window open until a key is pressed
cv2.waitKey(0)

# clears all window buffers
cv2.destroyAllWindows()

The resulting image is shown in the following screenshot:

Here, we read the image in BGR color format, where B is blue, G is green, and R is red. Each pixel in the output is represented collectively by the values of each of these colors. An example of a pixel location and its color values is shown at the bottom of the previous figure.
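The channel order can be checked directly on the array. The following is a small sketch that uses a hand-built 2x2 array as a stand-in for the result of cv2.imread, so it does not depend on any image file; the pixel values are made up for illustration.

```python
import numpy as np

# hypothetical 2x2 image in BGR order, standing in for the result of cv2.imread
img = np.zeros((2, 2, 3), dtype=np.uint8)
img[0, 0] = (255, 0, 0)   # pure blue: B=255, G=0, R=0
img[0, 1] = (0, 0, 255)   # pure red:  B=0, G=0, R=255

# index as img[row, column]; the result holds the three channel values
blue, green, red = img[0, 1]
print("pixel (row 0, col 1): B=%d G=%d R=%d" % (blue, green, red))

# reversing the channel axis gives RGB order, which matplotlib expects
rgb = img[:, :, ::-1]
print(rgb[0, 1])
```

Keeping the BGR order in mind matters whenever an image read with OpenCV is passed to a library that assumes RGB, since red and blue would otherwise appear swapped.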

Image color conversions

An image is made up of pixels and is usually visualized according to the values stored in them. There is also an additional property that distinguishes different kinds of images: each value stored in a pixel is linked to a fixed representation. For example, a pixel value of 10 can represent a gray intensity of 10 or a blue color intensity of 10, and so on. It is therefore important to understand the different color types and their conversion. In this section, we will see color types and conversions using OpenCV:

Grayscale: This is a simple one channel image with values ranging from 0 to 255 that represent the intensity of pixels. The previous image can be converted to grayscale, as follows:

import cv2 

# loads and read an image from path to file
img = cv2.imread('../figures/flower.png')

# convert the color to grayscale 
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# displays the grayscale image
cv2.imshow("Image",gray)

# keeps the window open until a key is pressed
cv2.waitKey(0)

# clears all window buffers
cv2.destroyAllWindows()

The resulting image is as shown in the following screenshot:
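As a rough illustration of what the grayscale conversion computes, the sketch below applies the luminance weights that the OpenCV documentation gives for RGB-to-gray conversion (Y = 0.299 R + 0.587 G + 0.114 B) using NumPy only. The helper bgr_to_gray is hypothetical and written only for this example; its output may differ from cv2.cvtColor by rounding in some pixels.

```python
import numpy as np

def bgr_to_gray(img_bgr):
    """Weighted sum of the channels of a BGR uint8 image (illustrative helper)."""
    b = img_bgr[:, :, 0].astype(np.float64)
    g = img_bgr[:, :, 1].astype(np.float64)
    r = img_bgr[:, :, 2].astype(np.float64)
    y = 0.299 * r + 0.587 * g + 0.114 * b
    return np.round(y).astype(np.uint8)

# one row with a pure blue, a pure green, and a pure red pixel (BGR order)
img = np.array([[[255, 0, 0], [0, 255, 0], [0, 0, 255]]], dtype=np.uint8)
gray = bgr_to_gray(img)

print(img.shape, "->", gray.shape)   # three channels collapse into one
print(gray)
```

Green contributes most to the result and blue least, which is why equally saturated colors end up with different gray intensities.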

HSV and HLS: These are alternative representations of color, where H is hue, S is saturation, V is value, and L is lightness. They are motivated by the human perception system. An example of image conversion for these is as follows:

# convert the color to hsv 
hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)

# convert the color to hls 
hls = cv2.cvtColor(img, cv2.COLOR_BGR2HLS)

This conversion is as shown in the following figure, where an input image read in BGR format is converted to each of the HLS (on left) and HSV (on right) color types:

LAB color space: Denoted L for lightness, A for green-red colors, and B for blue-yellow colors, this consists of all perceivable colors. It is used to convert from one color space (for example, RGB) to another (such as CMYK) because of its device independence. On devices where the format is different from that of the image that is sent, the incoming image color space is first converted to LAB and then to the corresponding space available on the device. The output of converting an RGB image is as follows: