Machine vision has changed dramatically in the wake of its convergence with artificial intelligence. Traditional computer vision systems depended on predefined library functions and user-defined algorithms for all kinds of image processing tasks. In the past decade, researchers recognized that human vision itself is inextricably linked to the mind and to learning. That realization led to the use of deep learning networks in computer vision, such as convolutional neural networks (CNNs) and other deep learning methods. Artificial intelligence vision, or AI vision, is the newer, fancier term for computer vision or machine vision.
AI vision has traditionally been computationally expensive as well as complicated. As artificial intelligence has advanced and semiconductor chips have become increasingly powerful, it has become possible to implement deep learning networks on mobile processors and on embedded systems as simple as 32-bit microcontrollers. This has moved computer vision models from the cloud to edge devices, where computer vision can serve embedded applications and a variety of narrow AI tasks. The AI vision market is expected to reach $48.6 billion in 2022 and is projected to reach $144 billion by 2028. Computer vision is expected to shape the next generation of user-experience technology, where pocket-sized computers in smartphones, gadgets, wearables, and ‘things’ will have human-like vision and greater intelligence.
This article discusses how state-of-the-art computer vision systems work and why computer vision is becoming increasingly important.
What is computer vision?
Computer vision is a field of artificial intelligence that deals with training computers to perceive and comprehend visual information from images, videos, and other visual inputs. AI vision aspires to be as natural as human vision, though the two are very different. The human eye is estimated to have a resolution of 576 megapixels, and all that visual information is processed and analyzed by a highly complex network of brain neurons. Even supercomputers fall far short of the brain's processing speed, and the most advanced cameras do not match the resolution of the human eye.
When it comes to computer vision, capturing visual information as images, videos, or live streams using cameras and sensors is the trivial part. The real challenge is to derive meaningful insights and inferences from the captured visual data computationally. This is where machine learning and deep learning are utilized. The real world is so complex and variable that a computer vision system can only succeed if it is capable of learning from visual information.
If human vision is an evolutionary masterpiece, computer vision has its own advantages. Cameras can capture visual data with better efficiency than human eyes. They can also capture visual information that cannot be accessed by human eyes, such as thermal images, medical scans, and other imaging technologies. Computer vision systems can be designed to be more specific, precise, and accurate than human vision. For example, deep face recognition models have achieved a recognition accuracy of 99.63%, compared with human accuracy of 97.53%.
Computer vision tasks
Before looking at how a computer vision system works, it is important to be familiar with common computer vision tasks. These elementary visual perception tasks help break a large-scale application into simpler problem statements. Each task requires some cognitive functionality for its execution.
- Image classification: Image classification is a fundamental task in computer vision applications. It involves training a neural network to classify images into predefined categories, usually by the specific objects they contain; for example, labeling an image as a picture of a cat or a picture of a dog. If the classification has to be done between only two classes, it is called a binary classification problem; if it has to be done among multiple classes, it is called a multi-class classification problem. In an image classification problem, the entire image is processed as a whole, and an exclusive class/label is assigned to the given image.
Image classification is a supervised learning problem. The model is trained to classify images using a set of sample images that have already been tagged/classified. Once the training is done, new images can be classified into the predefined labels/classes. An image classification model can easily overfit without enough training data. That is why transfer learning, or knowledge transfer, is often used in image classification models: an already trained machine learning model is reused to classify similar objects. This enables building scalable solutions within a small computational footprint. Image classification is often called object classification in AI jargon. A minimal sketch of classifying an image with a pretrained network is shown after this list of tasks.
- Object detection: Object detection goes a step beyond classifying a whole image. While image classification is limited to assigning an exclusive class to an image, object detection analyzes parts of the image to localize objects within it using bounding boxes. This is done by looking for class-specific details within an image, localizing the objects/classes within the image or video, and labeling them with their class names. An image can contain multiple objects, and an object detection model may look for several classes within an image.
Object detection is used in computer vision problems like object identification, object verification, and object recognition. Compared to classical approaches based on hand-crafted features such as SIFT, HOG, and Haar features, deep learning models like R-CNN, YOLO, SSD, and MobileNet-based detectors are more accurate and better performing in object detection tasks.
- Image segmentation: This involves the exact masking of the pixels representing an object within an image. This requires discerning the object from its background and from other objects in the image. Several machine learning and deep learning methods are used for image segmentation. Common classical and machine learning methods include clustering, watershed, edge detection, region growing, region split-and-merge, and thresholding. Typical deep learning models used for image segmentation include FPN, SegNet, PSPNet, and U-Net.
- Object landmark detection: This is related to object detection and segmentation. Instead of masking or boxing the whole object, a set of key points (landmarks) that characterize the object, such as the corners of the eyes on a face or the joints of a body, is located within the image.
- Edge detection: In this task, the boundaries of an object are detected within the image. Often this is a pre-processing step for image segmentation that is performed internally by a specialized edge detection filter within a convolutional network. In many computer vision systems, edge detection is part of image pre-processing and is carried out by applying a classical image processing algorithm.
- Feature extraction and matching: Features are distinctive, informative parts of an object, and feature extraction involves identifying those parts within an image. This is quite useful in object detection, pose estimation, and camera calibration problems. First, features of interest are detected in an image using edge detection or other feature extraction methods. This is followed by describing those features with local descriptors. Finally, the features and their local descriptors are matched across a group of images for feature matching. A minimal sketch of feature matching with ORB features also appears after this list.
- Face recognition: This is a type of object detection task where the object to be detected or recognized is a unique human face. In a face recognition task, features of an image are extracted, localized, classified, and matched to derive an exclusive classification of the image itself. For example, facial features like the eyes, nose, mouth, and ears are identified and localized in the image, their positions are compared against a reference mathematical model, and the features are matched to identify the person.
- Optical character recognition: In this computer vision task, the characters of a language are identified in an image, such as images of number plates or handwritten notes. OCR involves image segmentation over the letters of a language and is often accompanied by meaningful encoding of the text for a given application.
- Image restoration: This task involves restoring old images to revive their quality and/or adding colors to old black & white photos. This is done by reducing additive noise in the image and performing image inpainting to restore damaged pixels or parts of the image. For black & white pictures, restoration may be followed by colorization.
- Pose estimation: In this computer vision task, the posture of an object or human is identified. This involves identifying features, localizing them within the image, and comparing the localized positions of the features relative to each other. Common deep learning models used for pose estimation include PoseNet, MeTRAbs, OpenPose, and DensePose.
- Video motion analysis: This computer vision task involves tracing the trajectory of an object in a video or camera stream and determining its velocity, path, and motion. This highly complex task involves object detection, segmentation, localization, pose estimation, and real-time tracking.
- Scene reconstruction: This is among the most complex tasks in computer vision. It involves the 3D reconstruction of an object or scene from 2D images or videos.
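As a concrete illustration of the image classification task described above, here is a minimal sketch, assuming TensorFlow/Keras is installed and using a hypothetical image file named `cat.jpg`, that classifies a single image with a MobileNetV2 network pretrained on ImageNet:

```python
import numpy as np
from tensorflow.keras.applications import MobileNetV2
from tensorflow.keras.applications.mobilenet_v2 import preprocess_input, decode_predictions
from tensorflow.keras.preprocessing import image

# Load a MobileNetV2 network pretrained on the 1000-class ImageNet dataset
model = MobileNetV2(weights="imagenet")

# Load the image (hypothetical file name) and resize it to the 224x224 input the network expects
img = image.load_img("cat.jpg", target_size=(224, 224))
x = image.img_to_array(img)                  # height x width x channels array of pixel values
x = preprocess_input(np.expand_dims(x, 0))   # scale pixel values and add a batch dimension

# Run inference and print the top-3 predicted classes with their confidence scores
preds = model.predict(x)
for _, label, score in decode_predictions(preds, top=3)[0]:
    print(f"{label}: {score:.2%}")
```

Because the network was trained on ImageNet, it can only assign one of ImageNet's classes; classifying domain-specific categories requires retraining or transfer learning, as discussed later in this article.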
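Feature extraction and matching can likewise be sketched with classical, hand-crafted features. The snippet below is a minimal sketch, assuming OpenCV (`opencv-python`) and two hypothetical image files of the same object or scene; it detects ORB keypoints in both images, matches their binary descriptors, and saves a visualization of the strongest matches:

```python
import cv2

# Load two images of the same object or scene in grayscale (hypothetical file names)
img1 = cv2.imread("object.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)

# Detect keypoints and compute binary descriptors with ORB
orb = cv2.ORB_create(nfeatures=500)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

# Match descriptors with a brute-force Hamming-distance matcher and sort by match quality
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)

# Draw the 30 strongest matches for visual inspection
vis = cv2.drawMatches(img1, kp1, img2, kp2, matches[:30], None)
cv2.imwrite("matches.jpg", vis)
```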
How computer vision works
A computer vision system has three levels of operation as follows.
- Acquiring images: First of all, a computer vision system acquires images, videos, or other forms of visual input (like scans) from a camera or sensor. The captured images/videos/streams are transferred to a computer system and stored for further processing.
- Processing the images: The raw images need to be prepared to represent appropriate data. This is done through pre-processing steps such as reducing noise, adjusting contrast, re-scaling, and cropping the images. Most of these jobs are automated within a computer vision system. Some of these steps are performed at the hardware level, while others are performed using suitable filters within a convolutional network or by applying image processing functions to the captured raw data. A short pre-processing sketch follows this list.
- Understanding images: This is the most important part of a computer vision system. It is the implementation of the actual computer vision task, using either conventional image processing or a deep learning model.
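Here is a minimal sketch of the pre-processing step described in the list above, assuming OpenCV and a hypothetical raw capture named `frame.jpg`; it reduces noise, adjusts contrast, re-scales, and crops the image:

```python
import cv2

# Read a raw captured frame (hypothetical file name)
img = cv2.imread("frame.jpg")

# Reduce noise with a Gaussian blur
denoised = cv2.GaussianBlur(img, (5, 5), 0)

# Adjust contrast by equalizing the histogram of the luminance channel
ycrcb = cv2.cvtColor(denoised, cv2.COLOR_BGR2YCrCb)
ycrcb[:, :, 0] = cv2.equalizeHist(ycrcb[:, :, 0])
contrast = cv2.cvtColor(ycrcb, cv2.COLOR_YCrCb2BGR)

# Re-scale to the input size expected by a downstream model
resized = cv2.resize(contrast, (224, 224))

# Crop a central region of interest with simple array slicing
h, w = resized.shape[:2]
cropped = resized[h // 4 : 3 * h // 4, w // 4 : 3 * w // 4]

cv2.imwrite("preprocessed.jpg", cropped)
```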
Artificial intelligence has largely displaced conventional image processing in computer vision, and a deep learning network is now the default recipe for most computer vision problems.
The first step in understanding images is feature engineering. The captured images are converted into arrays of pixel values. Images require a lot of data for their computational representation, and color images need considerable memory for storage and interpretation within a model. Following a proper computational representation of the images, parts of the images are identified as objects using blobs, edges, and corners. This is a CPU-intensive and time-consuming process, which is why object detection is often automated using transfer learning. Major companies working on computer vision and AI have shared their datasets and deep learning models as open-source assets to ease and automate the process of object detection in images.
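A minimal sketch of this step, assuming OpenCV and a hypothetical input image, shows the image as a pixel array and locates edges, corners, and blobs that can serve as candidate features:

```python
import cv2

# Once loaded, an image is just an array of pixel values
gray = cv2.imread("sample.jpg", cv2.IMREAD_GRAYSCALE)     # hypothetical file name
print("shape:", gray.shape, "dtype:", gray.dtype)          # e.g. (480, 640) uint8

# Edges: Canny detector with lower/upper hysteresis thresholds
edges = cv2.Canny(gray, 100, 200)

# Corners: Shi-Tomasi "good features to track"
corners = cv2.goodFeaturesToTrack(gray, maxCorners=100, qualityLevel=0.01, minDistance=10)
print("corners found:", 0 if corners is None else len(corners))

# Blobs: simple blob detector with default parameters
detector = cv2.SimpleBlobDetector_create()
keypoints = detector.detect(gray)
print("blobs found:", len(keypoints))

cv2.imwrite("edges.jpg", edges)
```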
This is followed by training the convolutional networks for the domain-specific tasks. Each computer vision application/task requires a specific dataset. For example, a traffic monitoring application will need a dataset to identify and classify vehicles, while a cancer detection application will need a dataset of medical scans and reports. How a dataset is utilized for training a neural network model depends on the computer vision tasks involved in the particular application. Accordingly, appropriate deep learning models are applied, and the associated performance metrics are monitored.
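The following is a minimal transfer-learning training sketch, assuming TensorFlow/Keras and a hypothetical domain-specific dataset arranged in class-named folders under `dataset/` (for example, images of vehicle types for a traffic monitoring application). An ImageNet-pretrained MobileNetV2 backbone is frozen and only a small classification head is trained, with accuracy monitored as the performance metric:

```python
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import MobileNetV2

# Load a domain-specific dataset from class-named folders (hypothetical path)
train_ds = tf.keras.utils.image_dataset_from_directory(
    "dataset/", image_size=(224, 224), batch_size=32)
num_classes = len(train_ds.class_names)

# Reuse an ImageNet-pretrained backbone and freeze its weights (transfer learning)
base = MobileNetV2(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False

# Add a small task-specific classification head on top of the frozen backbone
model = models.Sequential([
    layers.Rescaling(1.0 / 127.5, offset=-1),   # scale pixels to [-1, 1] as MobileNetV2 expects
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(num_classes, activation="softmax"),
])

# Train only the new head and monitor accuracy as the performance metric
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(train_ds, epochs=5)
```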
Challenges in computer vision
There are several challenges in computer vision applications. Often these challenges are related to the acquisition of the images, feature engineering, or interpretation of the visual data. For example, differences in lighting will naturally compromise a computer vision application that depends on identifying an object's colors or appearance. The presence of noise or unwanted features in images is another common problem. Real-life conditions often add such undesirable features or noise to images and videos; for example, images captured by a surveillance camera become blurred during rain or dust storms. Similarly, overlapping objects in an image are always difficult to identify.
Another set of challenges appears in feature engineering. The physical world is so varied and versatile that choosing appropriate features for extraction and matching in a given application can become a daunting task. The same object looks different from different angles, and the same class of object may have a variety of colors and internal features. For example, a cat may look different from different angles of view; cats of the same breed have different coat colors and patches, and cats of different breeds have similar yet distinct body features. Therefore, a deep learning model must be supplied with images of the object from different angles and in different variations so that it generalizes well. It is even possible that two different objects have similar features and resemblances, resulting in false matches. That is why thousands or millions of images are required to train a deep learning network to identify an object, and this usually involves detecting and matching hundreds of features for the same class/label.
Finally, a computer vision system may fail due to improper or insufficient interpretation of the visual data. This usually happens due to a lack of context or general intelligence in computer vision networks. After all, computer vision systems rely on pattern identification in images. They can interpret images only in the context supplied to them, or within the confines of the set of features utilized in feature engineering. Deep learning networks can derive meaningful representations of the objects/classes through convolution but cannot generate context and references on their own.
Conclusion
Convolutional neural networks have brought a revolution to the field of computer vision. Machine vision has also moved from cloud to edge computing with advances in computational technology. Empowered by artificial intelligence and sophisticated chip technology, computer vision is now applicable in many domains. The development of computer vision and artificial intelligence will continue hand in hand. Computer vision has already outgrown its early limitations; its future lies in further developments in the area of artificial general intelligence.