From procedural recognition to YOLO. With single-pass detectors, computer vision has made a generational leap; here is a look inside.
Computer vision is one of the fields where Artificial Intelligence is expanding. Just think of autonomous, driverless cars, a field where Tesla has been leading the way and which all the other car manufacturers are now diving into.
For years, recognition and categorization have been a problem, especially given how hard it is for a traditional algorithm to recognize the same object in different positions and from different angles. Because this task seems so easy and spontaneous to us, the problems encountered in automatic recognition are not obvious at first.
We should distinguish two classes of problems: categorization and localization. Even the first, the simpler of the two, presents some non-trivial difficulties.
It is easy for us, for example, to recognize a chair, but would you be able to describe it unequivocally? We could define it as a piece of furniture to sit on with four legs, armrests, and a backrest. However, looking at the image below, we already notice problems: some chairs have only three legs, some only two, the red one in fact has just one, the office chair is on wheels, and so on.
Yet for us, it is straightforward to identify them all as chairs. Teaching a machine to recognize them by presenting every possible exception is obviously impossible. Consequently, rule-based recognition is doomed to produce unsatisfactory results at best, full of false positives (chairs recognized where there are none) and false negatives (chairs not recognized as such). The problem becomes even more complicated if the objects are presented in different orientations, or with missing parts (see below).
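To make the failure mode concrete, here is a minimal sketch of the four-legs, armrests and backrest rule applied to a handful of hand-written descriptions. The `Furniture` class, its fields and the sample objects are hypothetical, invented purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Furniture:
    """Hypothetical hand-written description of an object in a scene."""
    name: str
    legs: int
    has_backrest: bool
    has_armrests: bool
    is_chair: bool  # ground truth, used only to score the rule


def rule_says_chair(item: Furniture) -> bool:
    # The definition from the text: four legs, armrests and a backrest.
    return item.legs == 4 and item.has_backrest and item.has_armrests


samples = [
    Furniture("armchair", 4, True, True, True),
    Furniture("three-legged chair", 3, True, False, True),
    Furniture("pedestal chair", 1, True, False, True),
    Furniture("office chair on wheels", 0, True, True, True),
    Furniture("garden bench", 4, True, True, False),
]

# Chairs the rule misses, and non-chairs the rule wrongly accepts.
false_negatives = [s.name for s in samples if s.is_chair and not rule_says_chair(s)]
false_positives = [s.name for s in samples if not s.is_chair and rule_says_chair(s)]

print("missed chairs:", false_negatives)
print("wrongly accepted:", false_positives)
```

Every exception could be patched with one more rule, but each patch tends to let in new false positives, which is the dead end described above.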
Without digging too deep into the history of automatic object recognition, we can say that before the era of deep learning, one of the most successful attempts at face detection was Viola-Jones. The algorithm was relatively simple: first, a sort of map representing the features of a face was built from thousands of simple binary classifiers based on Haar features. These weak classifiers were then selected and combined through boosting (AdaBoost) into a cascade, which was used to locate the face itself inside the scene. The algorithm was so simple and fast that it is still used today in some low-end point-and-shoot cameras. However, it suffered from exactly the kind of problems described above: it was not flexible enough to generalize to objects that differed even slightly from the training set.
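As a rough illustration of what a Haar feature computes, the sketch below builds an integral image (the summed-area table used in the Viola-Jones paper to make rectangle sums cheap) and evaluates one two-rectangle feature on a toy patch. The function names and the 4x4 patch are assumptions made for this example, and the boosting and cascade stages are left out entirely.

```python
def integral_image(img):
    """Summed-area table with an extra zero row and column."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def rect_sum(ii, x, y, w, h):
    """Sum of the pixels in the rectangle with top-left (x, y), in 4 lookups."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]


def haar_two_rect_horizontal(ii, x, y, w, h):
    """Left half minus right half: responds to vertical edges."""
    half = w // 2
    left = rect_sum(ii, x, y, half, h)
    right = rect_sum(ii, x + half, y, half, h)
    return left - right


# Toy 4x4 grayscale patch: bright on the left, dark on the right.
patch = [
    [9, 9, 1, 1],
    [9, 9, 1, 1],
    [9, 9, 1, 1],
    [9, 9, 1, 1],
]
edge_response = haar_two_rect_horizontal(integral_image(patch), 0, 0, 4, 4)

# A uniform patch has no edge, so the same feature gives zero.
flat = [[5] * 4 for _ in range(4)]
flat_response = haar_two_rect_horizontal(integral_image(flat), 0, 0, 4, 4)

print(edge_response, flat_response)
```

Each feature costs only a few table lookups regardless of its size, which helps explain why the detector could run on very modest hardware.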
More precise were algorithms such as that of Dalal and Triggs, which used HOG (Histograms of Oriented Gradients). In addition to edges, this approach takes into account the orientation of the gradients in each portion of the image, and uses an SVM for classification.
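The core of the descriptor can be sketched as a per-cell histogram of gradient orientations, weighted by gradient magnitude. This is a simplified sketch: the 9 unsigned bins are a common choice rather than something stated here, and the block normalization and vote interpolation of the full method are omitted.

```python
import math


def cell_histogram(cell, bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientations (0-180 degrees)."""
    h, w = len(cell), len(cell[0])
    hist = [0.0] * bins
    bin_width = 180.0 / bins
    # Border pixels are skipped so that central differences are always defined.
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = cell[y][x + 1] - cell[y][x - 1]  # horizontal central difference
            gy = cell[y + 1][x] - cell[y - 1][x]  # vertical central difference
            magnitude = math.hypot(gx, gy)
            if magnitude == 0:
                continue
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[int(angle // bin_width) % bins] += magnitude
    return hist


# Toy 6x6 cell with a vertical edge: intensity changes only along x.
cell = [[0, 0, 0, 10, 10, 10] for _ in range(6)]
hist = cell_histogram(cell)
print([round(v, 1) for v in hist])
```

All the gradient energy of the toy cell lands in the first bin (orientations near 0 degrees), which is how the descriptor encodes "there is a vertical edge here". In the full method, histograms like this one are concatenated into a feature vector and fed to the SVM.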
However, although HOG obtained much more precise results than Viola-Jones, it was substantially slower. Furthermore, the main problem remained the lack of robustness, and the consequent difficulty in recognizing images with a certain amount of “noise” or distractions in the background.
Another problem with those algorithms was that each could recognize only a single kind of object, and they weren’t good at generalizing. In other words, they could be “configured” for only one type of image (faces, dogs, etc.), they struggled with the problems listed above, and the image formats they could work on were very limited.
Deep Learning to the rescue
To be really useful, object recognition should be able to work on complex scenes, like those we face in everyday life (below).
The expansion of the use of neural networks in the era of Big Data, and the consequent popularity of Deep Learning, really changed the game, especially thanks to the development of Convolutional Neural Networks (CNNs).
An approach common to almost all the algorithms (including the previous ones) was that of the “sliding window”, that is, scanning the whole image area zone by zone, analyzing one portion (the window) at a time.
With CNNs, the idea is to repeat the process with different window sizes, obtaining for each window a prediction of its content with a degree of confidence. In the end, the predictions with a lower degree of confidence are discarded.
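The multi-scale sliding window can be sketched as follows. The window sizes, stride, threshold and the `toy_scorer` function are assumptions chosen for illustration; `toy_scorer` is a hypothetical stand-in for a CNN classifier.

```python
def sliding_windows(img_w, img_h, window_sizes, stride):
    """Yield (x, y, size) for every square window that fits in the image."""
    for size in window_sizes:
        for y in range(0, img_h - size + 1, stride):
            for x in range(0, img_w - size + 1, stride):
                yield x, y, size


def detect(img_w, img_h, score_window, window_sizes=(32, 64), stride=16, threshold=0.5):
    """Score every window and keep only the confident predictions."""
    kept = []
    evaluated = 0
    for x, y, size in sliding_windows(img_w, img_h, window_sizes, stride):
        evaluated += 1
        label, confidence = score_window(x, y, size)
        if confidence >= threshold:
            kept.append((label, confidence, (x, y, size)))
    return kept, evaluated


def toy_scorer(x, y, size):
    # Hypothetical stand-in for a CNN: confident for exactly one window.
    return ("chair", 0.9) if (x, y, size) == (32, 32, 64) else ("chair", 0.1)


detections, n_windows = detect(128, 128, toy_scorer)
print(n_windows, detections)
```

Even on this tiny 128x128 example the classifier is invoked 74 times, once per window, to find a single object. On real images, with more scales and a finer stride, that per-window cost is what makes the approach slow.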
YOLO, the pioneer of Single Shot Detectors
Today, we need far more than simple classification or localization in static images. What we need is real-time analysis: no one would want to sit in an autonomous car that takes several minutes (or even seconds) to recognize images!
The solution is to use single-pass convolutional networks, which analyze all parts of the image in parallel and so avoid the need for sliding windows.
YOLO was developed by Redmon, Farhadi and colleagues in 2015, during Redmon's doctorate. The concept is to resize the image and divide it into a grid of square cells. In v3 (the last version by the original authors), YOLO makes predictions at 3 different scales, downsampling the image by factors of 32, 16 and 8 respectively, in order to remain accurate on smaller objects (previous versions had problems with small objects). At each of the 3 scales, each cell is responsible for predicting 3 bounding boxes, using 3 anchor boxes. An anchor box is nothing but a rectangle of pre-defined proportions; anchors are used to get a closer match between predicted and ground-truth bounding boxes (Andrew Ng gives an excellent explanation of them).
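A quick calculation shows how many boxes this scheme produces in a single pass. The 416x416 input size is an assumption (a commonly used default, not stated above); the strides and the 3 boxes per cell come from the description.

```python
def yolo_v3_grid_summary(input_size=416, strides=(32, 16, 8), boxes_per_cell=3):
    """Return per-scale grid sizes and the total number of predicted boxes."""
    summary = []
    total = 0
    for stride in strides:
        cells_per_side = input_size // stride
        boxes = cells_per_side * cells_per_side * boxes_per_cell
        summary.append((stride, cells_per_side, boxes))
        total += boxes
    return summary, total


summary, total_boxes = yolo_v3_grid_summary()
for stride, side, boxes in summary:
    print(f"stride {stride:>2}: {side}x{side} grid, {boxes} boxes")
print("total boxes:", total_boxes)
```

Under these assumptions the three grids are 13x13, 26x26 and 52x52, for 10,647 candidate boxes in total, all produced by one forward pass of the network rather than by one classifier call per window.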
YOLO v3, as trained on the COCO dataset, recognizes 80 different classes. At the end of processing, only the bounding boxes with the highest confidence are kept, and the others are discarded.
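This final filtering step is usually implemented as a confidence threshold followed by non-maximum suppression (NMS). The sketch below is a minimal, class-agnostic version; the thresholds and the sample boxes are assumptions for illustration, and real implementations typically run the suppression per class.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(detections, score_threshold=0.5, iou_threshold=0.5):
    """detections: list of (box, score). Returns the surviving detections."""
    # Drop low-confidence boxes, then visit the rest from most to least confident.
    candidates = sorted(
        (d for d in detections if d[1] >= score_threshold),
        key=lambda d: d[1],
        reverse=True,
    )
    kept = []
    for box, score in candidates:
        # Keep a box only if it does not overlap too much with a better one.
        if all(iou(box, kept_box) < iou_threshold for kept_box, _ in kept):
            kept.append((box, score))
    return kept


detections = [
    ((10, 10, 110, 110), 0.90),    # best box for object A
    ((15, 12, 112, 108), 0.75),    # near-duplicate of A
    ((200, 50, 260, 150), 0.80),   # object B
    ((300, 300, 320, 320), 0.20),  # low confidence
]
result = non_max_suppression(detections)
print(result)
```

Of the four candidates, only the best box for each of the two objects survives: the low-confidence box fails the score threshold, and the near-duplicate is suppressed because it overlaps the 0.90 box with an IoU of about 0.89.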
YOLO v3 is much more precise than previous versions and, despite being a bit slower, remains one of the fastest algorithms around. Its architecture is a 106-layer fully convolutional variant of Darknet. Also interesting is Tiny YOLO, which runs on Tiny Darknet and can work on constrained devices such as smartphones.
Below you can see real-time footage of YOLO v3 at work.
Viola P, Jones M. – Rapid Object Detection using a Boosted Cascade of Simple Features (CVPR 2001)
Dalal N, Triggs B. – Histograms of Oriented Gradients for Human Detection (CVPR 2005)
Redmon J, Divvala S, Girshick R, Farhadi A. – You Only Look Once: Unified, Real-Time Object Detection (arXiv:1506.02640v5)
Redmon J, Farhadi A. – YOLOv3: An Incremental Improvement (arXiv:1804.02767v1)
Wei Liu et al. – SSD: Single Shot MultiBox Detector (arXiv:1512.02325v5)