Convolutional Neural Networks: recognition and categorization by progressive abstraction
Convolutional Neural Networks (CNNs, or ConvNets) are among the most commonly used Deep Learning algorithms in computer vision. They are applied in many fields, from autonomous cars and drones to medical diagnosis and assistive technology for the visually impaired.
What is a convolution?
Wolfram Alpha explains it this way:
A convolution is an integral that expresses the amount of overlap of one function g as it is shifted over another function f. It therefore “blends” one function with another. For example, in synthesis imaging, the measured dirty map is a convolution of the “true” CLEAN map with the dirty beam (the Fourier transform of the sampling distribution).
Mathematically speaking, it means “sliding” one function (blue) over another (red), “mixing” them together. The result is a third function (green) whose value at each shift is the integral of the product of the two functions, that is, a measure of how much they overlap at that position.
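The discrete version of this "sliding and mixing" can be sketched in a few lines of Python. This is only an illustration: the function name convolve_1d and the sample values are made up for this example, and the integral is replaced by a sum over array positions.

```python
import numpy as np

def convolve_1d(f, g):
    """Discrete 'full' convolution: out[n] = sum over k of f[k] * g[n - k]."""
    f = np.asarray(f, dtype=float)
    g = np.asarray(g, dtype=float)
    out = np.zeros(len(f) + len(g) - 1)
    for n in range(len(out)):          # each shift of g over f
        for k in range(len(f)):
            if 0 <= n - k < len(g):    # only positions where the two overlap
                out[n] += f[k] * g[n - k]
    return out

signal = [0, 1, 2, 3, 0]   # plays the role of the red function (illustrative values)
kernel = [0.5, 0.5]        # plays the role of the blue function (a 2-point average)
result = convolve_1d(signal, kernel)   # plays the role of the green function

# The hand-written loop agrees with NumPy's built-in implementation.
assert np.allclose(result, np.convolve(signal, kernel))
print(result)
```

With this averaging kernel, each output value is the mean of two neighbouring input values, which is one simple way to see how the filter "blends" with the signal.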
Convolutions applied to neural networks
This is a strictly mathematical definition of the process, but why does it matter to us?
In image analysis, for example, the red function represents the input image, while the blue one is known as the “filter”, because it picks out a particular signal or structure in the image.
In simpler terms, a first step in analyzing an image could be to recognize the outlines of the figures. Simplifying, we would have one filter for vertical lines, one for horizontal lines, and one for diagonals, each sliding over the whole image.
Subsequent layers could recognize eyes, ears, hands, etc. Eventually, the last layer might be able to recognize and identify sheep, people, or cars.
First of all, it is worth understanding why we need something as elaborate as convolutions: would it not be simpler to use a fully connected network? After all, in a fully connected network every node sees the whole input, so no information is discarded. The problem is that a fully connected network leads to an explosion in the number of nodes and connections required (see note 1).
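To make the scale of the problem concrete, here is a small back-of-the-envelope sketch. The helper names are hypothetical, and the convolutional configuration (16 filters of size 3 × 3) is an assumption chosen only for comparison; it does not come from the article.

```python
def dense_connections(height, width, channels, hidden_units):
    """Connections in one fully connected layer: every input feeds every unit."""
    inputs = height * width * channels
    return inputs * hidden_units

def conv_weights(kernel_h, kernel_w, in_channels, num_filters):
    """Weights in one convolutional layer: each filter is shared across
    all image positions, so the count does not depend on the image size.
    Biases are left out to keep the comparison simple."""
    return kernel_h * kernel_w * in_channels * num_filters

# 32 x 32 color image, first layer with as many units as inputs (3072).
small_dense = dense_connections(32, 32, 3, hidden_units=32 * 32 * 3)

# 1000 x 1000 image, counting a single channel, with 1M units.
large_dense = dense_connections(1000, 1000, 1, hidden_units=1000 * 1000)

# Assumed convolutional layer: 16 filters of 3 x 3 over 3 color channels.
small_conv = conv_weights(3, 3, 3, num_filters=16)

print(small_dense, large_dense, small_conv)
```

The point of the comparison is that the number of convolutional weights stays the same whether the image is 32 × 32 or 1000 × 1000, while the fully connected count grows with the square of the number of pixels.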
Convolutions and filters
Convolutional neural networks are structured like any other neural network: an input layer, one or more hidden layers that perform calculations via activation functions, and an output layer with the result. The difference lies precisely in the convolutions.
Each convolutional layer produces a “feature map” for each feature its nodes scan for. In the example below, the first layer could be used to encode vertical and horizontal lines: we slide a specific filter (2 × 2 in this example) across the image and, at each position, multiply it element by element with the area beneath it and sum the results (a scalar product).
This operation matches the function we saw at the beginning of the article, where the input image corresponds to the red curve, the filter to the blue one, and the feature map to the green one. In practice, it means that inside the convolutional layer each node is connected only to a subset of the input nodes (its receptive field), and multiplying the filter by the receptive field of each node is conceptually equivalent to “sliding” the filter along the input image (windowing).
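The windowing described above can be sketched as follows. The function name slide_filter, the 4 × 4 image and the 2 × 2 vertical-edge filter are all illustrative assumptions. Note also that, as is common in CNN implementations, the filter is applied without flipping it, which strictly speaking is a cross-correlation rather than a convolution in the mathematical sense.

```python
import numpy as np

def slide_filter(image, kernel):
    """Slide `kernel` over `image` and take the scalar product at each position."""
    image = np.asarray(image, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1   # no padding: the output shrinks
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            receptive_field = image[i:i + kh, j:j + kw]
            feature_map[i, j] = np.sum(receptive_field * kernel)
    return feature_map

# Toy image: dark left half, bright right half (a vertical edge in the middle).
image = np.array([
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
])

# Assumed 2 x 2 filter that responds to a dark-to-bright vertical transition.
vertical_edge = np.array([
    [-1, 1],
    [-1, 1],
])

feature_map = slide_filter(image, vertical_edge)
print(feature_map)
```

The resulting 3 × 3 feature map is non-zero only in its middle column, which is exactly where the filter sits on top of the edge.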
The result is another matrix, called the feature map, slightly smaller than the original image (or the same size if zero padding is used; see note 2). Several filters may be applied in each layer, thus generating multiple feature maps.
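How much smaller the feature map gets can be worked out with the usual size formula. The sketch below assumes a square image and filter; the helper name output_size and the choice of a 3 × 3 filter with one pixel of padding are assumptions made for illustration.

```python
import numpy as np

def output_size(input_size, kernel_size, padding=0, stride=1):
    """Side length of the feature map for a square input and filter."""
    return (input_size + 2 * padding - kernel_size) // stride + 1

# Without padding, a 3 x 3 filter shrinks a 5 x 5 image to 3 x 3.
no_padding = output_size(5, 3)

# With a one-pixel border of zeros, the output keeps the input size.
with_padding = output_size(5, 3, padding=1)

# np.pad adds the border of zeros around the image.
image = np.ones((5, 5))
padded = np.pad(image, pad_width=1, mode="constant", constant_values=0)

print(no_padding, with_padding, padded.shape)
```

The same formula gives a 3 × 3 map for a 2 × 2 filter on a 4 × 4 image. For even-sized filters, keeping the size exactly the same requires padding one side more than the other, which is one reason odd-sized filters are a common choice.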
Typically each convolutional layer is followed by a max-pooling layer (see note 3), which gradually reduces the size of the matrix while increasing the level of “abstraction”. We thus move from elementary filters, such as vertical and horizontal lines, to progressively more sophisticated ones, which might recognize the lights, the windshield… up to the last level, which might distinguish a car from a truck.
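As a rough illustration of the pooling step, the sketch below splits a feature map into non-overlapping 2 × 2 blocks and keeps the largest value of each. The block size, the function name max_pool_2x2 and the sample values are assumptions for this example.

```python
import numpy as np

def max_pool_2x2(feature_map):
    """Keep the maximum of each non-overlapping 2 x 2 block."""
    fm = np.asarray(feature_map, dtype=float)
    h2, w2 = fm.shape[0] // 2, fm.shape[1] // 2
    trimmed = fm[:h2 * 2, :w2 * 2]          # drop an odd last row/column, if any
    blocks = trimmed.reshape(h2, 2, w2, 2)  # axes: block row, row in block, block col, col in block
    return blocks.max(axis=(1, 3))

feature_map = np.array([
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 8],
])

pooled = max_pool_2x2(feature_map)
print(pooled)
```

Each side of the matrix is halved, so a 4 × 4 feature map becomes 2 × 2 while the strongest activation in each region is preserved.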
1. To give an idea, a very small 32 × 32 pixel color image would already have 3072 nodes in the first layer (32 × 32 pixels × 3 color channels), with 3072 connections each (more than 9 million in total). A more reasonable 1000 × 1000 image would lead to 1M nodes (even counting a single channel) with 1M connections each, for a total of 10^12 connections! Clearly, a solution of this type does not scale and is totally unrealistic.
2. Zero padding is a technique that consists of adding a “border” of zeros around the image (see below), in order to preserve the size of the image leaving the layer and avoid losing information at the edges.
3. Max-pooling is a method to reduce the size of an image by dividing it into blocks and keeping only the highest value in each block. This helps reduce overfitting, and only the areas with the strongest activation are retained.