Convolutional neural networks (CNNs) are a form of multilayer neural network. A typical CNN diagram is shown here. The first part consists of convolutional layers and a max pooling layer, which together act as a feature extractor. The second part consists of a fully connected layer that performs nonlinear transformations of the extracted features and acts as a classifier.

In the diagram, the input is fed into a network of sequential Conv, Pool and Dense layers. The output can be a softmax layer indicating whether the image contains a cat or something else. Alternatively, a sigmoid layer can be used as the output, giving the probability that the image is a cat. Let us consider the layers in more detail.
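The two output options mentioned above can be compared with a small sketch. The scores here are made-up values standing in for the output of the last Dense layer, and the function names are illustrative rather than taken from any particular library.

```python
import numpy as np

def softmax(logits):
    """Turn a vector of scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

def sigmoid(z):
    """Map a single score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical scores for the two classes [cat, not cat].
two_class_logits = np.array([2.0, 0.5])
p_softmax = softmax(two_class_logits)

# With only two classes, softmax over (a, b) gives the same
# probability for the first class as sigmoid(a - b).
p_sigmoid = sigmoid(two_class_logits[0] - two_class_logits[1])

print(p_softmax)  # two probabilities that sum to 1
print(p_sigmoid)  # probability of "cat"
```

For a two-class problem the two formulations therefore carry the same information; softmax becomes necessary once there are more than two classes.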

A convolutional layer can be thought of as the eyes of a convolutional neural network. Neurons in this layer look for specific features. Convolution can be viewed as a weighted sum between two signals or functions. An example of a convolution operation on a 5 × 5 matrix with a 3 × 3 kernel is shown below. The convolution kernel slides across the matrix to produce an activation map.
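A minimal NumPy sketch of the sliding-window operation on a 5 × 5 matrix with a 3 × 3 kernel follows. The input values and the averaging kernel are arbitrary choices for illustration. As in most deep learning libraries, the kernel is not flipped, so strictly speaking this computes a cross-correlation.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the activation map."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Weighted sum of the image patch currently under the kernel.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)  # illustrative 5x5 input
kernel = np.ones((3, 3)) / 9.0                    # illustrative 3x3 averaging kernel
activation_map = conv2d_valid(image, kernel)
print(activation_map.shape)  # (3, 3)
```

The 3 × 3 kernel fits into the 5 × 5 matrix at 3 positions along each axis, which is why the activation map is 3 × 3.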

Suppose the input image is 32x32x3, i.e. a three-dimensional array of depth 3. Any convolution filter that we define on this layer must have a depth equal to the depth of the input. Therefore, we can choose convolution filters of depth 3 (for example, 3x3x3, 5x5x3 or 7x7x3). We choose a 3x3x3 convolution filter, so the convolution kernel is a cube instead of a square.

If we perform the convolution by sliding the 3x3x3 filter over the entire 32x32x3 image, we get an output of size 30x30x1. This is because the filter cannot be centred on the 1-pixel-wide border around the image. The filter must always lie entirely inside the image, so 1 pixel is lost on the left, right, top and bottom, i.e. 2 pixels in total along each dimension.
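The shape arithmetic can be checked with a short sketch, assuming NumPy and random values in place of a real image and a learned filter. The function name is illustrative.

```python
import numpy as np

def conv_single_filter(image, filt):
    """Apply one filter of full input depth to `image` (no padding, stride 1)."""
    fh, fw, fd = filt.shape
    assert fd == image.shape[2], "filter depth must equal input depth"
    out_h = image.shape[0] - fh + 1
    out_w = image.shape[1] - fw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # The filter covers all channels at once, so each position yields one number.
            out[i, j] = np.sum(image[i:i + fh, j:j + fw, :] * filt)
    return out

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))  # stand-in for a 32x32 RGB image
filt = rng.random((3, 3, 3))     # stand-in for a learned 3x3x3 filter
out = conv_single_filter(image, filt)
print(out.shape)  # (30, 30)
```

Because the filter spans the full depth of the input, the three channels collapse into a single value per position, giving a depth of 1.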

For an input image of 32x32x3 and a filter of size 3x3x3, there are 30x30x1 filter positions, and there is a neuron for each position. The 30x30x1 outputs, or activations, of all these neurons are called an activation map. The activation map of one layer serves as the input for the next layer. In our example there are 30 × 30 = 900 neurons, because that is the number of positions at which the 3x3x3 filter can be applied. Unlike traditional neural networks, where the weights and biases of neurons are independent of each other, in convolutional neural networks the neurons corresponding to one filter in a layer share the same weights and bias. In the case above, we shift the window 1 pixel at a time. We can also move the window by more than 1 pixel. This step size is called the stride.
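The number of filter positions along one dimension, and how it changes with the stride, can be sketched as follows. The helper name is illustrative and the formula assumes no padding.

```python
def conv_output_size(input_size, filter_size, stride=1):
    """Number of positions a filter can take along one dimension (no padding)."""
    return (input_size - filter_size) // stride + 1

side = conv_output_size(32, 3, stride=1)
print(side, side * side)                  # 30 900
print(conv_output_size(32, 3, stride=2))  # 15
```

With a stride of 2 the window lands on every second position, so the same 32-pixel side produces only 15 outputs instead of 30.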

As a rule, more than one filter is used in a convolutional layer. If we use 32 filters, we get a stack of activation maps of size 30x30x32.
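A vectorised sketch of applying 32 filters at once is given below. It assumes NumPy 1.20 or later for `sliding_window_view`, and uses random values in place of a real image and learned filters.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))       # stand-in for the input image
filters = rng.random((32, 3, 3, 3))   # 32 stand-in filters, each 3x3x3

# All 3x3x3 patches of the image: shape (30, 30, 1, 3, 3, 3).
# The singleton axis comes from the depth, where the window fits exactly once.
windows = sliding_window_view(image, (3, 3, 3))[:, :, 0]

# For every position (i, j) and filter f, sum patch * filter over height, width, channel.
activation_maps = np.einsum("ijhwc,fhwc->ijf", windows, filters)
print(activation_maps.shape)  # (30, 30, 32)
```

Each slice `activation_maps[:, :, f]` is the 30x30 activation map of filter `f`, so the depth of the output equals the number of filters.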

Note that all neurons associated with the same filter share the same weights and bias. Thus, the number of weights with 32 filters is 3x3x3x32 = 864, and the number of biases is 32. The picture shows 32 activation maps obtained by applying the convolutional kernels.
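The parameter count for this layer can be verified with a few lines of arithmetic. The comparison with unshared weights is an illustrative addition showing what the count would be if each neuron had its own copy of the filter.

```python
filter_height, filter_width, input_depth = 3, 3, 3
num_filters = 32

weights_per_filter = filter_height * filter_width * input_depth  # 27
total_weights = weights_per_filter * num_filters
total_biases = num_filters                                       # one bias per filter
print(total_weights, total_biases, total_weights + total_biases)  # 864 32 896

# Illustrative comparison: without weight sharing, each of the
# 30 * 30 * 32 neurons would need its own 27 weights.
neurons = 30 * 30 * num_filters
unshared_weights = neurons * weights_per_filter
print(unshared_weights)  # 777600
```

Weight sharing is what keeps the layer small: the parameter count depends on the filter size and the number of filters, not on the size of the image.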

As you can see, each convolution reduces the size of the result (in this case from 32 × 32 to 30 × 30). For convenience, the standard practice is to pad the boundary of the input layer with zeros so that the output has the same size as the input. In this example, if we add padding of size 1 on every side of the input layer, the size of the output layer will be 32x32x32, which simplifies the implementation.
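Zero padding of size 1 can be sketched with `numpy.pad`. The all-ones image is a placeholder chosen so that the added zeros are easy to see.

```python
import numpy as np

image = np.ones((32, 32, 3))  # placeholder image; ones make the zero border visible

# Pad height and width by 1 pixel on each side; leave the depth untouched.
padded = np.pad(image, ((1, 1), (1, 1), (0, 0)), mode="constant", constant_values=0)
print(padded.shape)  # (34, 34, 3)

filter_size = 3
out_h = padded.shape[0] - filter_size + 1
out_w = padded.shape[1] - filter_size + 1
print(out_h, out_w)  # 32 32
```

The padded input is 34 × 34, so a 3 × 3 filter fits at 32 positions along each axis and the output matches the original 32 × 32 size.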

Consider how convolutional neural networks analyse images.

In the figure above, the large squares indicate the area over which the convolution operation is performed, and the small squares indicate the output of the operation, which is just a number. The following points should be noted:

  • In the first layer, the square labelled 1 is obtained from the area of the image in which the ears are drawn.
  • In the second layer, the square labelled 2 is obtained from the larger square in the first layer. The numbers in this square are derived from several areas of the input image. In particular, the entire area around the cat’s left ear is responsible for the value in the square labelled 2.
  • Similarly, in the third layer, this cascade effect means that the square labelled 3 is obtained from a large area around the leg.

From the above, it can be said that the initial layers analyse smaller areas of the image and can therefore detect only simple features such as edges and corners. As we go deeper into the network, neurons receive information from larger parts of the image and from various other neurons. Thus, neurons in later layers can learn more complex features, such as eyes, legs, etc.
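The growth of the image area seen by one neuron (its receptive field) can be estimated with a small sketch. The layer configurations are assumptions for illustration, since the figure does not state kernel sizes or strides, and the function name is hypothetical.

```python
def receptive_fields(layers):
    """Receptive field size (in input pixels, along one axis) after each layer.

    `layers` is a list of (kernel_size, stride) pairs, input side first.
    """
    sizes = []
    field = 1  # before any layer, one value corresponds to one input pixel
    jump = 1   # distance in input pixels between neighbouring outputs
    for kernel_size, stride in layers:
        field += (kernel_size - 1) * jump
        jump *= stride
        sizes.append(field)
    return sizes

# Three stacked 3x3 convolutions with stride 1 (assumed configuration).
print(receptive_fields([(3, 1), (3, 1), (3, 1)]))  # [3, 5, 7]

# 3x3 convolution, 2x2 pooling with stride 2, then another 3x3 convolution.
print(receptive_fields([(3, 1), (2, 2), (3, 1)]))  # [3, 4, 8]
```

Each additional layer widens the input region that influences a single output, and layers with a stride greater than 1 make it grow faster.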

The pooling layer is usually placed immediately after a convolutional layer to reduce its spatial size (width and height only, not depth). This reduces the number of parameters in the following layers, and hence the amount of computation. Using fewer parameters also helps to limit overfitting. Overfitting is a condition in which a trained model performs well on the training data but poorly on test data.

The most common form of pooling is max pooling, in which we take a filter of a given size and apply the max operation to each part of the image that it covers.

The figure shows max pooling with a 2 × 2 filter and a stride of 2. The output is the maximum value in each 2 × 2 area, shown as circled numbers. The most common pooling operation uses a 2 × 2 filter with a stride of 2. This halves the width and height of the input, so only a quarter of the activations are kept.
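Max pooling with a 2 × 2 filter and a stride of 2 can be sketched in NumPy as follows. The 4 × 4 input values are made up for illustration and are not the numbers from the figure.

```python
import numpy as np

def max_pool2d(x, size=2, stride=2):
    """Take the maximum over each `size` x `size` window, moving by `stride`."""
    out_h = (x.shape[0] - size) // stride + 1
    out_w = (x.shape[1] - size) // stride + 1
    out = np.empty((out_h, out_w), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            out[i, j] = x[r:r + size, c:c + size].max()
    return out

x = np.array([[1, 1, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]])
pooled = max_pool2d(x)
print(pooled)
# [[6 8]
#  [3 4]]
```

The 4 × 4 input becomes 2 × 2: each output value is the largest of the four inputs in its window, and the windows do not overlap because the stride equals the filter size.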