Introduction

Modern image classification models can achieve accuracies of over 90% on the ImageNet-1k benchmark dataset (which has 1,000 classes). This is very impressive, given that a model is basically just a bunch of numbers. Everything a classification model does can be reduced to math, even its input data. Images are just rows and columns of numbers that are displayed as colors (or colors that can be represented by numbers), and the math and intuition behind image classification models are built on top of neural networks. This article explains how neural networks become capable of “seeing”.

The cover image is from CNN Explainer, by the way.

Image classification models are based on a type of neural network architecture called the convolutional neural network (CNN). Why aren’t normal neural networks enough, though? This article covers why the normal neural network architecture is insufficient for image data and why a whole different architecture is needed. It starts with a brief overview of normal neural networks, then shows how image data is fed as input into neural networks, before going into the CNN architecture. Let’s get into it, shall we?

Neural Networks and how they learn

The basic neural network architecture rests on two main mathematical concepts: matrix multiplication for forward propagation and the chain rule for backpropagation. Its performance is measured with a loss function. A neural network learns by first guessing (propagating the input forward through the network); then, based on the loss (how far off the guess is, as calculated by the loss function), the model updates its weights by calculating how the loss changes as the weights change and propagating this backwards through the network.

This is a very high-level view of neural networks, and it might seem unclear to people who are new to them. I will lay out the 3 stages of training and explain them briefly:

Forward propagation: This is the stage where the neural network makes a guess using the input data. A single neuron multiplies each input by a weight, adds a bias to the weighted sum, and passes the result through an activation function: z = Wx + b and a = f(z). There are usually multiple neurons arranged in multiple ‘layers’, so this operation happens many times before reaching the output. Note: Forward propagation is also how the neural network makes predictions after being trained (I am calling its initial prediction a guess because it is made with randomly generated, or sometimes zero, weights).
Loss calculation: After making a guess, the neural network evaluates it by calculating how close the guess is to the actual value (the expected output). This is the job of the loss function. Note: As training progresses, the loss is expected to decrease; this is a sign that the model is actually learning and getting better at making predictions close to the expected output.
Backpropagation: The rate at which the loss changes as the parameters change, that is, the gradient of the loss with respect to the weights and biases of the neurons, is then calculated using the chain rule. Knowing how the loss changes with respect to each parameter, we can update the parameters in the direction that reduces the loss. This directly affects our predictions, making our “guesses” closer to the actual output.
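To make the three stages concrete, here is a minimal sketch of a training loop for a single neuron, written in Python with NumPy. The sigmoid activation, the squared-error loss, the learning rate, and the input values are all assumptions chosen for illustration; the stages themselves do not depend on them.

```python
import numpy as np

def sigmoid(z):
    """Activation function f(z); an assumed choice for this sketch."""
    return 1.0 / (1.0 + np.exp(-z))

def train_step(W, b, x, y, lr=0.1):
    """Run one forward pass, loss calculation, and parameter update."""
    # Forward propagation: z = Wx + b, a = f(z)
    z = W @ x + b
    a = sigmoid(z)
    # Loss calculation: squared error between the guess and the expected output
    loss = 0.5 * (a - y) ** 2
    # Backpropagation (chain rule): dL/dz = dL/da * da/dz
    dz = (a - y) * a * (1.0 - a)
    dW = dz * x   # dL/dW
    db = dz       # dL/db
    # Move the parameters in the direction that reduces the loss
    return W - lr * dW, b - lr * db, loss

rng = np.random.default_rng(0)
W = rng.normal(size=3)          # random initial weights (the first "guess")
b = 0.0
x = np.array([0.5, -1.0, 2.0])  # hypothetical input
y = 1.0                         # hypothetical expected output

losses = []
for _ in range(200):
    W, b, loss = train_step(W, b, x, y)
    losses.append(loss)

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```

The loss printed at the end should be smaller than the one at the start, which is the sign of learning described in the loss calculation stage.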

These are the basics of what happens in a neural network. For a deeper look, check out one of my other articles, where I explain neural networks and the math they use to learn, here.

This is a representation of a neural network (what they look like conceptually). Each circle is a neuron, and each row of neurons is a layer.

Source: GeeksforGeeks

How regular Neural networks handle images

Before we go into how we can feed an image as input into a neural network, we must understand what exactly an image is. An image is a matrix of pixels, each having a particular color. The color of a pixel is determined by its numerical value. The pixels in grayscale images have values ranging from 0 to 1, with 0 being a black pixel, 1 a white pixel, and 0.5 a mid-gray pixel. Here is a black and white image (each pixel is either 1 or 0).

This is a 6 by 6 black and white image of the number 1.

And this is one of the number 0.

As you can see, each pixel has a color that corresponds to a number; it is either black (0) or white (1).

Why Neural networks struggle with images

How do we feed this into a neural network?

Let us reduce the images first:

Now they are both just 3 by 3 images. We can feed one into a neural network by reading its pixels row by row (left to right, top to bottom) and arranging them into a single column, like so:

Now we have 9 values that we can feed into a neural network with 9 input neurons; each value goes into one neuron. This is how images are fed into neural networks. The process of turning a matrix into a single column is called flattening, and it is an important operation that is still used in CNNs.
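As a small illustration of flattening, here is a sketch in Python with NumPy. The pixel values are a made-up 3 by 3 version of the number 1 (the real reduced image may differ), and the variable names are only for this example.

```python
import numpy as np

# Hypothetical 3x3 black and white image of the number 1 (1 = white, 0 = black)
image = np.array([[0, 1, 0],
                  [0, 1, 0],
                  [0, 1, 0]])

# Flattening reads the matrix row by row and lays the pixels out in one line
flat = image.flatten()

# The same 9 values arranged as a single column, one value per input neuron
column = flat.reshape(-1, 1)

print(flat)          # [0 1 0 0 1 0 0 1 0]
print(column.shape)  # (9, 1)
```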

The input layer then sends the pixel values to the hidden layer. Forward propagation carries them through the network, and during training the model learns the right weights and biases to predict whether the number is a 1 or a 0.

For the original 6×6 image we would need 36 input neurons, and for a real-world image at a common model input size of 224×224 we would need 50,176. This is a lot, and it is also unnecessary and not very effective. Treating images as individual pixels may work for these simple black and white images, but it wouldn’t work well for actual real-world images. A dense neural network does not know that neighboring pixels are related; it just sees a long list of numbers, which is counterintuitive when dealing with image data.
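To get a feel for the numbers, here is a rough back-of-the-envelope calculation in Python. The hidden layer size of 128 neurons is an assumption picked for illustration, and `dense_layer_size` is a hypothetical helper, not part of any library.

```python
def dense_layer_size(height, width, channels, hidden_units):
    """Return (input neurons, weights plus biases) for one dense hidden layer."""
    inputs = height * width * channels
    # Every input connects to every hidden neuron, plus one bias per hidden neuron
    parameters = inputs * hidden_units + hidden_units
    return inputs, parameters

# 6x6 black and white image, assumed hidden layer of 128 neurons
small_inputs, small_params = dense_layer_size(6, 6, 1, 128)

# 224x224 single-channel image, same assumed hidden layer
large_inputs, large_params = dense_layer_size(224, 224, 1, 128)

print(small_inputs, small_params)  # 36 4736
print(large_inputs, large_params)  # 50176 6422656
```

Under these assumptions, the first hidden layer alone needs over 6 million parameters for a single 224×224 grayscale image, and none of them encode which pixels are neighbors.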

Real-world colored images also have rows and columns of pixels, but these pixels have 3 channels: red, green, and blue. They therefore have 3 values, each corresponding to the intensity of that particular color. Here is an example:

Each pixel has three values, R, G, and B, that define its color. If you look closer at the image above, you will notice that the pixels where the red value is significantly higher are ‘redder,’ and the same goes for green and blue. In reality, the values range from 0 to 255 (they are often rescaled to the 0 to 1 range before being fed into a network).
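In code, a color image is usually held as an array of shape height × width × 3, and each channel can be sliced out as its own 2D image. The sketch below uses Python with NumPy and a made-up 2 by 2 image; the pixel values are assumptions for illustration only.

```python
import numpy as np

# Hypothetical 2x2 RGB image: red, green, blue, and white pixels (values 0 to 255)
image = np.array([[[255, 0, 0], [0, 255, 0]],
                  [[0, 0, 255], [255, 255, 255]]], dtype=np.uint8)

print(image.shape)  # (2, 2, 3): height, width, channels

# Each channel is a 2D grid holding the intensity of one color
red = image[..., 0]
green = image[..., 1]
blue = image[..., 2]

# Rescale to the 0 to 1 range, as is commonly done before training
red_scaled = red / 255.0

print(red)
```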

As previously stated, each pixel has 3 values, each defining the intensity of one color. These colors (red, green, and blue) are called channels, and each channel can be viewed as an image of its own. Let us split the above image into its respective channels:

This might not make sense, so let us get an actual image then:

If you’re wondering what this is, it is The Persistence of Memory by Salvador Dalí.

Now let’s get its channels:

What happens when we feed this image into a normal neural network? The image is flattened: all spatial relationships between pixels are lost, and the channels are simply laid end to end in one long vector. It looks like gibberish (visual gibberish, I guess):

I think we have bullied normal neural networks enough; they are good for some applications, but not image recognition. Now let us look into CNNs. First, we will try to figure out how humans recognize images.

How CNNs handle images

Humans don’t recognize images by considering individual pixels; instead, we consider groups of pixels that add up to features like fur, ears, legs, paws, eyes, whiskers, and so on (in the context of recognizing cats and dogs). These features are themselves made of simpler building blocks, but our brains have abstracted away the low-level details, so we recognize features or even patterns at first glance without being aware of them. Machines have to learn to recognize images by explicitly extracting features, step by step, starting from these building blocks. What are these building blocks?

Let’s think about the most basic combination of pixels, an edge. Most features have edges that separate them from everything else (the background); we could have vertical edges, horizontal edges, or even curved edges. These edges can add up to corners, and corners and edges can add up to shapes. Certain color combinations (pixel groups) can also add up to textures; shapes and textures are what lead to features.

Think about it; let’s take an eye, for example. It has corners and edges that add up to an oval shape, then the variations of colors inside the oval shape for the cornea, pupil, and iris. This is what makes up an eye, and it can be reduced to textures, colors, and shapes.

Now, I know what you might be thinking: how does the machine ‘see’ these building blocks?

Let’s focus on edges, the most basic of the building blocks for features. We can extract edges (features in general, actually) from images by applying a filter to the image. A filter is simply a small matrix of numbers that is applied locally across an image; it performs an operation on the pixels and produces another image in which some parts are more pronounced than others. This output is known as a feature map. A filter (as the name implies) extracts a particular feature from an image, such as edges, corners, shapes, or textures.

More precisely, a filter is a small matrix of weights that slides across an image and performs a local computation at each step. At each location, the filter is multiplied element-wise with a small patch of the image and the products are summed (for 3-channel images, each channel is multiplied with its own slice of the filter and the results are added together). This produces a single number that represents how strongly the feature is present at that location.

This operation is called convolution. Now you know where the name comes from.
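The sliding operation can be sketched in a few lines of Python with NumPy. This is a simplified version that assumes a single-channel image, no padding, and a step size of one pixel; `convolve2d` is a name made up for this example, and real libraries use much faster implementations.

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide `kernel` over `image` and return the resulting feature map."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Take the patch under the filter, multiply element-wise, then sum
            patch = image[i:i + kh, j:j + kw]
            feature_map[i, j] = np.sum(patch * kernel)
    return feature_map

# Hypothetical 4x4 image with pixel values 0..15 and a 2x2 filter of ones
image = np.arange(16).reshape(4, 4)
kernel = np.ones((2, 2))
feature_map = convolve2d(image, kernel)

print(feature_map.shape)  # (3, 3)
print(feature_map[0, 0])  # 10.0, from 0 + 1 + 4 + 5
```

Notice that the feature map is slightly smaller than the image, because the filter can only be placed where it fits entirely inside the image.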

Edge detection with convolution

Consider the vertical edge detection filter:

K = [[-1, 0, 1],
     [-1, 0, 1],
     [-1, 0, 1]]

When this filter slides across an image, regions where the pixel values change sharply from left to right (vertical edges) produce values that are large in magnitude, while flat regions and purely horizontal edges produce values near zero. The result is a feature map that highlights vertical edges. Below is the feature map of ‘The Persistence of Memory’ for vertical edges.

So the feature map (far right) now highlights the regions where edges were detected. When convolution is applied repeatedly, layer after layer, the filters eventually detect high-level features like shapes, textures, body parts, and so on.
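The behavior of the vertical edge filter can be checked on a tiny synthetic image. The sketch below is in Python with NumPy; the 5 by 6 test image (dark left half, bright right half) and the helper `convolve2d` are assumptions made for this example, not the code used to produce the figures.

```python
import numpy as np

def convolve2d(image, kernel):
    """Sliding-window sum of products (no padding, step size of one pixel)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

K = np.array([[-1, 0, 1],
              [-1, 0, 1],
              [-1, 0, 1]])

# Left half black (0), right half white (1): one vertical edge in the middle
image = np.zeros((5, 6))
image[:, 3:] = 1.0
edge_map = convolve2d(image, K)
print(edge_map)  # every row is [0. 3. 3. 0.]

# The same picture rotated so the edge is horizontal
horizontal_image = image.T
flat_map = convolve2d(horizontal_image, K)
print(flat_map)  # all zeros
```

The filter responds only where the window straddles the vertical edge, and it stays silent on the horizontal one.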

How CNNs learn filters

CNN filters are not predefined; they are learned through the process of training the CNN. The values in a filter are treated as weights of the neural network, which means they can be optimized through gradient descent just like regular weights. CNNs can therefore learn filters that detect features particular to one class of images, making them capable of detecting patterns and recognizing images (cats, faces, and so on).

Convolution usually happens in layers, just like in normal neural networks. Early convolutional layers typically learn edges, color gradients, and simple textures, while deeper layers learn corners, object parts, and patterns. This happens automatically through gradient descent, as previously stated.
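To show that filter values really can be learned like any other weights, here is a toy sketch in Python with NumPy. It is not how a real CNN is trained (there is no classification task here): it assumes a random 8 by 8 image, uses a vertical edge filter to produce a target feature map, and then uses gradient descent on a mean squared error to nudge a filter that starts at zero toward producing that map. All names and numbers are illustrative assumptions.

```python
import numpy as np

def convolve2d(image, kernel):
    """Sliding-window sum of products (no padding, step size of one pixel)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

rng = np.random.default_rng(0)
image = rng.random((8, 8))                        # hypothetical input image
target_filter = np.array([[-1.0, 0.0, 1.0]] * 3)  # filter we hope to recover
target_map = convolve2d(image, target_filter)     # feature map to imitate

learned_filter = np.zeros((3, 3))  # the filter starts with no knowledge
learning_rate = 0.1
losses = []
for step in range(500):
    # Forward pass and loss: how far is our feature map from the target?
    error = convolve2d(image, learned_filter) - target_map
    losses.append(0.5 * np.mean(error ** 2))
    # Gradient of the loss with respect to each filter value:
    # sum of error[i, j] * image[i + u, j + v], averaged over output positions
    gradient = convolve2d(image, error) / error.size
    learned_filter -= learning_rate * gradient

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.6f}")
print(np.round(learned_filter, 2))
```

The loss shrinks as the filter values move toward a pattern that detects the same edges, without anyone writing that pattern in by hand.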

As helpful as convolutional layers are, their output needs more processing. The main goal is to reduce these feature maps to a small set of numerical values that can be flattened and then passed through a normal dense layer.

If we can reduce whether an image has whiskers or not to a number, we can then use a dense layer to predict whether that image is of a cat or not. So we want these feature maps to eventually have much smaller dimensions that we can use to make predictions. The dimensions of feature maps can be reduced by an operation known as pooling.

Pooling: Reducing Spatial complexity

There are different types of pooling, but the most common one is max pooling. It takes a region of pixels (e.g., a 2×2 region) and replaces it with the maximum value in that region. With non-overlapping 2×2 regions, the height and width of the feature map are each reduced to half of their original size. This is the feature map of the edge detection after max pooling:

The dimensions of the image are cut in half, and the edges are now more defined than they were in the original feature map.
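Max pooling is simple enough to sketch directly in Python with NumPy. This version assumes non-overlapping square regions and drops any leftover rows or columns that do not fill a whole region; `max_pool2d` and the 4 by 4 feature map are made up for this example.

```python
import numpy as np

def max_pool2d(feature_map, size=2):
    """Replace each non-overlapping size x size region with its maximum."""
    h_out = feature_map.shape[0] // size
    w_out = feature_map.shape[1] // size
    # Drop rows/columns that do not fill a complete region
    trimmed = feature_map[:h_out * size, :w_out * size]
    # Group the pixels into (row block, row in block, column block, column in block)
    blocks = trimmed.reshape(h_out, size, w_out, size)
    return blocks.max(axis=(1, 3))

# Hypothetical 4x4 feature map
feature_map = np.array([[1, 3, 2, 0],
                        [4, 2, 1, 1],
                        [0, 0, 5, 6],
                        [1, 2, 7, 8]])

pooled = max_pool2d(feature_map, size=2)
print(pooled)
# [[4 2]
#  [2 8]]
```

The 4 by 4 map becomes 2 by 2, and only the strongest response in each region survives.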

From Feature Maps to Predictions

After multiple convolution and pooling layers:

Feature maps are flattened into a vector
This vector is passed to dense layers

These dense layers:

Combine all extracted features
Learn global relationships
Produce the final classification

Convolutional layers extract features, while dense layers perform reasoning and decision-making.
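The hand-off from feature maps to a prediction can be sketched as follows, in Python with NumPy. The number and size of the feature maps, the two output classes, and the random (untrained) dense weights are all assumptions for illustration, so the probabilities it prints are meaningless until the weights are trained.

```python
import numpy as np

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

rng = np.random.default_rng(0)

# Hypothetical output of the last pooling layer: 4 feature maps of size 3x3
feature_maps = rng.random((4, 3, 3))

# Flatten all feature maps into one vector
features = feature_maps.reshape(-1)  # 4 * 3 * 3 = 36 values

# One dense layer with 2 outputs (e.g. "cat" and "not cat"), untrained weights
W = rng.normal(scale=0.1, size=(2, features.size))
b = np.zeros(2)
probabilities = softmax(W @ features + b)

print(features.shape)       # (36,)
print(probabilities.sum())  # sums to 1, up to floating-point rounding
```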

The Full CNN Architecture

A typical CNN consists of:

Input Layer
- Image tensor: H × W × C
Convolutional Layers
- Feature extraction
- Learnable filters
Activation Functions (ReLU)
- Introduce non-linearity
Pooling Layers
- Reduce spatial dimensions
Flattening Layer
- Converts feature maps into vectors
Fully Connected (Dense) Layers
- Perform operations on detected features (flattened feature maps) to make decisions
Output Layer
- Softmax (for multiclass classification) or sigmoid (for binary classification)
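Putting the layers in order, a single forward pass through a very small CNN might look like the sketch below (Python with NumPy). To keep it short it assumes a single-channel 8 by 8 input, one convolutional layer with two 3 by 3 filters, one 2 by 2 max pooling layer, and one dense output layer with three classes. The weights are random and untrained, and every function name here is hypothetical.

```python
import numpy as np

def conv2d(image, kernel):
    """Convolutional layer for one filter (no padding, step size of one pixel)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    """Activation function: keep positive values, zero out the rest."""
    return np.maximum(x, 0)

def max_pool2d(feature_map, size=2):
    """Pooling layer: maximum of each non-overlapping size x size region."""
    h_out = feature_map.shape[0] // size
    w_out = feature_map.shape[1] // size
    trimmed = feature_map[:h_out * size, :w_out * size]
    return trimmed.reshape(h_out, size, w_out, size).max(axis=(1, 3))

def softmax(logits):
    """Output layer: probabilities for multiclass classification."""
    exp = np.exp(logits - np.max(logits))
    return exp / exp.sum()

def cnn_forward(image, kernels, W, b):
    """Input -> convolution -> ReLU -> pooling -> flatten -> dense -> softmax."""
    maps = [max_pool2d(relu(conv2d(image, k))) for k in kernels]
    features = np.concatenate([m.reshape(-1) for m in maps])  # flattening
    return softmax(W @ features + b)                          # dense + output

rng = np.random.default_rng(0)
image = rng.random((8, 8))             # hypothetical single-channel input
kernels = rng.normal(size=(2, 3, 3))   # 2 learnable filters
# 8x8 -> conv 6x6 -> pool 3x3, times 2 filters = 18 features
W = rng.normal(scale=0.1, size=(3, 18))
b = np.zeros(3)

probabilities = cnn_forward(image, kernels, W, b)
print(probabilities)  # three class probabilities that sum to 1
```

Training would then adjust both the filters and the dense weights with backpropagation, exactly as described for regular neural networks.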

Here is a visualization of what happens in a CNN, based on the above steps:

How Convolutional Neural Networks Extract Features from Images
