Let's say we have some input photos. A color photo is usually stored as three channels (red, green, and blue) stacked on top of each other, and together they produce the colors of the image. For the sake of simplicity, let's say the images have just 1 channel: brightness (grayscale). Each pixel is given a value ranging from dark to light, with darker pixels having lower values than lighter ones. This gives us our first input. It is then fed into the CNN, which has several layers (these are the hidden layers).
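To make the "stack of channels" idea concrete, here is a minimal sketch in NumPy. The tiny 2x2 image and the choice of a plain average to get one brightness channel are assumptions made for illustration; real libraries often use a weighted average instead.

```python
import numpy as np

# Hypothetical 2x2 RGB image: shape (height, width, 3), values from 0 to 255.
rgb = np.array([
    [[255, 255, 255], [0, 0, 0]],    # white pixel, black pixel
    [[255, 0, 0],     [0, 0, 255]],  # red pixel, blue pixel
], dtype=np.float64)

# Collapse the three channels into one by averaging them
# (an assumed, simple way to get a single brightness channel).
gray = rgb.mean(axis=2)

print(rgb.shape)   # three stacked channels
print(gray.shape)  # one channel
print(gray)        # higher numbers are lighter pixels
```

The white pixel ends up with the highest value and the black pixel with the lowest, which is the "dark is low, light is high" convention described above.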
As we can see, the image shows the digit 5 represented as a grid of numbers:
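Now using this input, we take a 3x3 matrix as shown in the picture and slide it across the image; this is called convolution. The 3x3 matrix is called a filter. At each position, the 3x3 patch of the input under the filter is multiplied element by element with the filter, and the products are added up to give one output value. Using different filters, we can extract different kinds of information from the image, which lets our model understand the image in terms of numbers.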
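Here is a small worked example of one such step, written in NumPy. The patch values are made up for illustration, and the filter is the upper-edge filter from the list in this article. Note that, strictly speaking, what deep learning libraries call convolution does not flip the filter, and this sketch follows that convention.

```python
import numpy as np

# Hypothetical 3x3 patch of the input: dark row on top, light rows below.
patch = np.array([
    [0, 0, 0],
    [9, 9, 9],
    [9, 9, 9],
])

# The upper-edge filter from this article.
kernel = np.array([
    [-1, -1, -1],
    [ 1,  1,  1],
    [ 0,  0,  0],
])

# Multiply element by element, then add everything up.
value = int(np.sum(patch * kernel))
print(value)  # 27
```

The large positive result says "there is an upper edge here": the row above is dark and the row below is light. On a flat patch where all nine values are equal, the same filter gives 0.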
Now there are multiple filters:
Filter 1: gets upper edges [[-1,-1,-1], [1,1,1], [0,0,0]]
Filter 2: gets left edges [[-1,1,0], [-1,1,0], [-1,1,0]]
Filter 3: gets lower edges [[0,0,0], [1,1,1], [-1,-1,-1]]
Filter 4: gets right edges [[0,1,-1], [0,1,-1], [0,1,-1]]
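The four filters above can be tried out with a short sketch like the one below. The helper name `convolve2d_valid` and the 6x6 test image (a bright 2x2 square on a dark background) are assumptions for illustration, not part of any library; the loop simply slides each filter over every position where it fully fits.

```python
import numpy as np

def convolve2d_valid(image, kernel):
    """Slide the kernel over the image (no padding, no kernel flip)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Element-wise multiply the patch with the kernel and sum.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Hypothetical test image: a bright 2x2 square on a dark background.
image = np.zeros((6, 6))
image[2:4, 2:4] = 1.0

filters = {
    "upper": np.array([[-1, -1, -1], [1, 1, 1], [0, 0, 0]]),
    "left":  np.array([[-1, 1, 0], [-1, 1, 0], [-1, 1, 0]]),
    "lower": np.array([[0, 0, 0], [1, 1, 1], [-1, -1, -1]]),
    "right": np.array([[0, 1, -1], [0, 1, -1], [0, 1, -1]]),
}

# One output map per filter.
responses = {name: convolve2d_valid(image, k) for name, k in filters.items()}

for name, response in responses.items():
    # Position of the strongest response for this filter.
    peak = np.unravel_index(np.argmax(response), response.shape)
    print(name, response.shape, peak)
```

Each filter responds most strongly on a different side of the square, which is how a single image gets turned into several maps that each highlight one kind of edge.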
After this, we can do something called max pooling, where we slide a small window (for example 2x2) over the output, take the maximum value in each window, and get a new, smaller layer.
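As a rough sketch of max pooling, the snippet below uses a 2x2 window with no overlap, which is a common choice but an assumption here. The helper name `max_pool_2x2` and the input values are made up for illustration.

```python
import numpy as np

def max_pool_2x2(feature_map):
    """Take the max of each non-overlapping 2x2 block."""
    h, w = feature_map.shape
    h2, w2 = h // 2, w // 2
    # Drop a trailing row or column if the size is odd.
    trimmed = feature_map[:h2 * 2, :w2 * 2]
    # Group pixels into 2x2 blocks, then take the max inside each block.
    return trimmed.reshape(h2, 2, w2, 2).max(axis=(1, 3))

# Hypothetical 4x4 output of a convolution.
feature_map = np.array([
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 2, 7, 8],
])

pooled = max_pool_2x2(feature_map)
print(pooled)
# [[4 2]
#  [2 8]]
```

The 4x4 input shrinks to 2x2, and each output value is the strongest response found in its block.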
Another kind of layer is the dense (fully connected) layer: after the max pool, the values are flattened and multiplied by weights, which start out random and are adjusted during training, and the result is passed through an activation function. As we can see, max pooling helps reduce dimensions while retaining the important information. It can also suppress some of the noise in the inputs!
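A dense layer can be sketched in a few lines of NumPy. Everything specific here is an assumption for illustration: the pooled input values, the choice of 3 output units, the small random starting weights, and ReLU as the activation function. In a real model the weights would then be updated during training.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# Hypothetical output of a max pooling step.
pooled = np.array([[4.0, 2.0],
                   [2.0, 8.0]])

# Flatten the 2x2 grid into a vector of 4 values.
x = pooled.flatten()

# Assumed layer size: 3 output units, so the weights have shape (4, 3).
n_outputs = 3
weights = rng.normal(0.0, 0.1, size=(x.size, n_outputs))  # random start
bias = np.zeros(n_outputs)

# Weighted sum for each output unit, then the ReLU activation
# (negative values are replaced by 0).
z = x @ weights + bias
activation = np.maximum(z, 0.0)

print(activation.shape)
print(activation)
```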
These are a few of the many layers that help a CNN identify images and give sight to machines. A CNN is made up of many interconnected layers, which together make it more accurate. On some image recognition tasks, CNNs can now match or even beat human accuracy, and they can be used to digitize content such as receipts into Excel sheets to track expenditure, or to convert important documents into an editable format with high accuracy.
The model can also report how confident it is in a prediction. CNNs also share parameters (the same filter is reused across the whole image), and they can detect important features without human intervention. On the other hand, a lot of data is needed to train the model, and training can require powerful hardware and a lot of computing power.
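One common way to get a confidence value is to pass the final scores of the network through a softmax function, which turns them into probabilities that add up to 1. The sketch below assumes this approach; the three raw scores are made up for illustration.

```python
import numpy as np

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)  # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum()

# Hypothetical raw scores from the last layer for 3 classes.
scores = np.array([2.0, 1.0, 0.1])

probs = softmax(scores)
predicted = int(np.argmax(probs))      # index of the most likely class
confidence = float(probs[predicted])   # probability given to that class

print(predicted, round(confidence, 3))
```

Here the model would pick class 0, and the confidence is the share of probability it assigns to that class.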
I would suggest that readers learn more from the Keras, PyTorch, and TensorFlow documentation!
Thank you By Dhruv Mahajan