Neural networks are a subset of machine learning, and underneath them sits the concept of deep learning. I'm sure these are terms we've all heard at some point, but they might've sounded too daunting to delve into. Before getting into the intricacies of neural networks, let's first understand how they link to concepts we're already relatively familiar with.
Machine learning is just that: teaching a machine how to think and come to conclusions through data retention and pattern recognition. The process of "teaching" a machine can be approached in multiple ways. One of these is by building a neural network, a concept inspired by the networks of neurons in the human brain.
Essentially, this method of ML involves layers of information stored in units called neurons. There are three main kinds of layers: the input layer, the hidden layers (yes, plural) and the output layer. Each of these layers contains neurons carrying some value, and every neuron is connected to the neurons of the succeeding layer. Each connection carries a 'weight', a factor that the neuron's value is multiplied by. For each neuron, the weighted sum of all the neurons feeding into it is taken, and another value known as the 'bias' is added to this weighted sum. The result is then passed through an activation function to give the neuron its value. This process occurs layer by layer until the output layer is reached and some value is obtained. This value is essentially a probability. The neuron in the output layer with the highest probability is taken as the output for this particular round and compared with the true output.
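To make that concrete, here's a minimal sketch of a single layer's computation in Python with NumPy. The layer sizes and random values are entirely made up for illustration, and I'm using the sigmoid as the activation function since that's the one we'll meet properly later anyway.

```python
import numpy as np

def sigmoid(z):
    # squashes any real number into the range (0, 1)
    return 1 / (1 + np.exp(-z))

def layer_forward(activations, weights, biases):
    # weighted sum of the previous layer's activations, plus one bias per neuron,
    # pushed through the activation function
    return sigmoid(weights @ activations + biases)

# toy sizes: 4 neurons feeding into a layer of 3 neurons
rng = np.random.default_rng(0)
prev_activations = rng.random(4)     # values of the previous layer's neurons
weights = rng.normal(size=(3, 4))    # one row of weights per neuron in the next layer
biases = rng.normal(size=3)          # one bias per neuron in the next layer

print(layer_forward(prev_activations, weights, biases))  # three values, each in (0, 1)
```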
If the output doesn't match the true output, the weights and biases are adjusted after every round until the predictions line up with the true outputs. This is the training process. Once the model has been trained, it's ready to receive test data, on which it'll make predictions. We can deepen our understanding of this concept with an example.
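Here's a rough sketch of that adjust-after-every-round idea, shrunk down to a single neuron. The adjustment rule below is plain gradient descent on a squared error, which is my assumption about how the tweaking happens; the real thing works over the whole network at once, but the spirit is the same. All the numbers are invented.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# one neuron learning to map an input of 1.0 to a target output of 0.9,
# just to show the "adjust the weight and bias after every round" loop
x, target = 1.0, 0.9
w, b = 0.0, 0.0
learning_rate = 1.0

for _ in range(1000):
    prediction = sigmoid(w * x + b)
    error = prediction - target
    # how much a small change in w or b changes the squared error
    grad = error * prediction * (1 - prediction)
    w -= learning_rate * grad * x
    b -= learning_rate * grad

print(round(float(sigmoid(w * x + b)), 3))  # ends up very close to 0.9
```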
Assume that we've got an image, 15x15 px, which amounts to 225 pixels in total. Now suppose we have multiple images of some basic shapes, and we'd like the model to determine which shape is in each image.
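As far as the network is concerned, that image is just 225 numbers handed to the input layer. A quick sketch of that, with a completely made-up image:

```python
import numpy as np

# a made-up 15x15 grayscale image, values between 0 (black) and 1 (white)
image = np.zeros((15, 15))
image[12, 2:13] = 1.0   # a rough horizontal line near the bottom of the image

# the input layer just sees the image as a flat list of 225 pixel values
input_layer = image.flatten()
print(input_layer.shape)   # (225,)
```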
If the shapes under consideration are a square, a circle and a triangle, our output layer will consist of one neuron for each of them. The layer right before the output layer (the nth hidden layer) would contain the basic features that make up each of these shapes: a horizontal line, a vertical line, a curve, and diagonal lines. If the weighted sum coming from the horizontal-line and vertical-line features is the highest, it implies that the shape in question is a square. Similarly, if the weighted sum coming from the horizontal-line and diagonal-line features is the highest, the shape in question would be a triangle.
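In code, the "pick the shape whose neuron lit up the most" step is just a max over the output layer. The activation values here are invented purely for the example:

```python
# made-up output-layer activations for one image
output_layer = {"square": 0.81, "circle": 0.12, "triangle": 0.27}

# the neuron with the highest value wins this round
prediction = max(output_layer, key=output_layer.get)
print(prediction)   # square
```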
Now, the layer before the nth hidden layer [the (n-1)th hidden layer] would hold the components of the nth hidden layer's features, i.e., it would have those features (horizontal line, vertical line, etc.) broken down into smaller segments. So it would contain a fraction of the pixels that make up each of the lines in the nth hidden layer. To illustrate this visually, say a horizontal line is divided into halves, and that this horizontal line is the base of a square, meaning the bottom couple of rows of pixels are occupied by it. We've now divided this horizontal line into two halves, and similarly, every other feature (curve, diagonal line, etc.) into halves as well. These halves make up the (n-1)th hidden layer. When we take the weighted sum of these neurons, if the maximum values are those of the two halves of the horizontal line, we'll know that's the feature to be selected in the nth hidden layer.
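A tiny sketch of that "two halves add up to a full line" idea, with hand-picked activations and weights (none of this comes from a real trained network, it's just to show the arithmetic):

```python
# made-up activations from the (n-1)th hidden layer: how strongly each
# "half feature" was detected in the image
half_features = {
    "left half of horizontal line": 0.95,
    "right half of horizontal line": 0.90,
    "upper half of diagonal line": 0.10,
    "lower half of diagonal line": 0.05,
}

# the "full horizontal line" neuron in the nth hidden layer cares a lot about
# its two halves and (in this toy example) not at all about anything else
weights = {
    "left half of horizontal line": 1.0,
    "right half of horizontal line": 1.0,
    "upper half of diagonal line": 0.0,
    "lower half of diagonal line": 0.0,
}

weighted_sum = sum(weights[name] * value for name, value in half_features.items())
print(weighted_sum)   # a high value, so the full horizontal line is likely present
```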
This process is repeated for every single layer in the network, with everything adjusted dynamically as training goes on. Ergo, we can obtain very accurate results as the number of iterations increases.
Now that we've understood the basic concept, let's see how exactly this works at the numerical level. Say the image containing the shape has a black background, and the shape is hand-drawn in white. Every pixel is assigned a value between 0 and 1 (to align with the whole 'probability' concept, and for simplicity's sake), where 0 stands for a fully black pixel, 1 stands for a fully white pixel, and the decimal values in between are shades of grey. The values of fully white or nearly white pixels will obviously be greater than those of grey pixels. From this, we can conclude that grey pixels act like the borders of our features (such as the lines discussed earlier). The darker the grey of a pixel, the lower its value, and the higher its chance of being treated as a border region. This is essentially edge detection. The values of the neurons are called activations. It's also worth noting that weights may be positive or negative: the weights associated with white pixels/neurons will be positive, while the weights of the surrounding border pixels will be more negative.
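Here's what that positive-over-the-line, negative-around-it weighting looks like on a single strip of pixels. Both the pixel values and the weights are hand-picked for illustration, not learned:

```python
import numpy as np

# made-up values for one strip of pixels: the middle ones are the white line,
# the ones on either side are the darker "border" pixels around it
pixels = np.array([0.0, 0.2, 0.9, 1.0, 1.0, 0.9, 0.2, 0.0])

# hand-picked weights matching the idea above: positive over the white pixels,
# negative over the surrounding border pixels
weights = np.array([-1.0, -1.0, 1.0, 1.0, 1.0, 1.0, -1.0, -1.0])

print(np.dot(weights, pixels))   # a large positive sum means "line detected here"
```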
Let's look at some more math here. We now have our weighted sum. What we don't have is a guarantee that this sum will be a number between 0 and 1, and we need one, since we're treating the whole thing as a function of probability (i.e., we're essentially finding the probability of the given shape being one of our three output shapes, and the activation of every neuron is always a value between 0 and 1). To make sure the weighted sum lands in this range, we need to 'smoosh' it into the interval [0, 1]. To do this, we use the sigmoid function (which we used in logistic regression in an older task!). It behaves as follows: very negative inputs end up close to 0, very positive inputs end up close to 1, and that's that.
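You can see that squashing behaviour by plugging a few numbers in; the inputs below are just arbitrary sample values:

```python
import numpy as np

def sigmoid(z):
    # 'smooshes' any real number into the range (0, 1)
    return 1 / (1 + np.exp(-z))

# very negative weighted sums land near 0, very positive ones land near 1
for z in [-10, -2, 0, 2, 10]:
    print(z, round(float(sigmoid(z)), 4))
```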
I'd also like to recall that we included a 'bias' when we first took our weighted sum. Let's look at why we need this parameter in our already annoying-looking equation. We can establish from all the yapping above that the activation of a neuron multiplied by some weight basically gives us a measure of how positive the activation of a neuron in the next layer will be, i.e., it tells us how high the probability is of a certain feature existing in the next layer. For example, if the value of the neuron containing the upper half of a diagonal line is high, then in the next layer (which contains full lines), we can expect the value of the neuron containing a full diagonal line to be high too. This is convenient for simple networks. However, with networks containing similar features and more detail, we might want a neuron to activate only when the evidence is stronger. For our three-shape example, we may be happy if our weighted sum is some value above 0. But in other instances, where most weighted sums end up above 0 anyway, our threshold might be that we need a sum above 10. To achieve this, we add a bias (here, -10) to the vanilla weighted sum, so that a certain neuron in the next layer only gains meaningful activation when the weighted sum exceeds 10, i.e., when (weighted sum + bias) > 0.
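A quick numerical sketch of that thresholding effect, using a made-up weighted sum of 6 and a bias of -10 as in the example above:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

weighted_sum = 6.0   # a made-up, fairly positive weighted sum

# without a bias, anything comfortably above 0 already activates the neuron
print(round(float(sigmoid(weighted_sum)), 3))        # ~0.998, neuron fires

# with a bias of -10, the neuron only really activates once the sum clears 10
print(round(float(sigmoid(weighted_sum - 10)), 3))   # ~0.018, neuron stays quiet
```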
To sum this part up: the weight tells us how much (and in which direction) a neuron's value pushes the next layer, and the bias gives us a meaningful threshold that must be crossed before a neuron in the next layer actually activates. It's important to note that all these weighted sum and bias shenanigans are performed before plugging the result into the sigmoid function, which, incidentally, produces a logistic curve when plotted.
We'll now discuss a concept I should've talked about earlier but genuinely forgot to add: how many neurons, weights and biases exist in each layer, and between one layer and the next?