Convolutional neural networks have played a critical role in solving image classification problems and are typically the first type of machine learning model considered for building image classifiers.
It’s important to understand the key component that drives convolutional neural networks and why they perform so well on image classification. The convolutional layer is the defining feature of this type of neural network. This layer is responsible for learning the features that belong to a class. It accomplishes this by applying a filter (kernel) to the receptive fields (local regions) of the input image, resulting in a feature map. Mathematically, this operation takes the element-wise product of each receptive field with the kernel, sums the resulting elements, and maps each sum to the corresponding position in the feature map.
Let A represent a grayscale image and B be a 2x2 kernel used for detecting vertical edges in A. If we perform convolution with a stride of 2 and no padding, then our receptive fields would be the four 2x2 corners of A shown below.
Apply the kernel to each receptive field and sum the elements to produce the resulting feature map.
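To make this concrete, here’s a small sketch of the operation in PyTorch. The values of A and the vertical-edge kernel B are made-up stand-ins for the ones pictured above.

```python
import torch
import torch.nn.functional as F

# Hypothetical 4x4 grayscale image A with a vertical edge between the first and second columns
A = torch.tensor([[0., 1., 1., 1.],
                  [0., 1., 1., 1.],
                  [0., 1., 1., 1.],
                  [0., 1., 1., 1.]])

# Hypothetical 2x2 kernel B that responds to a dark-to-bright transition
B = torch.tensor([[-1., 1.],
                  [-1., 1.]])

# conv2d expects (batch, channels, height, width) tensors
out = F.conv2d(A.view(1, 1, 4, 4), B.view(1, 1, 2, 2), stride=2)
print(out.view(2, 2))  # tensor([[2., 0.], [2., 0.]]) -- the edge shows up in the left column
```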
The size of the feature map \(n_{out}\) depends on the size of the input \(n_{in}\), the stride \(s\), the padding \(p\), and the kernel size \(k\). In the example, \(n_{in} = 4\) reduced to \(n_{out} = 2\), but typically we want \(n_{in} = n_{out}\), as you’ll see later.
\[n_{out}=\dfrac{n_{in} + 2p - k}{s} + 1\]
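Plugging the example above into the formula, with no padding:

\[n_{out}=\dfrac{4 + 2(0) - 2}{2} + 1 = 2\]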
This is a very basic example, but it highlights the intuition behind convolution. We’re essentially trying to exaggerate a feature we’re looking for in an image. In this case, the vertical edge became more pronounced in the output image.
In computer vision, this process is used for image filtering, and there are a variety of filters designed for specific purposes such as edge detection and noise removal. When it comes to image classification, it would be difficult to hand-craft filter values for picking out something like hair, which can come in many different shapes and colors. This is where convolution in neural networks excels: a convolutional layer lets the network learn the best values (weights) for a filter automatically. Additionally, we can apply as many filters as we want at each layer.
More on ConvNets
Another key component in convolutional neural networks is the pooling layer. Generally, the location of a feature is only significant when accounting for neighboring features. For example, the defining features of a face are eyes, nose, and mouth. In an image, these features need to be near each other to define a face. However, the location of the face within the image is insignificant. Max pooling is a common technique that can be applied periodically throughout the network to downsample (also known as subsampling), reducing the number of parameters in the network while being locally shift invariant. In max pooling, the largest value that lies within a filter is kept.
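As a quick illustration, here’s a sketch of max pooling on a single, made-up feature map using PyTorch’s MaxPool2d:

```python
import torch
import torch.nn as nn

# Made-up 4x4 feature map
x = torch.tensor([[[[1., 3., 2., 1.],
                    [4., 6., 5., 0.],
                    [2., 1., 9., 7.],
                    [0., 3., 8., 4.]]]])

pool = nn.MaxPool2d(kernel_size=2, stride=2)  # keep the largest value in each 2x2 region
print(pool(x))  # tensor([[[[6., 5.], [3., 9.]]]])
```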
For the sake of brevity, I will only explain the code segments that are relevant to the topic and will not re-explain code that has already been shown. The notebook is available here.
Data
Normalizing the image data reduces the chance of vanishing and exploding gradients for the optimizer during training. To do this, calculate the mean \(\mu\) and standard deviation \(\sigma\) of each channel over every image in the dataset, then subtract the channel mean from each value and divide by the channel standard deviation.
\[Z = \dfrac{x - \mu}{\sigma}\]
PyTorch does the normalization part for us.
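Roughly along these lines, using torchvision’s Normalize transform; the statistics below are placeholders for the values computed from the dataset:

```python
from torchvision import transforms

# Placeholder per-channel statistics standing in for the values computed from the dataset
mean = [0.485, 0.456, 0.406]
std = [0.229, 0.224, 0.225]

normalize = transforms.Normalize(mean=mean, std=std)  # applies (x - mean) / std per channel
```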
Since the number of images in the schooner class is dramatically smaller than in the other two classes in the dataset, I computed weights to apply to the loss criterion during training by subtracting each class’s proportion of the dataset from 1.
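A minimal sketch of what that weighting could look like; the class names and counts here are hypothetical:

```python
import torch

# Hypothetical image counts per class; schooner is heavily underrepresented
counts = {'class_a': 1300, 'class_b': 1250, 'schooner': 150}
total = sum(counts.values())

# Weight each class by 1 minus its proportion of the dataset,
# so the underrepresented class contributes more to the loss.
class_weights = torch.tensor([1 - c / total for c in counts.values()])
```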
Output
The next step is to apply the transformations (resizing each image to the same size and normalizing) and split the dataset. I originally planned on a 60/20/20 split for the train, validation, and test sets. However, I found that 20/20 validation and test sets would be too small, and since I was testing a variety of different networks in one go, I wouldn’t need to do much tuning. Ultimately, I decided on a 60/40 split and commented out the former.
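Roughly, that step might look like the sketch below; the ImageFolder dataset path and the placeholder channel statistics are assumptions:

```python
import torch
from torchvision import datasets, transforms

# Placeholder channel statistics standing in for the values computed earlier
mean, std = [0.5, 0.5, 0.5], [0.25, 0.25, 0.25]

transform = transforms.Compose([
    transforms.Resize((256, 256)),             # resize every image to the same dimensions
    transforms.ToTensor(),
    transforms.Normalize(mean=mean, std=std),  # normalize with the dataset statistics
])

dataset = datasets.ImageFolder('data/', transform=transform)  # hypothetical dataset path

# 60/40 train/test split (the 60/20/20 alternative was commented out)
n_train = int(0.6 * len(dataset))
train_set, test_set = torch.utils.data.random_split(
    dataset, [n_train, len(dataset) - n_train])
```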
My attempt at CNNs
The way this network is structured can be explained by gen_layers. It takes num_layers (the number of convolutional layers) and expansion, a lambda expression used for expanding the number of feature maps at each conv layer. Each conv layer is followed by a ReLU activation, BatchNorm, and MaxPool. The sequence is stored in self.net on construction.
classifier contains the fully connected layers that result in the model’s prediction and is generated on construction.
The forward pass runs the input through the sequence of conv layers in self.net, then passes the output to the sequence of fully connected layers in classifier to get the prediction.
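Here’s a sketch of how such a network could be put together, following the description above. The names gen_layers, num_layers, expansion, self.net, and classifier come from the text; the specific channel counts, image size, and classifier widths are assumptions.

```python
import torch
import torch.nn as nn

class VanillaCNN(nn.Module):
    def __init__(self, num_layers, expansion, in_channels=3, img_size=256, num_classes=3):
        super().__init__()
        self.net, out_maps = self.gen_layers(num_layers, expansion, in_channels)
        # Padded convs keep n_in = n_out, so only the max pools shrink the feature maps
        # (halving them each time). The flattened size is then
        # (number of feature maps) x (neurons per feature map).
        flat = out_maps * (img_size // 2 ** num_layers) ** 2
        self.classifier = nn.Sequential(
            nn.Linear(flat, 128),
            nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def gen_layers(self, num_layers, expansion, in_channels):
        """Build num_layers blocks of Conv -> ReLU -> BatchNorm -> MaxPool."""
        layers, channels = [], in_channels
        for i in range(num_layers):
            out = expansion(i)  # lambda deciding how many feature maps this layer produces
            layers += [
                nn.Conv2d(channels, out, kernel_size=3, padding=1),  # padding keeps n_in = n_out
                nn.ReLU(),
                nn.BatchNorm2d(out),
                nn.MaxPool2d(kernel_size=2, stride=2),
            ]
            channels = out
        return nn.Sequential(*layers), channels

    def forward(self, x):
        x = self.net(x)                    # sequence of conv layers
        x = torch.flatten(x, start_dim=1)  # flatten feature maps for the fully connected layers
        return self.classifier(x)

# Example: 4 conv layers, doubling the number of feature maps each layer starting at 16
model = VanillaCNN(num_layers=4, expansion=lambda i: 16 * 2 ** i)
```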
Notice
This expression represents the number of neurons in each feature map multiplied by the number of feature maps at the end of gen_layers. Recall that maintaining dimensionality of the feature maps \(n_{in}\) = \(n_{out}\) makes computing this value convenient.
Metrics for training
This function computes the percent accuracy and the number of correct classifications.
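A sketch of what such a function might look like; the exact signature in the notebook may differ:

```python
import torch

def score(outputs, labels):
    """Return the number of correct classifications and the percent accuracy."""
    preds = outputs.argmax(dim=1)                # predicted class = index of the largest logit
    correct = (preds == labels).sum().item()
    accuracy = 100.0 * correct / labels.size(0)
    return correct, accuracy
```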
Training loop
This is the standard training loop for a PyTorch model. The only noteworthy parts here are the criterion and the optimizer.
Notice
I used the class weights I calculated earlier with the Cross Entropy Loss function. For the optimizer, I used Stochastic Gradient Descent with a learning rate specific to each model.
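For reference, a minimal version of that loop, wrapped in a function so it’s self-contained; the argument names are placeholders:

```python
import torch
import torch.nn as nn

def train(model, loader, class_weights, lr, epochs):
    """Standard training loop: weighted cross entropy loss and SGD."""
    criterion = nn.CrossEntropyLoss(weight=class_weights)   # class weights computed earlier
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)  # learning rate chosen per model
    for epoch in range(epochs):
        model.train()
        for images, labels in loader:
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()   # backpropagate
            optimizer.step()  # update the weights
```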
Test out CNNs
In the above code segment, I’m evaluating different CNN architectures based on the number of convolutional layers and the number of feature maps produced by each layer.
Plotting losses and accuracies
Output
What I learned
Expanding the number of feature maps at each conv layer doesn’t seem to have any meaningful influence on performance. Depth does matter, though: the last three runs, each with four conv layers, all scored above 99.0%. I was satisfied with the results, with the best conv net scoring 99.699% correct.
I also tried implementing a few popular ConvNet architectures that interested me for fun.
Conv layers use a 3x3 kernel, with the exception of C
C uses a 1x1 kernel at the end of the last three sections
Padding is used on convolutions to maintain the dimensionality of the input
The max pool between each section of conv layers uses a 2x2 kernel and stride
Each max pool is followed by ReLU
The original VGG didn’t include batch norm operations, so that is something I added. Another thing to note is the image size: they used 224x224, while my images will remain 256x256. I will not be experimenting with A-LRN, as the paper concluded that it did not improve performance and only increased memory usage and computation time.
I didn’t include softmax on the output because I’m using cross entropy loss (CEL) as my loss function. PyTorch’s implementation of CEL already incorporates a (log-)softmax, so adding another softmax on the output squashes the logits and can lead to vanishing gradients. See torch.nn.CrossEntropyLoss.
I had originally included it to see what my results would look like; the loss struggled to decrease and the model wouldn’t improve.
Model
The model is very basic and structured similarly to my vanilla CNN architecture. The main distinction is the number of conv layers used. Also, a conv layer expands the number of feature maps less frequently, and only by a factor of 2 starting from 64.
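As a rough sketch of that structure, here’s how the conv sections could be generated from a configuration list, following the notes above (batch norm added, padded 3x3 convs, ReLU after each max pool). The configuration shown is an assumption in the spirit of VGG-A, not the exact table from the paper.

```python
import torch.nn as nn

# Assumed VGG-A-style configuration: numbers are output feature maps and 'M' marks a
# 2x2 max pool between sections. Feature maps double from 64, and only between sections.
config = [64, 'M', 128, 'M', 256, 256, 'M', 512, 512, 'M', 512, 512, 'M']

def make_conv_sections(config, in_channels=3):
    layers = []
    for v in config:
        if v == 'M':
            layers += [nn.MaxPool2d(kernel_size=2, stride=2), nn.ReLU()]  # ReLU after each max pool
        else:
            layers += [
                nn.Conv2d(in_channels, v, kernel_size=3, padding=1),  # padding maintains dimensionality
                nn.BatchNorm2d(v),                                    # batch norm added (not in the original VGG)
            ]
            in_channels = v
    return nn.Sequential(*layers)
```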
Evaluation
Looking at the accuracy on the validation set, there are massive dips for some of the models. If you look closely, each dip corresponds to a spike on the loss curve at the same epoch, which explains the drop in accuracy.
The changes I made to the original VGG architecture can practically all be applied here as well: the original GoogLeNet didn’t use batch norm, it used softmax, and it trained on 224x224 images.
This is the same sequence of operations following a convolution as in the other networks, but this time I made it into its own class so it can be applied repeatedly throughout the network.
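A sketch of such a block; the exact order of batch norm and activation in the notebook may differ:

```python
import torch.nn as nn

class ConvBlock(nn.Module):
    """Convolution followed by batch norm and ReLU, applied repeatedly throughout the network."""
    def __init__(self, in_channels, out_channels, **kwargs):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, **kwargs)
        self.bn = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.bn(self.conv(x)))
```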
Inception block
This block is responsible for deciding which kernel sizes contribute the most for a given input in the network. Since 3x3 and 5x5 filters can be computationally expensive, 1x1 filters are applied before them to reduce the number of input feature maps going into those conv layers.
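Here’s a sketch of the block, reusing the conv block above. The branch structure follows the standard GoogLeNet design; the channel arguments are left as parameters since the table isn’t reproduced here.

```python
import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    """Parallel 1x1, 3x3, and 5x5 branches plus a pooling branch, concatenated channel-wise."""
    def __init__(self, in_ch, ch1x1, ch3x3red, ch3x3, ch5x5red, ch5x5, pool_proj):
        super().__init__()
        self.branch1 = ConvBlock(in_ch, ch1x1, kernel_size=1)
        # 1x1 reductions shrink the number of input feature maps before the larger convolutions
        self.branch2 = nn.Sequential(
            ConvBlock(in_ch, ch3x3red, kernel_size=1),
            ConvBlock(ch3x3red, ch3x3, kernel_size=3, padding=1),
        )
        self.branch3 = nn.Sequential(
            ConvBlock(in_ch, ch5x5red, kernel_size=1),
            ConvBlock(ch5x5red, ch5x5, kernel_size=5, padding=2),
        )
        self.branch4 = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            ConvBlock(in_ch, pool_proj, kernel_size=1),
        )

    def forward(self, x):
        # each branch preserves height and width, so the outputs can be concatenated
        return torch.cat([self.branch1(x), self.branch2(x),
                          self.branch3(x), self.branch4(x)], dim=1)
```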
Model
I constructed the layers for the network using the same values in the same order from the table above.
Evaluation
Output
References
[1] Simonyan, Karen, and Andrew Zisserman. “Very Deep Convolutional Networks for Large-Scale Image Recognition.” ArXiv.org, 10 Apr. 2015, https://arxiv.org/abs/1409.1556.