In this blog, I will be comparing an SVM image classifier with a more modern CNN model. I’ll be using the wild-animals-images dataset throughout the blog.
Support Vector Machine (SVM) is one of the old-school methods used for image classification problems. It essentially plots the input features in an n-dimensional space for two classes and attempts to separate the feature data by finding the hyperplane that best divides the two classes.
Kernel Trick
Not all classes are linearly separable, which is why we typically transform the data into a higher-dimensional space in which it becomes linearly separable. A kernel provides this non-linear mapping without our having to compute the transformed coordinates explicitly.
Ex:
Linear Decision Boundary:
\[\vec{w} \cdot \vec{x} + b = 0\]
Non-linear Decision Boundary mapping 2d to 3d space:
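As a minimal sketch of such a mapping, assume the lifting map \(\phi(x_1, x_2) = (x_1, x_2, x_1^2 + x_2^2)\). This is a textbook choice picked for illustration, and the helper names `ring` and `lift` are hypothetical. Two concentric rings cannot be split by a line in 2d, but after lifting, a flat plane separates them:

```python
import numpy as np

rng = np.random.default_rng(0)

def ring(radius, n):
    """Sample n noisy 2-D points on a circle of the given radius."""
    angles = rng.uniform(0.0, 2.0 * np.pi, size=n)
    radii = radius + rng.normal(0.0, 0.05, size=n)
    return np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])

def lift(points):
    """Assumed feature map phi(x1, x2) = (x1, x2, x1^2 + x2^2)."""
    squared_norm = np.sum(points ** 2, axis=1, keepdims=True)
    return np.hstack([points, squared_norm])

inner = lift(ring(1.0, 100))   # class 0: small ring, third coordinate near 1
outer = lift(ring(3.0, 100))   # class 1: large ring, third coordinate near 9

# In 3-D the plane z = 4, i.e. w = (0, 0, 1) and b = -4, separates the classes.
w, b = np.array([0.0, 0.0, 1.0]), -4.0
print((inner @ w + b < 0).all(), (outer @ w + b > 0).all())
```

In practice a kernel SVM never builds the lifted coordinates; the kernel function evaluates the inner products in the lifted space directly.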
The best hyperplane (decision boundary) is the one with the largest margin between the two classes. The sample data points closest to the hyperplane are called the support vectors, and they are the points that determine the hyperplane. SVM attempts to find the hyperplane with the maximum distance to the support vectors (maximizing the margin).
If you would like to understand more about the learning problem for SVM, I suggest looking at [1] Andrew Nguyen’s notes on SVM.
When using SVM, one thing you really need to consider is the size of the feature input. We don’t want an absurdly large feature vector because training would take far too long. The number of samples from the dataset also has to be kept relatively small, because training time can grow with up to the third power of the sample count.
\[\text{Worst-case training time complexity: } \quad O(n_{\text{features}} \times n_{\text{samples}}^3)\]
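To get a feel for the cubic term, here is a small sketch that treats the expression above as a cost proxy and ignores constants. The function name and the feature and sample counts are made up for illustration.

```python
def relative_training_cost(n_features, n_samples, exponent=3):
    """Cost proxy n_features * n_samples**exponent (constant factors ignored)."""
    return n_features * n_samples ** exponent

base = relative_training_cost(10_000, 1_000)
doubled = relative_training_cost(10_000, 2_000)

# Doubling the number of samples multiplies the proxy by 2**3 = 8.
print(doubled / base)  # 8.0
```

Doubling the feature count, by contrast, only doubles the proxy, which is why the sample count is the first thing to trim.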
I resized the images to 64x64x3 and flattened them to obtain the feature vector, so the raw pixel values are the feature input. I have seen more advanced techniques used to generate feature values, such as Histogram of Oriented Gradients (HOG), but I will be using the simpler approach.
In the above code snippet, I did a 60/40 split to get a train and test set. For the training data, I did an additional split. The first part is for seeing which hyperparameters work best, and the second is used to continue training the best SVM on the rest of the samples. Recall that adding more samples during training will significantly impact training time.
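A minimal sketch of the flattening step, assuming the images are already resized and stacked into a NumPy array. Random pixels stand in for the real dataset, and the variable names are illustrative.

```python
import numpy as np

# Stand-in for a batch of images already resized to 64x64 RGB;
# random pixels are used here instead of the real dataset.
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(10, 64, 64, 3), dtype=np.uint8)

# Flatten each image into one row of raw pixel values.
features = images.reshape(len(images), -1).astype(np.float32)
print(features.shape)  # (10, 12288)
```

Each image therefore contributes 64 * 64 * 3 = 12288 features, which is already a large input for an SVM.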
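The two-stage split described here can be sketched with scikit-learn's `train_test_split`. The 50/50 ratio of the second split, the stratification, and the variable names are assumptions, since the post only states the 60/40 ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 120 flattened images spread evenly over six classes.
rng = np.random.default_rng(0)
X = rng.random((120, 64 * 64 * 3), dtype=np.float32)
y = np.arange(120) % 6

# 60/40 train/test split (stratification is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0)

# Second split of the training data: one part for the hyperparameter search,
# the rest for further training of the best model (the 50/50 ratio is assumed).
X_search, X_rest, y_search, y_rest = train_test_split(
    X_train, y_train, test_size=0.5, stratify=y_train, random_state=0)

print(len(X_train), len(X_test), len(X_search), len(X_rest))
```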
Creating SVM
Scikit-learn offers GridSearchCV, which is a useful tool for testing out different hyperparameters in one go. Here I take advantage of that and test out different gamma, kernel, and C values on the SVM. Parameter C trades off misclassification of training samples against a simpler, more general decision boundary. The higher the value, the harder the model tries to classify the training data correctly. Gamma controls how far the influence of a single training sample reaches, so a higher value means that other samples have to be closer in order to be affected.
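A minimal sketch of such a search on a small synthetic dataset. The candidate values in the grid and the `cv=3` setting are illustrative assumptions, not the grid used in the post.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Small synthetic stand-in for the flattened image features.
X, y = make_classification(n_samples=150, n_features=20, n_informative=5,
                           n_classes=3, random_state=0)

# Illustrative candidate values (assumed, not the post's actual grid).
param_grid = {
    "C": [0.1, 1, 10],
    "gamma": [0.0001, 0.001, 0.01],
    "kernel": ["rbf", "linear"],
}

# Every combination is fitted and scored with 3-fold cross-validation.
search = GridSearchCV(SVC(), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```

After fitting, `search.best_params_` holds the winning combination and `search.best_estimator_` is an SVM refitted with it on all the data passed to `fit`.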
Best Hyperparameters
Now we can figure out which parameters performed the best.
Output
Looks like an SVM with C=10, gamma=0.001, and a radial basis function (RBF) kernel performed the best on the dataset.
Fitting the remaining data
Metrics for best SVM
Using the best SVM, we evaluate the performance of the model. I’ll be using the predicted probabilities for the test data together with the actual labels to compute the AUC of the ROC curve.
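A minimal sketch of that evaluation on synthetic data. The one-vs-rest, macro-averaged AUC is an assumption about how the per-class scores are combined, and the dataset is a stand-in.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0)

# probability=True is required for predict_proba on an SVC.
model = SVC(C=10, gamma=0.001, kernel="rbf", probability=True, random_state=0)
model.fit(X_train, y_train)

proba = model.predict_proba(X_test)  # shape: (n_samples, n_classes)
# One-vs-rest AUC, macro-averaged over the classes (assumed aggregation).
auc = roc_auc_score(y_test, proba, multi_class="ovr")
print(round(auc, 3))
```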
Evaluation with ROC Curve
An ROC curve (receiver operating characteristic curve) graphs the performance of a classifier model at all classification thresholds using the True Positive Rate and False Positive Rate.
The area under the ROC curve (AUC) represents the degree of separability; in other words, it measures how well the model distinguishes the classes. It can be read as the probability that the model scores a randomly chosen positive sample higher than a randomly chosen negative one. I chose this metric because, with the SVM, we are comparing raw pixel values as opposed to features in the image, so a measure of how well the classes separate gives more insight.
Note that an AUC of 0.5 means the model is not distinguishing between the classes at all, and a value below 0.5 means it is performing worse than chance. I expected to see an AUC around 0.7-0.8 because of how I defined the feature vector for the model. Also, having six classes makes it more difficult to build a solid classifier with SVM because of how SVM handles multi-class classification.
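To make the threshold sweep concrete, here is a small NumPy sketch for the binary case. The function name `roc_points` and the toy labels and scores are made up for illustration.

```python
import numpy as np

def roc_points(y_true, scores):
    """Return (FPR, TPR) arrays with one point per threshold, for 0/1 labels."""
    # Start above every score so the curve begins at (0, 0).
    thresholds = np.r_[np.inf, np.sort(np.unique(scores))[::-1]]
    positives = (y_true == 1).sum()
    negatives = (y_true == 0).sum()
    tpr = [((scores >= t) & (y_true == 1)).sum() / positives for t in thresholds]
    fpr = [((scores >= t) & (y_true == 0)).sum() / negatives for t in thresholds]
    return np.array(fpr), np.array(tpr)

y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])
fpr, tpr = roc_points(y_true, scores)

# Area under the curve via the trapezoidal rule.
auc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)
print(auc)  # 0.75
```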
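As a quick illustration of the multi-class point: scikit-learn's `SVC` handles more than two classes with a one-vs-one scheme, training one binary SVM per pair of classes. The class names below are placeholders.

```python
from itertools import combinations

classes = [f"class_{i}" for i in range(6)]  # placeholder names for the six classes
pairs = list(combinations(classes, 2))      # one binary SVM per pair of classes
print(len(pairs))  # 15
```

With six classes that is 15 separate binary problems, each deciding between only two classes, and the final label comes from combining their votes.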
ResNet is a type of CNN that uses a residual learning framework to facilitate the training of substantially deeper networks. It hypothesizes that it is easier to optimize a residual mapping than the original, unreferenced mapping of the stacked layers.
I split the data 60/40 into a train set and a test set.
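In the notation of the ResNet paper, if \(\mathcal{H}(\mathbf{x})\) is the mapping we ultimately want a few stacked layers to fit, the layers are instead asked to fit the residual \(\mathcal{F}(\mathbf{x}) = \mathcal{H}(\mathbf{x}) - \mathbf{x}\), and the block outputs

\[\mathbf{y} = \mathcal{F}(\mathbf{x}, \{W_i\}) + \mathbf{x}\]

If the identity is already close to optimal, the layers only need to push \(\mathcal{F}\) toward zero, which is easier than learning the identity from scratch.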
ConvBlock
The paper follows every convolution with batch normalization, so I made the pair its own module.
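A minimal PyTorch sketch of such a module. The constructor arguments and the choice to drop the convolution bias are assumptions, not necessarily how the original module was written.

```python
import torch
from torch import nn

class ConvBlock(nn.Module):
    """Convolution followed by batch normalization."""

    def __init__(self, in_channels, out_channels, kernel_size, stride=1, padding=0):
        super().__init__()
        # The bias is omitted because BatchNorm2d adds its own learnable shift.
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size,
                              stride=stride, padding=padding, bias=False)
        self.bn = nn.BatchNorm2d(out_channels)

    def forward(self, x):
        return self.bn(self.conv(x))

# Example: a 7x7, stride-2 stem convolution applied to two 64x64 RGB images.
block = ConvBlock(3, 64, kernel_size=7, stride=2, padding=3)
out = block(torch.randn(2, 3, 64, 64))
print(out.shape)  # torch.Size([2, 64, 32, 32])
```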
Shortcut connections with Residual Block
The residual block uses the identity mapping, meaning the output of the shortcut is added to the output of the stacked layers. Downsampling is performed on the identity to produce the shortcut output, which is then added to the output of the stacked layers.
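Here is a hedged sketch of a bottleneck-style residual block in PyTorch. The helper `conv_bn`, the class layout, and the rule for when to project the shortcut are assumptions made for illustration.

```python
import torch
from torch import nn

def conv_bn(in_ch, out_ch, kernel_size, stride=1, padding=0):
    """Convolution followed by batch normalization."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size, stride=stride,
                  padding=padding, bias=False),
        nn.BatchNorm2d(out_ch),
    )

class ResidualBlock(nn.Module):
    """Bottleneck block: 1x1 -> 3x3 -> 1x1 convolutions plus a shortcut."""
    expansion = 4

    def __init__(self, in_ch, mid_ch, stride=1):
        super().__init__()
        out_ch = mid_ch * self.expansion
        self.stacked = nn.Sequential(
            conv_bn(in_ch, mid_ch, 1), nn.ReLU(inplace=True),
            conv_bn(mid_ch, mid_ch, 3, stride=stride, padding=1), nn.ReLU(inplace=True),
            conv_bn(mid_ch, out_ch, 1),
        )
        # Downsample the identity only when its shape differs from the stacked output.
        if stride != 1 or in_ch != out_ch:
            self.shortcut = conv_bn(in_ch, out_ch, 1, stride=stride)
        else:
            self.shortcut = nn.Identity()
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # Add the shortcut output to the stacked-layer output, then activate.
        return self.relu(self.stacked(x) + self.shortcut(x))

block = ResidualBlock(in_ch=64, mid_ch=64, stride=2)
out = block(torch.randn(2, 64, 16, 16))
print(out.shape)  # torch.Size([2, 256, 8, 8])
```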
Putting it all together
Following the table from above, I constructed the architecture for ResNet-152, which the paper reports as the best-performing variant.
Notice that I’m expanding the down_sample ConvBlock by 4 at each residual block and expanding each residual block by 2.
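One reading of that expansion scheme, written out as plain arithmetic. The stage depths are those the paper lists for ResNet-152; the variable names are illustrative.

```python
blocks_per_stage = [3, 8, 36, 3]   # ResNet-152 stage depths from the paper
base_widths = [64, 128, 256, 512]  # bottleneck width doubles at every stage
expansion = 4                      # each block outputs 4x its bottleneck width

output_channels = [w * expansion for w in base_widths]
# 3 convolutions per bottleneck block, plus the stem convolution and the final FC layer.
weighted_layers = 3 * sum(blocks_per_stage) + 2

print(output_channels)  # [256, 512, 1024, 2048]
print(weighted_layers)  # 152
```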
Training Loop
Testing the impact of normalization
Normalizing the image before training creates more stability for the optimizer during training. Theoretically, we should arrive at a better accuracy with the normalized dataset given the same number of epochs. I’ll be experimenting with both to see which one performs the best.
Notice that I normalized the dataset based on the values for each class rather than the values of the dataset as a whole.
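The sketch below shows only the standardization step itself, using per-channel statistics of whatever subset of images is passed in. The function name `standardize`, the `(N, H, W, C)` layout, and the small `eps` term are assumptions; which subset to pass (one class at a time, as described above, or the whole dataset) is left to the caller.

```python
import numpy as np

def standardize(images, eps=1e-8):
    """Zero-mean, unit-variance scaling per colour channel for an (N, H, W, C) batch."""
    mean = images.mean(axis=(0, 1, 2))
    std = images.std(axis=(0, 1, 2))
    return (images - mean) / (std + eps)  # eps guards against a constant channel

# Stand-in batch with pixel values already scaled to [0, 1].
rng = np.random.default_rng(0)
images = rng.random((8, 64, 64, 3))
normalized = standardize(images)

print(normalized.mean(axis=(0, 1, 2)).round(6))
print(normalized.std(axis=(0, 1, 2)).round(6))
```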
Showing the effects of normalization on an image.
Running the model on the dataset with and without normalization.
In the paper, SGD with weight_decay=0.0001 and momentum=0.9 is used for the optimizer. I used Binary Cross Entropy loss for training, although categorical cross-entropy is the more common criterion for single-label, multi-class image classification with CNNs.
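A minimal sketch of one training step with these settings in PyTorch. The learning rate, the tiny stand-in model, and the use of `BCEWithLogitsLoss` on one-hot targets are assumptions; only `momentum` and `weight_decay` come from the text above.

```python
import torch
from torch import nn

# Tiny stand-in model; the post trains a ResNet instead.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 6))

# momentum and weight_decay follow the text; lr=0.1 is an assumed value.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=0.0001)

# BCEWithLogitsLoss applies a sigmoid internally and expects float targets,
# so the integer labels are one-hot encoded (an assumption about the setup).
criterion = nn.BCEWithLogitsLoss()

images = torch.randn(4, 3, 64, 64)
labels = torch.tensor([0, 2, 5, 1])
targets = nn.functional.one_hot(labels, num_classes=6).float()

optimizer.zero_grad()                     # clear gradients from the previous step
loss = criterion(model(images), targets)  # forward pass and loss
loss.backward()                           # backpropagate
optimizer.step()                          # update the weights
print(loss.item())
```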
Evaluation
No Normalization
Output
With Normalization
Output
Looking at the test accuracy curves, you may notice that the dips without normalization are much larger than the dips with normalization. This shows that normalization does provide some stability, and the final accuracy was also slightly better, by roughly 2%. Contrasting the train and test accuracies, it appears that the model trained on normalized data generalizes better.
Closing Thoughts
There are a few things that can be compared when looking at the different types of models used for the image classification.
One thing I did not mention was the amount of time spent training each model. It took my machine roughly 40 minutes to train the SVM with the different hyperparameters, which made it easier to adjust and play around with. With ResNet, it took me nearly two hours to train both models, so I couldn’t really change much, as it would have taken far too long. Also, I’m sure I could have improved both models considerably if I had used more features for the SVM or more iterations for the CNN.
An advantage of the SVM was that it was easier to set up and toy around with, whereas I struggled quite a bit when implementing the ResNet model. The downside is that it doesn’t classify as well as ResNet.