I had recently been looking into Support Vector Machines as a way of performing classification. SVMs are great algorithms for binary classification, as they are fast,and can come up with complex non-linear decision boundaries.
The classification problem that an SVM solves is optimizing a convex function. This guarantees that the algorithm reaches a global optima, as opposed to neural networks, where the algorithm often converges to local optima (usually very good local optima).
The core component of any SVM based classifier is the kernel. Kernels are simple functions that can map an input example into a set of numbers. There are different types of kernels but the most widely used ones are:
- Linear Kernel (no kernel) – perform no change on the input features…very similar to traditional Logistic regression
- Gaussian Kernel – This is great when there are a lot more training examples as opposed to number of feature
A sample implementation of the Gaussian Kernel is as follows:
Before applying the kernel function to an input, it is generally a good idea to normalize the input features. The following techniques can be implemented:
- Mean normalization (Replace xj with xj – (mean of feature)) -> Makes the new mean 0
- Feature scaling -> causes new features to range from 0 to 1. This can be done by dividing the features(s) by the standard deviation or by max(feature) – min(feature).
I have omitted feature scaling in the example mentioned because I had several features with a standard deviation of 0, giving me a NaN for many of my new features.
TRAINING THE CLASSIFIERS
Now SVMs are essentially great binary classifiers. However in a problem where one has to predict digits, there are 10 classes (0 – 9) to classify into. A great way to approach this problem is to use the one-vs-all method. Here we build 10 different SVM classifiers where each classifier performs binary classification. For instance, the 1st classifier predicts whether an input is a 1 or not, the 2nd classifier tells if an input is a 2 or not and so on.
For the digit classification problem I had 10 different SVM classifiers, each trained by the SMO algorithm (http://en.wikipedia.org/wiki/Sequential_minimal_optimization).
Input: 5000 training examples, each having a 20×20 image of a handwritten number. This set is divided into 60% training examples, 20% cross validation examples and 20% test examples. The cross validation examples are used to automatically derive parameters for best predictions with the SVMs. The test set tests the final accuracy.
Now each image is a 20×20 image, which means there are 400 pixels (or 400 features) per image. Since many of these pixels would be the same value across all images it is a good idea to perform dimensionality reduction using a compression algorithm like PCA (Principal Component Analysis). This will provide a definite speed gain in the classification.
The results array is created appropriately. For instance when the output in the training sample is 1, the array sent is [1,0,0,0,0,0,0,0,0,0], when it is a 2 the array sent is [0,1,0,0,0,0,0,0,0,0] and so on. All of these arrays are rows in a Results matrix.
When training the ith classifier, the ith column of this Results matrix is sent to the SVM. This is shown in :
The 10 SVMs are trained using this data.
To predict the outcome of any input sample provided, all SVMs are used to classify and the result from the most confident SVM is chosen and displayed.
The confidence of an SVM is measured using the formula:
confidence = (theta)’ * (inputX) where theta: set of parameters to train the SVM and inputX: column vector of an input sample onto which the kernel function has been applied
The above mentioned algorithm provides an accuracy of 79.83 % on the 2000 test samples. However this is when the automatic parameter selection is turned off (as it is a slow process). With automatic parameter selection enabled, the accuracy is bound to improve.