The classification problem that an SVM solves is a convex optimization problem. This guarantees that the algorithm reaches a global optimum, unlike neural networks, where training often converges to a local optimum (usually a very good one).
The core component of any SVM-based classifier is the kernel. A kernel is a function that measures the similarity between two examples, implicitly mapping an input into a (possibly higher-dimensional) feature space. There are different types of kernels, but the most widely used ones are the linear, polynomial and Gaussian (RBF) kernels.
A sample implementation of the Gaussian Kernel is as follows:
https://github.com/bhsaurabh/multiclassSVM/blob/master/OCR/gaussianKernel.m
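The linked repository is Octave; a rough Python equivalent of the Gaussian kernel (function name mine) might look like this:

```python
import numpy as np

def gaussian_kernel(x1, x2, sigma=1.0):
    """Gaussian (RBF) kernel: similarity of two feature vectors.

    Returns exp(-||x1 - x2||^2 / (2 * sigma^2)); 1.0 for identical
    inputs, approaching 0.0 as the vectors move apart.
    """
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    diff = x1 - x2
    return float(np.exp(-diff.dot(diff) / (2.0 * sigma ** 2)))
```

The parameter sigma controls how quickly the similarity falls off with distance.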
Before applying the kernel function to an input, it is generally a good idea to normalize the input features, e.g. with mean normalization and feature scaling.
I omitted feature scaling in the example above because several of my features had a standard deviation of 0, so the division produced NaNs for many of the scaled features.
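The zero-standard-deviation problem can be guarded against explicitly. A minimal sketch (function name mine, not from the repo):

```python
import numpy as np

def normalize_features(X):
    """Mean-normalize and scale each column (feature) of X.

    Columns with zero standard deviation (constant features) are left
    at zero after mean subtraction instead of producing NaNs.
    """
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0  # constant feature: avoid division by zero -> NaN
    return (X - mu) / sigma, mu, sigma
```

The returned mu and sigma should be reused to normalize cross-validation and test inputs the same way.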
TRAINING THE CLASSIFIERS
SVMs are essentially binary classifiers. However, in a problem where one has to predict digits, there are 10 classes (0–9) to classify into. A good way to approach this is the one-vs-all method: build 10 different SVM classifiers, each performing a binary classification. For instance, the 1st classifier predicts whether an input is a 1 or not, the 2nd tells whether it is a 2 or not, and so on.
For the digit classification problem I had 10 different SVM classifiers, each trained by the SMO algorithm (http://en.wikipedia.org/wiki/Sequential_minimal_optimization).
Input: 5000 training examples, each a 20×20 image of a handwritten digit. This set is divided into 60% training examples, 20% cross-validation examples and 20% test examples. The cross-validation examples are used to automatically select the SVM parameters that give the best predictions; the test set measures the final accuracy.
Each image is 20×20, which means there are 400 pixels (i.e. 400 features) per image. Since many of these pixels (the border pixels, for instance) have nearly the same value across all images, it is a good idea to perform dimensionality reduction using a compression algorithm like PCA (Principal Component Analysis). This gives a definite speed gain in classification.
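A minimal PCA sketch in Python via SVD (function name mine; assumes mean-centered projection as described above):

```python
import numpy as np

def pca_reduce(X, k):
    """Project mean-centered data onto its top-k principal components.

    X is (m examples x n features); returns the (m x k) reduced data
    and the (n x k) projection matrix for transforming new inputs.
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)
    # SVD of the centered data: rows of Vt are the principal directions
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    U_k = Vt[:k].T  # n x k projection matrix
    return Xc.dot(U_k), U_k
```

For the digit data this would shrink each 400-feature example down to k features while keeping most of the variance.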
The results array is created appropriately. For instance when the output in the training sample is 1, the array sent is [1,0,0,0,0,0,0,0,0,0], when it is a 2 the array sent is [0,1,0,0,0,0,0,0,0,0] and so on. All of these arrays are rows in a Results matrix.
When training the ith classifier, the ith column of this Results matrix is sent to the SVM, as shown in:
https://github.com/bhsaurabh/multiclassSVM/blob/master/OCR/makeClasses.m
https://github.com/bhsaurabh/multiclassSVM/blob/master/OCR/ocr.m
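The label-matrix construction from makeClasses.m might look like this as a Python sketch (the repo's code is Octave and 1-indexed; this version puts the 1 in column y[i], so it is 0-indexed):

```python
import numpy as np

def make_classes(y, num_labels=10):
    """Build the one-vs-all Results matrix described above.

    y holds the true class for each training example; row i of the
    result is the one-hot row for example i, so column c is the 0/1
    target vector used to train the classifier for class c.
    """
    y = np.asarray(y)
    results = np.zeros((y.size, num_labels), dtype=int)
    results[np.arange(y.size), y] = 1  # one 1 per row, at the class index
    return results
```

Training classifier c then just means slicing out `results[:, c]` as the binary target vector.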
The 10 SVMs are trained using this data.
PREDICTION:
To predict the class of any input sample, all 10 SVMs are run on it, and the result from the most confident SVM is chosen and displayed.
The confidence of an SVM is measured using the formula:
confidence = theta' * inputX, where theta is the vector of parameters learned by the SVM and inputX is the column vector of the input sample after the kernel function has been applied.
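A Python sketch of this prediction step (names mine; assumes one parameter row per classifier):

```python
import numpy as np

def predict(Theta, x):
    """Pick the most confident one-vs-all classifier.

    Theta is (num_classes x n): row c holds the trained parameters of
    classifier c. x is the (kernel-transformed) feature vector.
    The confidence of classifier c is Theta[c] . x; the predicted
    class is the argmax over all classifiers.
    """
    confidences = np.asarray(Theta).dot(np.asarray(x))
    return int(np.argmax(confidences)), confidences
```

Taking the argmax over confidences is what lets 10 binary classifiers jointly act as one 10-way classifier.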
The above algorithm gives an accuracy of 79.83% on the 2000 test samples. However, this is with automatic parameter selection turned off (it is a slow process); with it enabled, the accuracy should improve.
Finished Reading:
Rebel Code … by Glyn Moody
Reviews:
A great book if one is passionate about the beginnings of the open-source revolution. Very inspiring and definitely worth reading.
Currently reading…
Rebel Code: Linux & the Open Source Revolution (Glyn Moody)
Do check the source at Bitbucket
Entropy => the measure of randomness/messiness/clutter in the data. We try to reduce this.
Information gain => the difference between the original entropy and the new entropy after a split. We try to maximise this.
Shannon's entropy:
Information of xi: l(xi) = -log2(p(xi)) (i.e. log to base 2)
Shannon entropy = sum over all i of p(xi) * l(xi)
We try to make splits in our dataset, always aiming for the split that maximises our information gain. This is done by the method chooseBestFeatureToSplit() in the code. Since there is no way to foretell which feature will be the best choice to split on, we have to try every feature and return the one that minimises the entropy. Once we identify such a feature, we can build the tree by recursively dividing the dataset.
Recursion base case: all elements are in the same class (leaf node), or there is no feature left to split on (take a majority vote).
The code is available via Bitbucket
I have recently become very interested in machine learning and have picked up a book to learn more about the topic.
The language used is Python, and NumPy is used for scientific calculations. I find this great, as I am pretty familiar with Python, and this means that I do not really have to spend a lot of time learning a new language.
Machine learning problems can be sub-divided into 2 categories:
Supervised Learning – Given a dataset, predict/foretell certain values
Classification: Classify a problem into 1 of a fixed number of classes/types
Regression: Output a value in a given range
Unsupervised Learning – The dataset has no target values/labels here
Clustering: Arrange your data into groups
Density Estimation: Get the probability of your data belonging to a group
The algorithm used for the OCR code is the k-nearest neighbors algorithm.
The steps involved are:
Compute the distances of your input value from all dataSet members
Sort the distances in ascending order and choose the k smallest distances and their corresponding dataset entries
Take a majority vote among the classes of those k entries and return the winning class
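The steps above can be sketched as follows (function name mine; Euclidean distance assumed):

```python
import numpy as np
from collections import Counter

def knn_classify(x, data_set, labels, k=3):
    """Classify x by majority vote among its k nearest neighbours.

    data_set is an (m x n) matrix of training examples; labels holds
    their m classes.
    """
    diffs = np.asarray(data_set, dtype=float) - np.asarray(x, dtype=float)
    distances = np.sqrt((diffs ** 2).sum(axis=1))  # step 1: all distances
    nearest = distances.argsort()[:k]              # step 2: k smallest
    votes = Counter(labels[i] for i in nearest)    # step 3: majority vote
    return votes.most_common(1)[0][0]
```

Note that kNN has no training phase at all; all the work happens at classification time.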
Given below is a look into an application of the k-Nearest Neighbors classification algorithm – recognition of hand-written digits (0 – 9)
I was able to achieve an error rate of 0.0124, which is pretty good for a first machine learning algorithm.
The script attached does the following:
Hand-written images are stored in 32×32 arrays whose contents are 0s and 1s
Our classifier accepts an input vector of size 1024, so each image is flattened into such a vector
Images from the training dataset are used to create the dataSet matrix that is passed to the classifier
Images from the test dataset are used for testing
The script is uploaded on Bitbucket
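The flattening step might look like this (function name and text-file format are my assumptions: one image per file, 32 lines of 32 characters):

```python
import numpy as np

def img_to_vector(path):
    """Read a 32x32 text image of 0s and 1s into a 1x1024 row vector."""
    vec = np.zeros((1, 1024), dtype=int)
    with open(path) as fh:
        for row, line in enumerate(fh):
            for col in range(32):
                # each character becomes one feature of the flat vector
                vec[0, 32 * row + col] = int(line[col])
    return vec
```

Stacking these row vectors then yields the dataSet matrix the classifier consumes.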
HTCondor makes use of all available computational resources to get a job done. These resources can be the processors in a single machine or the machines of a distributed computing system. It allocates jobs to different machines based on rules and can transfer jobs from one machine to another.
Condor is generally used on servers, distributed networks and farms, but it can also be used to execute jobs in parallel on a single machine.
1. Install condor
sudo apt-get install condor
2. Run condor_status to see your processors:
condor_status -available (shows available processors)
condor_status -run (shows processors that are running a job)
This service provides every user with a dedicated Ubuntu VM with sudo access. I generally develop on this for the following reasons:
1. My work can be accessed from any place
2. Ubuntu provides apt, which is a superb package manager (combine this with the sudo access, and you know where I am going)
The internet connection at my home is poor, so I let Koding download and install packages onto the VM over its own connection, which is really quick.
By the way, this is not an advertisement.
I used the itertools module together with generators to implement the program…