Multi-class classification with Support Vector Machines

I have recently been looking into Support Vector Machines (SVMs) as a way of performing classification. SVMs are great algorithms for binary classification: they are fast and can come up with complex non-linear decision boundaries.

The classification problem that an SVM solves comes down to optimizing a convex function. This guarantees that the algorithm reaches a global optimum, as opposed to neural networks, where the algorithm often converges to a local optimum (usually a very good one).

The core component of any SVM-based classifier is the kernel. A kernel is a function that maps an input example into a new set of features, typically by measuring its similarity to other examples. There are different types of kernels, but the most widely used ones are:

  1. Linear kernel (no kernel) – performs no change on the input features; very similar to traditional logistic regression
  2. Gaussian kernel – great when there are many more training examples than features

A sample implementation of the Gaussian Kernel is as follows:

https://github.com/bhsaurabh/multiclassSVM/blob/master/OCR/gaussianKernel.m
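
For reference, a rough NumPy equivalent of such a kernel might look like this (the linked repository uses Octave; the function and parameter names below are purely illustrative):

    import numpy as np

    def gaussian_kernel(x1, x2, sigma=1.0):
        # Similarity between two examples: 1 when identical, approaching 0 as they move apart.
        diff = np.asarray(x1, dtype=float) - np.asarray(x2, dtype=float)
        return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

The sigma parameter controls how quickly the similarity falls off with distance.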

Before applying the kernel function to an input, it is generally a good idea to normalize the input features. The following techniques can be used:

  • Mean normalization (replace xj with xj – mean of feature j) -> makes the new mean 0
  • Feature scaling -> brings all features into a similar range. This can be done by dividing each feature by its standard deviation or by max(feature) – min(feature).

I have omitted feature scaling in the example mentioned because several of my features had a standard deviation of 0, which produced NaN values for many of the new features.
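
A minimal sketch of both techniques in NumPy, with a guard for the zero-variance features that caused the NaNs mentioned above, could look like this:

    import numpy as np

    def normalize_features(X):
        # Mean-normalize each column, then scale by its standard deviation.
        # Constant (zero-variance) columns are left unscaled to avoid division by zero.
        X = np.asarray(X, dtype=float)
        mu = X.mean(axis=0)
        sigma = X.std(axis=0)
        sigma[sigma == 0] = 1.0
        return (X - mu) / sigma, mu, sigma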

TRAINING THE CLASSIFIERS

Now, SVMs are essentially binary classifiers. However, in a problem where one has to predict digits, there are 10 classes (0–9) to classify into. A great way to approach this is the one-vs-all method: we build 10 different SVM classifiers, each of which performs binary classification. For instance, the 1st classifier predicts whether an input is a 1 or not, the 2nd classifier predicts whether an input is a 2 or not, and so on.

For the digit classification problem I had 10 different SVM classifiers, each trained by the SMO algorithm (http://en.wikipedia.org/wiki/Sequential_minimal_optimization).

Input: 5000 training examples, each a 20×20 image of a handwritten digit. This set is divided into 60% training examples, 20% cross-validation examples and 20% test examples. The cross-validation examples are used to automatically select the SVM parameters that give the best predictions; the test set measures the final accuracy.

Each image is 20×20 pixels, which means there are 400 pixels (and hence 400 features) per image. Since many of these pixels have the same value across all images, it is a good idea to perform dimensionality reduction using a compression algorithm like PCA (Principal Component Analysis). This provides a definite speed gain in classification.
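
As a sketch, such a reduction could be done with NumPy roughly as follows (the number of retained components is arbitrary here and would need to be tuned):

    import numpy as np

    def pca_reduce(X, k=50):
        # Project mean-centred data onto its top-k principal components.
        X = np.asarray(X, dtype=float)
        X_centred = X - X.mean(axis=0)
        # Rows of Vt are the principal directions, ordered by variance explained.
        _, _, Vt = np.linalg.svd(X_centred, full_matrices=False)
        return X_centred @ Vt[:k].T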

The results array is created accordingly. For instance, when the output for a training sample is 1, the array is [1,0,0,0,0,0,0,0,0,0]; when it is 2, the array is [0,1,0,0,0,0,0,0,0,0]; and so on. All of these arrays are rows in a Results matrix.

When training the ith classifier, the ith column of this Results matrix is passed to the SVM. This is shown in:

https://github.com/bhsaurabh/multiclassSVM/blob/master/OCR/makeClasses.m

https://github.com/bhsaurabh/multiclassSVM/blob/master/OCR/ocr.m

The 10 SVMs are trained using this data.
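
Roughly, the label construction and the one-vs-all training loop can be sketched as follows (train_svm is a hypothetical stand-in for the SMO-based trainer used in the linked Octave code):

    import numpy as np

    def make_classes(y, num_classes=10):
        # Build the Results matrix: row i is a one-hot vector for training label y[i].
        # Assumes labels run from 1 to num_classes, as described above.
        results = np.zeros((len(y), num_classes))
        for i, label in enumerate(y):
            results[i, label - 1] = 1
        return results

    # Hypothetical training loop: one binary SVM per class,
    # each trained on the corresponding column of the Results matrix.
    # classifiers = [train_svm(X, results[:, k]) for k in range(10)]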

PREDICTION:

To predict the class of any input sample, all 10 SVMs are used to classify it, and the result from the most confident SVM is chosen and displayed.

The confidence of an SVM is measured using the formula:

confidence = theta' * inputX, where theta is the parameter vector learned for that SVM and inputX is the column vector of the input sample after the kernel function has been applied.
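
As a sketch (with illustrative names), the prediction step boils down to picking the classifier with the largest value of this expression:

    import numpy as np

    def predict(thetas, x_kernel):
        # thetas: one parameter vector per one-vs-all SVM
        # x_kernel: the input sample after the kernel function has been applied
        confidences = [theta @ x_kernel for theta in thetas]
        return int(np.argmax(confidences))  # index of the most confident SVM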

The above algorithm provides an accuracy of 79.83% on the 2000 test samples. However, this is with automatic parameter selection turned off (as it is a slow process). With automatic parameter selection enabled, the accuracy should improve further.

Finished Reading:
Rebel Code … by Glyn Moody

Review:

A great book if one is passionate about the beginnings of the open source revolution. Very inspiring and definitely worth reading.


Prescribing Lenses

I have posted here a program that prescribes different lenses based on the test vector provided. It uses a decision tree to make the classification. The training data set is in lenses.txt, while the decision tree code does the majority of the work.

Do check the source at Bitbucket


Classification with Decision trees

Now that we have a decision tree constructed, we can use it for classification problems.
The classification method is a recursive one which parses the dictionary and halts (and returns) when a leaf node is encountered.
In such problems, the vast chunk of processing time goes into building the decision tree,
so it would be very inconvenient if the tree had to be rebuilt every time we classify an example.
It therefore makes a lot of sense to serialise the tree, and we use the Python pickle module to do so.
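
A minimal sketch of both ideas, assuming the tree is stored as nested dictionaries keyed by feature label (roughly the representation the Bitbucket code works with):

    import pickle

    def classify(tree, feature_labels, test_vec):
        # Walk the nested-dictionary tree until a leaf (non-dict) node is reached.
        feature = next(iter(tree))                      # feature this node splits on
        subtree = tree[feature][test_vec[feature_labels.index(feature)]]
        if isinstance(subtree, dict):                   # internal node: recurse
            return classify(subtree, feature_labels, test_vec)
        return subtree                                  # leaf node: the predicted class

    def store_tree(tree, filename):
        with open(filename, 'wb') as f:
            pickle.dump(tree, f)

    def load_tree(filename):
        with open(filename, 'rb') as f:
            return pickle.load(f)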
The code is available via Bitbucket

Generating Decision Trees

Decision trees are a very useful method of visualising data. They also find a role in analysing algorithms (there's a very nice proof that comparison-based sorting is Ω(N lg N), based on decision trees).
Decision trees can also be seen as a solution to the classic game where one tries to guess an object a friend is thinking of, using a maximum of 20 questions.

Here I have tried to generate a decision tree such that the information gain at each split is maximised.
  • Entropy => the measure of randomness/messiness/clutter in the data. We try to reduce this.
  • Information gain => the difference between the original entropy and the new entropy. We try to maximise this.
Shannon's Entropy:
Information of xi: l(xi) = -log2(p(xi)) (i.e. the log to base 2 of the probability of xi, negated)
Shannon Entropy = Sum over all i { p(xi) * l(xi) }
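
In code, the entropy of a dataset whose last column holds the class label might be computed like this (a sketch, not necessarily the exact function in the repository):

    from math import log
    from collections import Counter

    def shannon_entropy(dataset):
        # H = sum over classes of p(x_i) * (-log2 p(x_i))
        counts = Counter(row[-1] for row in dataset)
        total = len(dataset)
        return -sum((c / total) * log(c / total, 2) for c in counts.values())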

We make splits in our dataset, always aiming for the split that maximises our information gain. This is done by the method chooseBestFeatureToSplit() in the code.

Since there is no way to foretell which feature would be the best choice to split on, we have to try every feature and return the one which minimises the resulting entropy (equivalently, maximises the information gain).

Once we are able to identify such a feature, we can build the tree by recursively dividing the dataset.
Recursion base cases: all elements are in the same class (leaf node), or there is no feature left to split on (take a majority vote).
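
The feature selection could be sketched as follows (illustrative helper names; it reuses the shannon_entropy sketch from above, and split_dataset keeps the rows matching a feature value while dropping that column):

    def split_dataset(dataset, axis, value):
        # Rows where column `axis` equals `value`, with that column removed.
        return [row[:axis] + row[axis + 1:] for row in dataset if row[axis] == value]

    def choose_best_feature(dataset):
        # Try every feature and return the index giving the largest information gain.
        base_entropy = shannon_entropy(dataset)
        best_gain, best_feature = 0.0, -1
        for axis in range(len(dataset[0]) - 1):          # last column is the class label
            new_entropy = 0.0
            for value in set(row[axis] for row in dataset):
                subset = split_dataset(dataset, axis, value)
                new_entropy += (len(subset) / len(dataset)) * shannon_entropy(subset)
            gain = base_entropy - new_entropy
            if gain > best_gain:
                best_gain, best_feature = gain, axis
        return best_feature

The tree-building function then recursively applies this split until one of the base cases above is hit.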

The code is available via BitBucket

Handwriting recognition

K-nearest neighbors

 

I have recently become very interested in machine learning and have picked up a book to learn more about the topic.

The language used is Python, and NumPy is used for scientific calculations. I find this great, as I am fairly familiar with Python, which means I do not have to spend a lot of time learning a new language.

 

Machine learning problems can be sub-divided into 2 categories:

  1. Supervised Learning – given labelled data, predict/foretell certain values

    1. Classification: assign an input to 1 of a fixed number of classes/types

    2. Regression: output a value in a continuous range

  2. Unsupervised Learning – there are no labels or target values here

    1. Clustering: arrange your data into groups

    2. Density Estimation: get the probability of your data belonging to a group

 The algorithm used for the OCR code is the k-nearest neighbors algorithm.

The steps involved are:

  1. Get the distances of the input sample from all dataset members

  2. Sort the distances in ascending order and choose the k smallest distances and their corresponding dataset entries

  3. Look at the classes of these k entries and return the class that holds the majority among them (a sketch of these steps is given below)
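
A compact sketch of these three steps with NumPy (names are illustrative, not necessarily those in the script):

    import numpy as np
    from collections import Counter

    def knn_classify(input_vec, dataset, labels, k):
        distances = np.sqrt(((dataset - input_vec) ** 2).sum(axis=1))  # step 1
        nearest = np.argsort(distances)[:k]                            # step 2
        votes = Counter(labels[i] for i in nearest)                    # step 3
        return votes.most_common(1)[0][0]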

 

Given below is a look at an application of the k-nearest neighbors classification algorithm: recognition of hand-written digits (0–9).

I was able to achieve an error rate of 0.0124, which is pretty good for a first machine learning algorithm.

 

The script attached does the following:

  1. Hand-written images are stored in arrays of size 32×32. The arrays contain only 0s and 1s.

  2. Our classifier accepts an input vector of size 1024, so each image is flattened into such a vector (see the sketch after this list).

  3. Images from the training dataset are used to create the dataSet matrix that is passed to the classifier.

  4. Images from the test dataset are used for testing.
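
The 32×32-to-1024 conversion in step 2 can be sketched as follows, assuming each image is stored as a text file of 32 lines of 0s and 1s:

    import numpy as np

    def img_to_vector(filename):
        # Flatten a 32x32 text image of 0s and 1s into a 1x1024 row vector.
        vec = np.zeros((1, 1024))
        with open(filename) as f:
            for i, line in enumerate(f):
                for j in range(32):
                    vec[0, 32 * i + j] = int(line[j])
        return vec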

 

The script is uploaded on Bitbucket

 

 

 


Parallelism & HTCondor

I recently ran into HTCondor, which is an implementation of High Throughput Computing (HTC). It is developed and maintained by the University of Wisconsin – Madison.

HTCondor makes use of all available computational resources to get a job done. Computational resources could be the processors in a single machine or a distributed computing system. It allocates jobs to different machines based on rules and can transfer jobs from one machine to another.

Condor generally finds its applications in servers, distributed networks and farms. However, it can be used to execute jobs in parallel on a single machine too.

1. Install condor

sudo apt-get install condor

2. Run condor_status to see your processors

condor_status -available -> shows available procs

condor_status -run -> shows procs that are running a job

Condor takes a submit description file (SDF) as input. If the various tasks you are running have dependencies, you can use DAGMan. DAGMan uses a directed acyclic graph to inform Condor about the various dependencies. In all, it takes care of generating the submit file for you (given a graph).
A sample submit file can be found in the HTCondor web documentation.
My SDF file can be found below:

This job can be submitted to Condor, which will allocate it to a processor.
Let's say I submit the job three times:

Now condor_status -run will show the list of processors running a Condor job.

Koding for Online Linux

I use Linux for a lot of things and depend on the command line to a great extent. Very recently I started using http://www.koding.com

This service provides every user with a dedicated Ubuntu VM with sudo access. I generally develop on this for the following reasons:

1. My work can be accessed from any place

2. Ubuntu provides apt, which is a superb package manager (combine this with the sudo access, and you know where I am going)

The net connection at my home is bad, so installing modules directly on the Koding VM (over its own connection) works really quickly.

By the way, this is no advertisement :)

itertools with generators

Made a small script to find all words after a certain position in a file.

Used the itertools module with generators to implement the program…
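
One way to do this (a sketch; the actual script may differ) is to wrap the file in a word generator and slice it lazily with itertools.islice:

    from itertools import islice

    def words(filename):
        # Lazily yield words from a file, one at a time.
        with open(filename) as f:
            for line in f:
                for word in line.split():
                    yield word

    def words_after(filename, position):
        # All words from the given (0-based) word position onwards.
        return list(islice(words(filename), position, None))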

Link to code
