Monthly Archives: February 2014

Currently reading…
Rebel Code: Linux & the Open Source Revolution (Glyn Moody)

Prescribing Lenses

I have posted here a program that prescribes different lenses based on the test vector provided. It uses a decision tree to make the classification. The training data set is in lenses.txt, while the decision-tree code does the majority of the work.

Do check the source at Bitbucket


Classification with Decision trees

Now that we have a decision tree constructed, we can use it for classification problems.
The classification method is recursive: it walks the dictionary and halts (and returns) when a leaf node is reached.
In such problems, the vast chunk of processing time goes into building the decision tree, so it would be very wasteful to rebuild the tree every time we classify an instance.
It therefore makes a lot of sense to serialise the tree, and we use the Python pickle module to do so.
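The recursive walk and the pickle round-trip can be sketched as follows. This is a minimal sketch, assuming the tree is the nested dictionary built in the previous post, keyed first by feature label and then by feature value; the function names are mine, not necessarily those in the actual code.

```python
import pickle

def classify(tree, feature_labels, test_vec):
    """Recursively walk the dictionary until a leaf (class label) is hit."""
    feature = next(iter(tree))          # label of the feature tested at this node
    branches = tree[feature]
    value = test_vec[feature_labels.index(feature)]
    subtree = branches[value]
    if isinstance(subtree, dict):       # internal node: recurse into the branch
        return classify(subtree, feature_labels, test_vec)
    return subtree                      # leaf node: this is the class label

def store_tree(tree, filename):
    """Serialise the tree so it need not be rebuilt on every run."""
    with open(filename, 'wb') as f:
        pickle.dump(tree, f)

def load_tree(filename):
    with open(filename, 'rb') as f:
        return pickle.load(f)
```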
The code is available via Bitbucket

Generating Decision Trees

Decision trees are a very useful method of visualising data. They also find a role in analysing algorithms (there's a very nice proof that comparison sorting is Ω(N lg N) based on decision trees).
Decision trees can also be seen as a solution to the classic game where one tries to guess an object a friend is thinking of using a maximum of 20 questions.

Here I have tried to generate a decision tree so that the information gain is maximised at each split.
  • Entropy => the measure of randomness/messiness/clutter in the data. We try to reduce this.
  • Information gain => the difference between the original entropy and the new entropy. We try to maximise this.
Shannon's entropy:
Information of xi: l(xi) = −log2 p(xi) (i.e. log to base 2)
Shannon entropy = Sum over all i { p(xi) * l(xi) } = −Sum over all i { p(xi) * log2 p(xi) }
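As a small sketch, the Shannon entropy of a dataset (assuming, as in the rest of this post, that each row's last element is its class label) can be computed like this:

```python
from math import log2
from collections import Counter

def shannon_entropy(dataset):
    """Entropy of the empirical class distribution (last column = class)."""
    counts = Counter(row[-1] for row in dataset)
    n = len(dataset)
    # H = -sum over classes of p * log2(p)
    return -sum((c / n) * log2(c / n) for c in counts.values())
```

A 50/50 split of two classes gives the maximum entropy of 1 bit, while a single-class dataset gives 0.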

We split our dataset, always choosing the split that maximises the information gain. This is done by the method chooseBestFeatureToSplit() in the code.

Since there is no way to foretell which feature would be the best choice to split on, we have to try every feature and return the one which minimises the entropy.

Once we are able to identify such a feature we can build the tree by recursively dividing the dataset.
Recursion base cases: all elements are in the same class (make a leaf node), or there are no features left to split on (take a majority vote).
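Putting the pieces together, here is a minimal, self-contained sketch of the split/choose/build loop. The function names are mine, not necessarily those in the Bitbucket code, and the entropy helper is repeated so the snippet runs on its own.

```python
from math import log2
from collections import Counter

def shannon_entropy(dataset):
    counts = Counter(row[-1] for row in dataset)
    n = len(dataset)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def split_dataset(dataset, axis, value):
    """Rows whose axis-th feature equals value, with that feature removed."""
    return [row[:axis] + row[axis + 1:] for row in dataset if row[axis] == value]

def choose_best_feature(dataset):
    """Try every feature; keep the one whose split maximises information gain."""
    base = shannon_entropy(dataset)
    best_gain, best_feature = 0.0, -1
    for axis in range(len(dataset[0]) - 1):       # last column is the class
        values = {row[axis] for row in dataset}
        new_entropy = sum(
            len(sub) / len(dataset) * shannon_entropy(sub)
            for sub in (split_dataset(dataset, axis, v) for v in values))
        if base - new_entropy > best_gain:
            best_gain, best_feature = base - new_entropy, axis
    return best_feature

def build_tree(dataset, labels):
    classes = [row[-1] for row in dataset]
    if classes.count(classes[0]) == len(classes):  # base case 1: one class
        return classes[0]
    if len(dataset[0]) == 1:                       # base case 2: majority vote
        return Counter(classes).most_common(1)[0][0]
    axis = choose_best_feature(dataset)
    label = labels[axis]
    rest = labels[:axis] + labels[axis + 1:]
    return {label: {v: build_tree(split_dataset(dataset, axis, v), rest)
                    for v in {row[axis] for row in dataset}}}
```

The returned tree is the same nested-dictionary shape the classifier consumes: feature label, then feature value, then subtree or class label.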

The code is available via Bitbucket

Handwriting recognition

K-nearest neighbors


I have recently become very interested in machine learning and have picked up a book to learn more about the topic.

The language used is Python, with NumPy for the numerical work. This suits me well: I am already fairly familiar with Python, so I do not have to spend a lot of time learning a new language.


Machine learning problems can be sub-divided into 2 categories:

  1. Supervised Learning – Given a labelled dataset, predict/foretell certain values

    1. Classification: Assign an input to 1 of a fixed number of classes/types

    2. Regression: Output a value in a given range

  2. Unsupervised Learning – No target values are given

    1. Clustering: Group your data into clusters

    2. Density Estimation: Estimate the probability distribution the data was drawn from

 The algorithm used for the OCR code is the k-nearest neighbors algorithm.

The steps involved are:

  1. Get distances of your required value from dataSet members

  2. Sort the distances in ascending order and choose the k smallest distances & corresponding dataset entries

  3. Take a majority vote among the classes of those k entries, and return the winning class.
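The three steps above can be sketched with NumPy. This is a minimal sketch; the names are illustrative, not necessarily those used in the actual script.

```python
import numpy as np
from collections import Counter

def knn_classify(test_vec, dataset, labels, k):
    """Classify test_vec against an (n, d) training matrix with n labels."""
    # 1. Euclidean distance from the test vector to every training vector
    distances = np.sqrt(((dataset - test_vec) ** 2).sum(axis=1))
    # 2. Indices of the k smallest distances
    nearest = distances.argsort()[:k]
    # 3. Majority vote among the labels of those k neighbours
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]
```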


Given below is a look into an application of the k-Nearest Neighbors classification algorithm – recognition of hand-written digits (0 – 9)

I was able to achieve an error rate of 0.0124, which is pretty good for a first machine learning algorithm.


The script attached does the following:

  1. Hand-written digit images are stored as 32×32 arrays whose contents are 0s and 1s

  2. Our classifier accepts an input vector of size 1024, so each image is flattened into one

  3. Images from the training dataset are used to build the dataSet matrix, which is passed to the classifier

  4. Images from the test dataset are used for testing.
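Step 2, flattening an image into a 1×1024 vector, might look like this. This is a sketch assuming the file format described in step 1: 32 lines of 32 characters, each '0' or '1'.

```python
import numpy as np

def img2vector(filename):
    """Read a 32x32 text image of 0s and 1s into a 1x1024 NumPy vector."""
    vec = np.zeros((1, 1024))
    with open(filename) as f:
        for row, line in enumerate(f):
            for col in range(32):
                vec[0, 32 * row + col] = int(line[col])
    return vec
```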


The script is uploaded on Bitbucket



