Generating Decision Trees

Decision trees are a very useful method of classifying and visualising data. They also find a role in the analysis of algorithms (there is a very nice proof that comparison-based sorting is Ω(N lg N) which is based on decision trees).
Decision trees can also be seen as a solution to the classic Twenty Questions game, where one tries to guess the object a friend is thinking of using at most 20 yes/no questions.

Here I have tried to generate a decision tree built so that the information gain at each split is maximised:
  • Entropy => a measure of the randomness/messiness/clutter in the data. We try to reduce this.
  • Information gain => the difference between the original entropy and the new entropy after a split. We try to maximise this.
Shannon's entropy:
Information of x_i: l(x_i) = -log2(p(x_i)) (i.e. log to base 2 of the probability of x_i)
Shannon entropy = sum over all i { p(x_i) * l(x_i) } = -sum over all i { p(x_i) * log2(p(x_i)) }
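
As a rough sketch (not the actual BitBucket code), the entropy of a list of class labels can be computed in Python like this:

```python
from collections import Counter
from math import log2

def shannon_entropy(labels):
    """Shannon entropy H = -sum over i of p(x_i) * log2(p(x_i)) for a list of class labels."""
    total = len(labels)
    return sum(-(count / total) * log2(count / total)
               for count in Counter(labels).values())

# A pure set has zero entropy; an even 50/50 split has entropy of 1 bit.
print(shannon_entropy(["yes", "yes", "yes"]))       # 0.0
print(shannon_entropy(["yes", "no", "yes", "no"]))  # 1.0
```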

We make splits in our dataset, always choosing the split that maximises the information gain. This is done by the method chooseBestFeatureToSplit() in the code.

Since there is no way to foretell which feature would be the best choice to split on, we have to try every feature and pick the one that minimises the resulting entropy (i.e. maximises the information gain).
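
A hedged sketch of what such a method might look like follows. The name chooseBestFeatureToSplit comes from the post itself, but the dataset layout (a list of rows with the class label in the last column) and the split_dataset helper are my assumptions, not necessarily what the BitBucket code does:

```python
def split_dataset(dataset, axis, value):
    """Rows where column `axis` equals `value`, with that column removed."""
    return [row[:axis] + row[axis + 1:] for row in dataset if row[axis] == value]

def choose_best_feature_to_split(dataset):
    """Try every feature and return the index of the one with the largest information gain.
    Uses the shannon_entropy helper sketched above; assumes the class label is the last column."""
    base_entropy = shannon_entropy([row[-1] for row in dataset])
    best_gain, best_feature = 0.0, -1
    num_features = len(dataset[0]) - 1          # last column is the class label
    for i in range(num_features):
        new_entropy = 0.0
        for value in set(row[i] for row in dataset):
            subset = split_dataset(dataset, i, value)
            # entropy after the split, weighted by the size of each subset
            new_entropy += (len(subset) / len(dataset)) * shannon_entropy([row[-1] for row in subset])
        gain = base_entropy - new_entropy       # information gain for splitting on feature i
        if gain > best_gain:
            best_gain, best_feature = gain, i
    return best_feature
```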

Once we are able to identify such a feature we can build the tree by recursively dividing the dataset.
Recursion base cases: all elements belong to the same class (make a leaf node), or there are no features left to split on (take a majority vote).
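
Putting the base cases together, a minimal recursive builder might look like the following. This is again only a sketch reusing the helpers above and the same dataset-layout assumption; create_tree and feature_names are illustrative names, not necessarily those in the actual code:

```python
from collections import Counter

def create_tree(dataset, feature_names):
    """Recursively build the decision tree as nested dicts."""
    labels = [row[-1] for row in dataset]
    if labels.count(labels[0]) == len(labels):   # base case 1: all rows in one class -> leaf
        return labels[0]
    if len(dataset[0]) == 1:                     # base case 2: no features left -> majority vote
        return Counter(labels).most_common(1)[0][0]
    best = choose_best_feature_to_split(dataset)
    best_name = feature_names[best]
    tree = {best_name: {}}
    remaining = feature_names[:best] + feature_names[best + 1:]
    for value in set(row[best] for row in dataset):
        tree[best_name][value] = create_tree(split_dataset(dataset, best, value), remaining)
    return tree

# Example: a toy dataset where the class label ("yes"/"no") is the last column.
data = [[1, 1, "yes"], [1, 1, "yes"], [1, 0, "no"], [0, 1, "no"], [0, 1, "no"]]
print(create_tree(data, ["no surfacing", "flippers"]))
```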

The code is available via BitBucket
