Introduction
Decision tree learning is a method for predicting the most likely value of a target attribute from the known attribute values of the stored data instances. This learning method is among the most popular inductive inference algorithms and has been successfully applied to a broad range of tasks, such as assessing the credit risk of applicants and improving the loyalty of regular customers.
Information, Entropy and Information Gain
There are several software tools already available that implement various algorithms for constructing decision trees which optimally fit the input dataset. Nevertheless, it is useful to clarify the underlying principle.
Information
First, let's briefly discuss the concept of information. In simplified form we could say that information is a measure of uncertainty or surprise: the more we are surprised by a piece of news, the more information it carries. If a newspaper published a story about the morning sunrise, hardly any reader would be interested, because it holds no information - the sunrise is a fact and we know it will happen again tomorrow. On the other hand, if a story reports something no one expected, it is big news. An example is Michael Jackson's death: it happened during one of the worst recessions affecting the whole world, yet according to the statistics the news about Michael Jackson was far more interesting to readers than the economic facts of the time.
Entropy
In general, entropy is a measure of the uncertainty associated with a random variable. In the case of stored instances, entropy is the measure of the uncertainty associated with the selected attribute. For a set of instances S it is defined as

Entropy(S) = - Σ p(x) · log2 p(x)

where the sum runs over all possible outcomes x and p(x) is the probability mass function of outcome x.
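For example, a fair coin toss has entropy -(1/2)·log2(1/2) - (1/2)·log2(1/2) = 1 bit, the maximum uncertainty for two outcomes, while a coin that always lands heads has entropy 0, because there is no uncertainty at all.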
Information Gain
Information gain is the expected reduction in entropy caused by partitioning the examples according to a given attribute. In machine learning this concept can be used to define a preferred sequence of attributes to investigate in order to most rapidly narrow down the state of the target attribute. Such a sequence (which depends on the outcome of the investigation of the previous attributes at each stage) forms a decision tree. Usually an attribute with high information gain should be preferred to other attributes.
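Formally, for a set of examples S and an attribute A:

Gain(S, A) = Entropy(S) - Σ (|Sv| / |S|) · Entropy(Sv)

where the sum runs over all values v of attribute A and Sv is the subset of S for which A has value v.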
An Example
The process of decision tree construction is illustrated by the following example: there are 14 instances stored in the database, described by 6 attributes: day, outlook, temperature, humidity, wind and playTennis. Each instance describes the facts of the day and the action of the observed person (played or did not play tennis). Based on the given records we can assess which factors affected the person's decision about playing tennis.
For the root element the attribute with the highest information gain is selected. In the case of this example it is Outlook, because splitting on it removes the most uncertainty about the outcome.
After the root attribute is selected, the information gain is calculated for every possible pair of branch and remaining attribute, and the branches are then built according to the information gain values; the higher the information gain, the more reliable the branch.
Entropy(S) is the entropy of the target attribute PlayTennis: there are 14 instances stored in the dataset, in which the person decides to play tennis 9 times (9+) and not to play 5 times (5-).
The example above shows the calculation of the information gain of the attribute Wind: there are 8 instances with the value Weak; in the case of weak wind the person decides to play tennis 6 times and not to play 2 times [6+, 2-]. In the case of strong wind (6 instances) the person decides to play 3 times and not to play 3 times as well [3+, 3-]. Considering the given instances, the calculated information gain of the attribute Wind is 0.048.
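Written out step by step, using the counts above:

Entropy(S) = -(9/14)·log2(9/14) - (5/14)·log2(5/14) ≈ 0.940
Entropy(Weak) = -(6/8)·log2(6/8) - (2/8)·log2(2/8) ≈ 0.811
Entropy(Strong) = -(3/6)·log2(3/6) - (3/6)·log2(3/6) = 1.000
Gain(S, Wind) = 0.940 - (8/14)·0.811 - (6/14)·1.000 ≈ 0.048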
Information gain values of all attributes are calculated in the same way and the results are the following:
Gain(S, Outlook) = 0.246
Gain(S, Humidity) = 0.151
Gain(S, Wind) = 0.048
Gain(S, Temperature) = 0.029
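These values can be recomputed with a few lines of code. The snippet below is only an illustrative sketch (the class name InfoGain and the helper method entropy are made up for this post, not part of any library); it recalculates Entropy(S) and Gain(S, Wind) from the counts given in the example.

public class InfoGain {
    // Entropy of a two-class distribution given the positive/negative counts.
    static double entropy(double pos, double neg) {
        double total = pos + neg;
        double e = 0.0;
        for (double c : new double[] { pos, neg }) {
            if (c > 0) {
                double p = c / total;
                e -= p * Math.log(p) / Math.log(2); // log base 2
            }
        }
        return e;
    }

    public static void main(String[] args) {
        double entropyS = entropy(9, 5);        // PlayTennis: 9+, 5-
        double entropyWeak = entropy(6, 2);     // Wind = Weak: 6+, 2-
        double entropyStrong = entropy(3, 3);   // Wind = Strong: 3+, 3-
        double gainWind = entropyS - (8.0 / 14) * entropyWeak - (6.0 / 14) * entropyStrong;
        System.out.printf("Entropy(S)    = %.3f%n", entropyS);   // prints 0.940
        System.out.printf("Gain(S, Wind) = %.3f%n", gainWind);   // prints 0.048
    }
}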
Using WEKA
The example is taken from the book Data Mining: Practical Machine Learning Tools and Techniques, Second Edition (Morgan Kaufmann Series in Data Management Systems).
The .arff file
One possible option for importing the records into the WEKA tool is to write them in the .arff file format. In this example the file looks like the following:
@relation PlayTennis
@attribute day numeric
@attribute outlook {Sunny, Overcast, Rain}
@attribute temperature {Hot, Mild, Cool}
@attribute humidity {High, Normal}
@attribute wind {Weak, Strong}
@attribute playTennis {Yes, No}
@data
1,Sunny,Hot,High,Weak,No
2,Sunny,Hot,High,Strong,No
3,Overcast,Hot,High,Weak,Yes
4,Rain,Mild,High,Weak,Yes
5,Rain,Cool,Normal,Weak,Yes
6,Rain,Cool,Normal,Strong,No
7,Overcast,Cool,Normal,Strong,Yes
8,Sunny,Mild,High,Weak,No
9,Sunny,Cool,Normal,Weak,Yes
10,Rain,Mild,Normal,Weak,Yes
11,Sunny,Mild,Normal,Strong,Yes
12,Overcast,Mild,High,Strong,Yes
13,Overcast,Hot,Normal,Weak,Yes
14,Rain,Mild,High,Strong,No
The file can then be imported using WEKA explorer. When the above file is imported the interface looks like the following:
After the data is imported, a set of classification methods is available under the Classify tab. In this case the C4.5 algorithm has been chosen, which is implemented in WEKA under the name J48; it can be selected by clicking the Choose button and selecting trees -> J48.
The tree is then built by clicking Start and can be displayed by right-clicking the output in the result list and choosing the option "Visualize Tree".
Figure: The decision tree constructed by using the implemented C4.5 algorithm
The algorithm implemented in WEKA constructs a tree that is consistent with the information gain values calculated above: the attribute Outlook has the highest information gain and is therefore the root attribute. The attributes Humidity and Wind have lower information gain than Outlook and higher than Temperature, and are thus placed below Outlook. The information gain of a specific branch can be calculated in the same way as in the example above: for instance, if we were interested in the information gain of the branch (Outlook, Temperature), the attributes Outlook and Temperature would be placed into the equation for calculating the information gain.
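The same tree can also be built programmatically with the WEKA Java API rather than through the Explorer GUI. The sketch below is not from the original walkthrough; it assumes weka.jar is on the classpath and that the records above are saved as playTennis.arff (the file name is an assumption).

import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class PlayTennisTree {
    public static void main(String[] args) throws Exception {
        // Load the .arff file shown above (the file name is an assumption).
        Instances data = DataSource.read("playTennis.arff");
        // The last attribute (playTennis) is the class attribute.
        data.setClassIndex(data.numAttributes() - 1);

        // J48 is WEKA's implementation of the C4.5 algorithm.
        J48 tree = new J48();
        tree.buildClassifier(data);

        // Print the textual form of the decision tree.
        System.out.println(tree);
    }
}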
References
Data Mining and Knowledge Discovery; Prof. Dr. Nada Lavrač
Data Mining: Practical Machine Learning Tools and Techniques, Second Edition; Ian H. Witten, Eibe Frank
great tutorial!
weka rocks, used it for my msc thesis recently
nice one..... can u help me out in how to build classifiers(find accuracy) and the grouping of instances
Classifiers are the input variables in the model you use to predict the outcome. Besides the decision tree which is used here, there are other models, such as neural networks, support vector machines and linear regression.
After choosing a prediction model and the classifiers you split the data into training and evaluation records. The training records are used to determine the weight of each classifier. After building a weighted model you can test the accuracy on the evaluation data; simply check how close the predictions are to the right results.
If you are not satisfied with the accuracy you can select a different combination of classifiers. Grouping the classifiers means you reduce the dimension of the space in which you observe the data (each classifier could be represented as an axis), and this can be done by Principal Component Analysis (PCA). Some data mining software already has tools to select the best combination of classifiers and group them. You can find good video tutorials for Statistica here: http://www.youtube.com/user/StatSoft
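If you prefer to do this accuracy check in code rather than in the GUI, the sketch below shows one possible way using WEKA's Evaluation class; the 10-fold cross-validation setup and the file name are just assumptions for illustration, not part of the discussion above.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class AccuracyCheck {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("playTennis.arff"); // file name is an assumption
        data.setClassIndex(data.numAttributes() - 1);

        // 10-fold cross-validation: train on part of the data, evaluate on the
        // held-out part, repeat and average - one way to estimate accuracy.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new J48(), data, 10, new Random(1));

        System.out.println(eval.toSummaryString());
        System.out.printf("Accuracy: %.2f %%%n", eval.pctCorrect());
    }
}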
Hi,
Your tutorial is good. This example is working, but I was trying to do the 'Restaurant' example of Russell and Norvig's AI book (2nd ed page 656, 3rd ed page 700).
The tree is not produced in the same way as in the book; here the tree has only the root node 'patrons', i.e. the depth of the tree is only 1.
Please help me out with this.
Hi rosy! I haven't checked the book you mentioned. When trying to reproduce the results from the book, have you checked whether you used the same algorithm in WEKA as the one used in the book?
You can send me the example by email and I can try to solve it and send you back the solution.
Are there any tutorials for J48 using the weka library in java? Any code examples?
ReplyDeleteNice Work.... Great!!!!
But can we extract some rules from the decision tree in WEKA or not?