This process illustrates how to fit decision trees using different impurity measures and
how to compare the resulting models on the
Congressional Voting Records
dataset. After the decision trees are built on the training and validation datasets,
the model comparison operator selects the best model,
using the validation dataset to make the decision. Finally, the quality of the classification
can be studied on the test dataset. The resulting model can then be used for scoring, that is,
for evaluating the test set or any dataset in which the value of the target variable is unknown.
Congressional Voting Records [UCI MLR]
In the process we set the following parameters during the partitioning step.
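The partitioning step can be illustrated with a minimal sketch. The 60/20/20 proportions, the fixed seed, and the function name `partition` are assumptions made for illustration, not the parameters used in the process; the only fact taken from the dataset is that it contains 435 records.

```python
import random

def partition(records, train=0.6, validation=0.2, seed=42):
    """Shuffle records and split them into train/validation/test lists.

    The default proportions are assumed values for illustration only.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_valid = round(len(shuffled) * validation)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_valid],
            shuffled[n_train + n_valid:])

# Record indices stand in for the 435 voting records.
train_set, valid_set, test_set = partition(range(435))
print(len(train_set), len(valid_set), len(test_set))
```

The trees are fitted on the training part, the validation part drives model selection, and the test part is held back for the final assessment.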
Using the chi-square impurity measure we obtain the following decision tree.
Using the entropy impurity measure we obtain the following decision tree.
Using the Gini impurity measure we obtain the following decision tree.
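To make the three splitting criteria concrete, the following sketch computes them for a single candidate split. It uses the standard textbook definitions of the Gini index, entropy, and the Pearson chi-square statistic; the class counts are hypothetical, and the tool's own implementation may differ in detail (for example, it may work with p-values rather than the raw statistic).

```python
from math import log2

def gini(counts):
    """Gini index of a node given its per-class record counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def entropy(counts):
    """Entropy (in bits) of a node given its per-class record counts."""
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

def impurity_reduction(parent, children, impurity):
    """Parent impurity minus the size-weighted impurity of the branches."""
    n = sum(parent)
    weighted = sum(sum(ch) / n * impurity(ch) for ch in children)
    return impurity(parent) - weighted

def chi_square(children):
    """Pearson chi-square statistic of the branch-by-class count table."""
    row_totals = [sum(ch) for ch in children]
    col_totals = [sum(col) for col in zip(*children)]
    n = sum(row_totals)
    stat = 0.0
    for ch, r in zip(children, row_totals):
        for observed, c in zip(ch, col_totals):
            expected = r * c / n
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical split: 100 records, two classes, two branches.
parent = [50, 50]
children = [[40, 10], [10, 40]]
print(impurity_reduction(parent, children, gini))
print(impurity_reduction(parent, children, entropy))
print(chi_square(children))
```

A tree-growing algorithm evaluates such a quantity for every candidate split and keeps the split with the largest impurity reduction or the most significant chi-square value.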
The first of the three resulting decision trees is the simplest, and its fit is the worst. The other two are fairly similar: they use the same input variables in their splits, and only the splitting values differ slightly. The three decision trees can be compared in many ways using graphical tools and statistics. For example, the following chart shows that, judged by the cumulative response curve, the decision tree given by the Gini index is the best model if only the first few deciles are of interest.
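A cumulative response curve can be computed along the following lines. This is a sketch under the usual definition (records are ranked by predicted score, and the percentage of actual responders is reported among the top 10%, 20%, and so on); the function name, the scores, and the labels are hypothetical.

```python
def cumulative_response(scores, labels, bins=10):
    """Cumulative % response among the top-scored records, one value per bin.

    Assumes at least `bins` records, so that every prefix is non-empty.
    """
    # Rank the 0/1 labels by descending score.
    ranked = [y for _, y in sorted(zip(scores, labels), key=lambda p: -p[0])]
    n = len(ranked)
    curve = []
    for k in range(1, bins + 1):
        top = ranked[: round(n * k / bins)]
        curve.append(100.0 * sum(top) / len(top))
    return curve

# Hypothetical scores and target values (1 = event of interest).
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
curve = cumulative_response(scores, labels, bins=5)
print(curve)
```

A model that is strong "in the first few deciles" keeps this curve high at its left end; the last value always equals the overall response rate.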
To assess the efficiency of the classifier, the numbers of correctly and incorrectly classified records can be obtained for each of the two parties and displayed as a bar chart.
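The counts behind such a bar chart can be tallied as in this small sketch. The actual and predicted labels are hypothetical, and `classification_counts` is an illustrative helper, not part of any tool.

```python
from collections import Counter

def classification_counts(actual, predicted):
    """Count correct and incorrect classifications per actual class."""
    counts = Counter()
    for a, p in zip(actual, predicted):
        counts[(a, "correct" if a == p else "incorrect")] += 1
    return counts

# Hypothetical actual and predicted party labels.
actual = ["democrat", "democrat", "republican", "republican", "democrat"]
predicted = ["democrat", "republican", "republican", "republican", "democrat"]
counts = classification_counts(actual, predicted)
for key in sorted(counts):
    print(key, counts[key])
```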
A detailed comparison is possible using the response curve, the lift curve, and their variants. The following figure shows the (non-cumulative) response curve for the three datasets and the three types of models. It also shows how each response curve relates to the best possible curve and to the baseline. The bottom right panel shows that the decision tree based on the Gini index is close to the optimal one on the test dataset.
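The relation between a model's response curve, the best possible curve, and the baseline can be sketched as follows. The sketch assumes the common definitions: the best possible model ranks every responder first, the baseline is the overall response rate, and lift is the ratio of the model's response to the baseline. The data and function names are hypothetical.

```python
def response_by_bin(ranked_labels, bins=10):
    """Non-cumulative % response in each bin of an already ranked 0/1 list.

    Assumes at least `bins` records, so that no bin is empty.
    """
    n = len(ranked_labels)
    edges = [round(n * k / bins) for k in range(bins + 1)]
    return [100.0 * sum(ranked_labels[a:b]) / (b - a)
            for a, b in zip(edges, edges[1:])]

def response_curves(scores, labels, bins=10):
    """Model, best possible, and baseline response per bin, plus lift."""
    by_score = [y for _, y in sorted(zip(scores, labels), key=lambda p: -p[0])]
    best = sorted(labels, reverse=True)            # all responders ranked first
    baseline = 100.0 * sum(labels) / len(labels)   # overall response rate
    model = response_by_bin(by_score, bins)
    return {
        "model": model,
        "best": response_by_bin(best, bins),
        "baseline": [baseline] * bins,
        "lift": [m / baseline for m in model],
    }

# Hypothetical scores and target values (1 = event of interest).
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
curves = response_curves(scores, labels, bins=5)
for name, values in curves.items():
    print(name, values)
```

The closer the model curve lies to the best possible curve, and the further above the baseline in the first bins, the better the ranking produced by the model.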
Another possibility is to examine the score distributions. The model fits well if the red and the blue lines are mirror images of each other and their gradients are steep. By this indicator, the entropy-based decision tree is the best.
The three trees can be improved further by changing the significance level. As a consequence, each decision tree is built differently from the original one, and so the numbers of correctly and incorrectly classified records and their distribution change. The performance of the resulting models can be read from the following figure, in which the misclassification rate is underlined as one of the most important indicators.
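Comparing candidate models by their misclassification rate can be sketched as below. The model names `tree_a`, `tree_b`, `tree_c` and all labels are hypothetical placeholders; they do not reproduce the results in the figure.

```python
def misclassification_rate(actual, predicted):
    """Fraction of records whose predicted class differs from the actual one."""
    wrong = sum(a != p for a, p in zip(actual, predicted))
    return wrong / len(actual)

def select_best(actual, predictions_by_model):
    """Return the model with the lowest misclassification rate, plus all rates."""
    rates = {name: misclassification_rate(actual, pred)
             for name, pred in predictions_by_model.items()}
    best = min(rates, key=rates.get)  # on ties, the first model listed wins
    return best, rates

# Hypothetical validation labels and predictions of three candidate trees.
actual = ["D", "R", "D", "R", "D"]
predictions = {
    "tree_a": ["D", "D", "D", "D", "D"],
    "tree_b": ["D", "R", "D", "R", "R"],
    "tree_c": ["D", "R", "R", "R", "R"],
}
best, rates = select_best(actual, predictions)
print(best, rates)
```

Computing the rates on the validation data rather than the training data keeps the selection from simply rewarding the tree that memorised the training records best.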