Comparison and evaluation of decision tree classifiers

Description

This process illustrates how to fit decision trees using different impurity measures and how to compare the resulting models on the Congressional Voting Records dataset. After the decision trees are built using the training and validation datasets, the best model is selected by the model comparison operator (Model Comparison) on the basis of its performance on the validation dataset. Finally, the quality of the classification can be assessed on the test dataset. Using the resulting model we can perform scoring, that is, the evaluation of the test set or of any dataset where the value of the target variable is unknown.
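
The workflow itself is built from graphical operators, but the same logic can be sketched in code. Below is a minimal illustration in Python with scikit-learn; note that scikit-learn offers the Gini and entropy splitting criteria but no chi-square criterion, and the file name and column names are assumptions about how the Congressional Voting Records data might be stored.

# A sketch of the process: partition the data, fit one decision tree per
# splitting criterion, select the best tree on the validation set, and
# score the test set. Path and column names are placeholders.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

data = pd.read_csv("house-votes-84.csv")          # placeholder file name
X = pd.get_dummies(data.drop(columns=["party"]))  # votes as indicator columns
y = data["party"]                                 # target: democrat / republican

# Three-way partition into training, validation, and test sets.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, train_size=0.4, random_state=0)
X_valid, X_test, y_valid, y_test = train_test_split(X_rest, y_rest, train_size=0.5, random_state=0)

# Fit one tree per available impurity measure.
models = {c: DecisionTreeClassifier(criterion=c, random_state=0).fit(X_train, y_train)
          for c in ("gini", "entropy")}

# Model comparison: keep the tree with the best validation accuracy.
best = max(models, key=lambda c: accuracy_score(y_valid, models[c].predict(X_valid)))

# Scoring: apply the selected model to records with an unknown target.
print(best, accuracy_score(y_test, models[best].predict(X_test)))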

Input

Congressional Voting Records [UCI MLR]

Output

In the process we set the following parameters during the partitioning step.

Figure 16.8. The settings of parameters in the partitioning step
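
A possible stand-in for the partitioning step is a stratified three-way split. A minimal sketch follows; the 40/30/30 proportions are illustrative assumptions, not necessarily the values shown in the figure.

# Stratified three-way partition into training, validation, and test sets.
from sklearn.model_selection import train_test_split

def partition(X, y, train=0.4, valid=0.3, test=0.3, seed=0):
    X_train, X_rest, y_train, y_rest = train_test_split(
        X, y, train_size=train, stratify=y, random_state=seed)
    X_valid, X_test, y_valid, y_test = train_test_split(
        X_rest, y_rest, train_size=valid / (valid + test), stratify=y_rest,
        random_state=seed)
    return (X_train, y_train), (X_valid, y_valid), (X_test, y_test)

Stratifying on the target keeps the proportion of the two parties roughly equal across the three partitions, which matches the usual purpose of these settings.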

Using the chi-square impurity measure we obtain the following decision tree.

Figure 16.9. The decision tree using the chi-square impurity measure

Using the entropy impurity measure we obtain the following decision tree.

Figure 16.10. The decision tree using the entropy impurity measure

Using the Gini impurity measure we obtain the following decision tree.

Figure 16.11. The decision tree using the Gini index
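
For reference, the three splitting criteria can be computed directly. The sketch below uses NumPy and SciPy: entropy and the Gini index measure the impurity of a node, while the chi-square criterion prefers the candidate split whose branch-by-class contingency table has the largest chi-square statistic (equivalently, the smallest p-value).

# Entropy and Gini impurity of a node, and the chi-square statistic of a
# candidate split, computed by hand for a binary target.
import numpy as np
from scipy.stats import chi2_contingency

def entropy(p):
    # -sum(p_i * log2(p_i)) over the class proportions p
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def gini(p):
    # 1 - sum(p_i^2) over the class proportions p
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

print(entropy([0.6, 0.4]))   # ~0.971 for a 60%/40% node
print(gini([0.6, 0.4]))      # 0.48 for the same node

# Chi-square test of a candidate split: rows = branches, columns = classes.
table = np.array([[90, 10],
                  [30, 70]])
chi2, p_value, _, _ = chi2_contingency(table)
print(chi2, p_value)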

Interpretation of the results

The first of the three resulting decision trees is the simplest and its fit is the worst. The other two are fairly similar to each other: they use the same input variables in their splits, and only the splitting values differ slightly. The three decision trees can be compared in many ways using graphical tools and statistics. For example, the following chart shows that, on the basis of the cumulative response curve, the decision tree given by the Gini index is the best model if we only need to target the first few deciles.

Figure 16.12. The cumulative response curve of decision trees
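
The cumulative response curve can be reproduced from the scored records. A minimal sketch, assuming y_true marks the target class with 1 and scores holds the model's predicted probabilities:

# Cumulative response: rank the records by score and, at each depth, take
# the proportion of true positives among the records seen so far.
import numpy as np

def cumulative_response(y_true, scores):
    order = np.argsort(scores)[::-1]                 # highest scores first
    hits = np.asarray(y_true, dtype=float)[order]
    depth = np.arange(1, len(hits) + 1) / len(hits)
    return depth, np.cumsum(hits) / np.arange(1, len(hits) + 1)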

To examine the efficiency of the classifiers, the numbers of correctly and incorrectly classified records can be obtained for each of the two parties and represented by a bar chart.

Figure 16.13. The classification plot
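
Such a bar chart can be drawn from the scored test set; a sketch with Matplotlib, assuming the class labels of the voting data:

# Classification plot: correctly and incorrectly classified records per party.
import numpy as np
import matplotlib.pyplot as plt

def classification_plot(y_true, y_pred, classes=("democrat", "republican")):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    correct = [np.sum((y_true == c) & (y_pred == c)) for c in classes]
    wrong = [np.sum((y_true == c) & (y_pred != c)) for c in classes]
    x = np.arange(len(classes))
    plt.bar(x - 0.2, correct, width=0.4, label="correct")
    plt.bar(x + 0.2, wrong, width=0.4, label="incorrect")
    plt.xticks(x, classes)
    plt.legend()
    plt.show()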

A more detailed comparison is possible using the response curve, the lift curve, and their variants. The following figure shows the (non-cumulative) response curve for the three datasets and the three models, together with the best possible and the baseline curves for reference. The bottom right panel shows that, on the test dataset, the decision tree based on the Gini index is close to the optimal one.

Figure 16.14. Response curve of decision trees
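
The non-cumulative variant differs from the cumulative one only in that each score band is evaluated on its own. A sketch, using the same conventions as above:

# Non-cumulative response: the proportion of true positives within each
# decile of the records ranked by score. The overall positive rate serves
# as the baseline; the best possible model scores 1.0 in the early deciles.
import numpy as np

def decile_response(y_true, scores, bins=10):
    order = np.argsort(scores)[::-1]
    hits = np.asarray(y_true, dtype=float)[order]
    response = [band.mean() for band in np.array_split(hits, bins)]
    baseline = hits.mean()
    return response, baseline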

Another possibility is to examine the score distributions. The model fits well if the red and the blue lines are mirror images of each other and their slopes are steep. By this indicator, the entropy-based decision tree is the best.

Figure 16.15. The score distribution of decision trees
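
A score distribution plot of this kind can be drawn as two overlaid histograms; a sketch with Matplotlib, again with assumed label names:

# Score distribution: histograms of the predicted probability of one class,
# drawn separately for the two true classes. For a well-fitting model the
# two histograms pile up at opposite ends of the [0, 1] range.
import numpy as np
import matplotlib.pyplot as plt

def score_distribution(y_true, scores, positive="democrat"):
    y_true, scores = np.asarray(y_true), np.asarray(scores)
    plt.hist(scores[y_true == positive], bins=20, alpha=0.5, label=positive)
    plt.hist(scores[y_true != positive], bins=20, alpha=0.5, label="other")
    plt.xlabel("predicted probability of " + positive)
    plt.legend()
    plt.show()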

The three trees can be improved further by changing the significance level. As a consequence, the decision tree is built differently from the original one, and the number and distribution of correctly and incorrectly classified records change as well. The performance of the resulting models can be read from the following figure, in which the misclassification rate, one of the most important indicators, is highlighted.

Figure 16.16. The main statistics of decision trees
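
scikit-learn has no significance-level setting for its trees; the closest analogue for growing simpler or more complex trees is a pruning parameter such as ccp_alpha. A sketch of the comparison, reusing the partitioned sets from the first code example:

# Vary the pruning strength and compare misclassification rates on the
# validation set (X_train, y_train, X_valid, y_valid as defined earlier).
from sklearn.tree import DecisionTreeClassifier

for alpha in (0.0, 0.005, 0.01, 0.02):
    tree = DecisionTreeClassifier(criterion="gini", ccp_alpha=alpha,
                                  random_state=0).fit(X_train, y_train)
    misclass = 1.0 - tree.score(X_valid, y_valid)  # misclassification rate
    print(alpha, round(misclass, 3))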

Workflow

sas_dtree_exp2.xml

Keywords

classification
decision tree
performance
evaluation

Operators

Data Source
Decision Tree
Model Comparison
Data Partition
Score