Comparison of decision tree classifiers

Description

The process shows, using the Spambase dataset, how the quality and efficiency of multiple classifiers can be compared. After the decision trees of the classifiers have been built on the training set, the test set can be classified using them, and the quality of the individual classifications can be examined. This can be done separately, by measuring the precision of the classifiers one by one, or the analyses can be merged and the ROC curves of the individual classifiers plotted on a common figure, which shows the differences between the results more clearly. Based on the resulting evaluation, it can be decided which classifier best suits the requirements of the process, whether a given model should be improved, or whether a given model has to be replaced or removed because of its poor performance.

Input

Spambase [UCI MLR]

Output

Let us create the following two decision tree classifiers based on the training set of the data set:

Figure 5.27. Preferences for the building of the decision tree based on the Gini-index criterion

Figure 5.28. Preferences for the building of the decision tree based on the gain ratio criterion
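The two classifiers can be sketched in code as well. Below is a minimal scikit-learn analogue of the two RapidMiner Decision Tree operators, using synthetic two-class data as a hypothetical stand-in for Spambase. Note that scikit-learn offers the "gini" and "entropy" (information gain) criteria but has no gain-ratio criterion, so "entropy" is used here as the closest available analogue; all parameter values are illustrative assumptions, not the settings shown in the figures.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class data standing in for the Spambase attributes (assumption).
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# One tree per split criterion; "entropy" stands in for gain ratio here.
tree_gini = DecisionTreeClassifier(criterion="gini", max_depth=10,
                                   random_state=42).fit(X_train, y_train)
tree_entropy = DecisionTreeClassifier(criterion="entropy", max_depth=10,
                                      random_state=42).fit(X_train, y_train)
```

Both models are trained on the same training set, so any difference in their test-set behaviour can be attributed to the split criterion.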

The classifier using the gain ratio builds the following decision tree:

Figure 5.29. Graphic representation of the decision tree created based on the gain ratio criterion

When applied to the test set, this decision tree yields the following performance values:

Figure 5.30. Performance vector of the classification based on the decision tree built using the gain ratio criterion

In contrast, the classifier using the Gini-index builds the following decision tree:

Figure 5.31. Graphic representation of the decision tree created based on the Gini-index criterion

When applied to the test set, this decision tree yields the following performance values:

Figure 5.32. Performance vector of the classification based on the decision tree built using the Gini-index criterion
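A performance vector of the kind shown above can be approximated in scikit-learn with the standard metric functions; the sketch below (again on hypothetical synthetic data, not the actual Spambase figures) computes the accuracy and the confusion matrix for one of the trees, which correspond to the main entries of the RapidMiner Performance operator's output.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in data; the real process uses Spambase.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = DecisionTreeClassifier(criterion="gini", max_depth=10,
                               random_state=42).fit(X_train, y_train)
y_pred = model.predict(X_test)

acc = accuracy_score(y_test, y_pred)      # overall classification accuracy
cm = confusion_matrix(y_test, y_pred)     # rows: true class, cols: predicted
```

From the confusion matrix the class-wise precision and recall values of the performance vector can also be derived (e.g. with `sklearn.metrics.classification_report`).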

Interpretation of the results

It can be seen that the performance of the classifier utilizing the Gini-index is better than that of the classifier based on the gain ratio. However, the difference between individual models is not always this obvious, and, both to simplify the process and to avoid differences caused by sampling, the evaluation of the individual models can be merged into a single complex operator, so that the ROC curves of the individual classifiers can be displayed on a single figure, for example as follows:

Figure 5.33. Settings of the operator for the comparison of ROC curves

Figure 5.34. Subprocess of the operator for the comparison of ROC curves

In this case, an arbitrary number of model-building operators can be placed in the complex operator, so the precision of all of these models can be examined on the same data set at once. However, it is advisable to use a local random seed to make the comparison repeatable, as this ensures that every execution splits the records into training and test sets in the same manner.
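The effect of fixing the seed can be illustrated with scikit-learn's `train_test_split`, used here as a hypothetical stand-in for the Split Data operator with a local random seed: two runs with the same seed produce byte-for-byte identical splits, so any difference between the compared models cannot be due to sampling.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 50 records with 2 attributes and a binary label.
data = np.arange(100).reshape(50, 2)
labels = np.arange(50) % 2

# Same fixed seed -> identical split on every execution.
split_a = train_test_split(data, labels, test_size=0.3, random_state=7)
split_b = train_test_split(data, labels, test_size=0.3, random_state=7)
assert all((a == b).all() for a, b in zip(split_a, split_b))
```

Without `random_state` (i.e. without a local random seed), each execution would draw a different split, and repeated comparisons of the same models could yield different rankings.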

Figure 5.35. Comparison of the ROC curves of the two decision tree classifiers

Based on the ROC curves, it is obvious that the classifier based on the Gini-index performs considerably better than the classifier using the gain ratio, as its ROC curve bends further from the diagonal between the points (0,0) and (1,1), towards the point (0,1).
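The combined ROC comparison can be sketched outside RapidMiner as well. The code below (a minimal illustration on hypothetical synthetic data, with "entropy" again standing in for the gain ratio criterion) computes the ROC curve and the area under it (AUC) for both trees from their predicted class probabilities.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in data for Spambase.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

curves, aucs = {}, {}
for criterion in ("gini", "entropy"):
    model = DecisionTreeClassifier(criterion=criterion, max_depth=5,
                                   random_state=42).fit(X_tr, y_tr)
    scores = model.predict_proba(X_te)[:, 1]         # class-1 probabilities
    curves[criterion] = roc_curve(y_te, scores)[:2]  # (fpr, tpr) points
    aucs[criterion] = roc_auc_score(y_te, scores)
```

Plotting each `(fpr, tpr)` pair on one matplotlib axes reproduces the combined figure; the model whose curve bends closest to the point (0,1), and whose AUC is largest, is the better classifier.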

Video

Workflow

dtree_exp5.rmp

Keywords

classification
decision tree
performance
comparison
ROC curve

Operators

Apply Model
Compare ROCs
Decision Tree
Multiply
Performance
Read AML
Split Data