Under- and overfitting of a classification with a decision tree

Description

Using the Zoo dataset, this process shows the conditions under which underfitting and overfitting can appear when classifying with decision trees. If the maximal depth of the tree that serves as the model is too small, the tree cannot capture the full structure of the training set and is therefore unable to classify the records properly. This is a case of underfitting. If, on the other hand, the records are split more than required, the extra splitting rules encode conclusions that no longer hold in general, and they can lead to wrong decisions, for example when they are fitted to irregular records. This is a case of overfitting.

Input

Zoo [UCI MLR]

Output

In this process, three operators build similar decision trees from the same training set; they differ only in the stopping condition that sets the maximal depth of the tree. The maximal depth is 3, 6, and 9, respectively.
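The effect of the depth limit can also be sketched outside RapidMiner. The following minimal Python example is an illustration only, not the workflow's implementation: it grows an information-gain tree with a `max_depth` stopping condition on a handful of hand-made records (the attribute names resemble those of the Zoo dataset, but the rows are invented), and it counts depth as the number of split levels, which may not match the convention of the Decision Tree operator. Because the toy data has only three attributes, the depths 1, 2, and 3 are used instead of 3, 6, and 9.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(n / total * log2(n / total) for n in Counter(labels).values())

def best_attribute(rows, labels, attributes):
    """Return the attribute with the highest information gain, or None if no split helps."""
    base = entropy(labels)
    best, best_gain = None, 0.0
    for attr in attributes:
        groups = {}
        for row, label in zip(rows, labels):
            groups.setdefault(row[attr], []).append(label)
        remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
        gain = base - remainder
        if gain > best_gain + 1e-12:  # ties keep the earlier attribute
            best, best_gain = attr, gain
    return best

def build_tree(rows, labels, attributes, max_depth):
    """Grow a tree; max_depth is the number of split levels still allowed."""
    majority = Counter(labels).most_common(1)[0][0]
    # Stopping conditions: depth limit reached or node already pure.
    if max_depth == 0 or len(set(labels)) == 1:
        return majority
    attr = best_attribute(rows, labels, attributes)
    if attr is None:
        return majority
    partitions = {}
    for row, label in zip(rows, labels):
        part_rows, part_labels = partitions.setdefault(row[attr], ([], []))
        part_rows.append(row)
        part_labels.append(label)
    remaining = [a for a in attributes if a != attr]
    children = {
        value: build_tree(r, l, remaining, max_depth - 1)
        for value, (r, l) in partitions.items()
    }
    return {"attr": attr, "default": majority, "children": children}

def predict(tree, row):
    """Follow the splitting rules until a leaf (a class label) is reached."""
    while isinstance(tree, dict):
        tree = tree["children"].get(row[tree["attr"]], tree["default"])
    return tree

# Hand-made toy records (NOT the Zoo data): two animals per class.
attributes = ["milk", "feathers", "fins"]
rows = [
    {"milk": 1, "feathers": 0, "fins": 0}, {"milk": 1, "feathers": 0, "fins": 0},
    {"milk": 0, "feathers": 1, "fins": 0}, {"milk": 0, "feathers": 1, "fins": 0},
    {"milk": 0, "feathers": 0, "fins": 1}, {"milk": 0, "feathers": 0, "fins": 1},
    {"milk": 0, "feathers": 0, "fins": 0}, {"milk": 0, "feathers": 0, "fins": 0},
]
labels = ["mammal", "mammal", "bird", "bird", "fish", "fish", "insect", "insect"]

# Same training set, same learner; only the depth limit changes.
correct_counts = []
for depth in (1, 2, 3):
    tree = build_tree(rows, labels, attributes, depth)
    correct = sum(predict(tree, r) == l for r, l in zip(rows, labels))
    correct_counts.append(correct)
    print(f"max_depth={depth}: {correct} of {len(rows)} training records correct")
```

The sketch only illustrates the underfitting side: a tree that is cut off too early has fewer leaves than there are classes, so some classes are necessarily merged. It evaluates on the training records themselves, so it cannot show overfitting, which only becomes visible on a separate test set as in the process described here.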

Figure 5.6. Setting a threshold for the maximal depth of the decision tree

Interpretation of the results

Accordingly, decision trees of different depths are created. They contain different numbers of splitting conditions, so the different models classify the records of the test set differently. If the maximal depth is 3, the following decision tree is obtained:

Figure 5.7. Graphic representation of the decision tree created

Figure 5.8. Graphic representation of the classification of the records based on the decision tree

Here, with only 2 rules, the model cannot separate the 7 possible classes, so this is a clear case of underfitting. If the maximal depth is 6, the following decision tree is obtained:

Figure 5.9. Graphic representation of the decision tree created with the increased maximal depth

Figure 5.10. Graphic representation of the classification of the records based on the decision tree with increased maximal depth

In this case, only 3 records are classified differently from their original labels. However, increasing the threshold for the maximal depth further does not improve the result; it worsens it, because the additional rules lead to wrong conclusions, i.e. this is a case of overfitting. If the maximal depth is 9, the following decision tree is obtained:

Figure 5.11. Graphic representation of the decision tree created with the further increased maximal depth

Figure 5.12. Graphic representation of the classification of the records based on the decision tree with further increased maximal depth
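
Comparing the models in this way amounts to counting the records whose predicted class differs from the original label. A minimal Python sketch of that count follows; the two lists are made-up stand-ins for the label and prediction columns produced by Apply Model, not the actual Zoo results.

```python
def misclassified(labels, predictions):
    """Return the positions where the prediction differs from the original label."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    return [i for i, (label, pred) in enumerate(zip(labels, predictions)) if label != pred]

# Hypothetical label and prediction columns (illustrative values only).
labels      = ["mammal", "bird", "fish", "insect", "bird", "reptile"]
predictions = ["mammal", "bird", "fish", "bird",   "bird", "fish"]

wrong = misclassified(labels, predictions)
error_rate = len(wrong) / len(labels)
print(f"{len(wrong)} of {len(labels)} records misclassified (error rate {error_rate:.2f})")
```

Running the same count on the test set for each of the three models gives a direct way to compare them: the count is expected to drop when moving from an underfitted tree to an adequate one, and to rise again once the tree overfits.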

Workflow

dtree_exp2.rmp

Keywords

classification
decision tree
overfitting
underfitting

Operators

Apply Model
Decision Tree
Multiply
Read AML
Split Data