This process shows, using the Zoo dataset, under which conditions underfitting and overfitting can occur when performing classification with decision trees. If the decision tree that provides the model is too shallow, it may be unable to capture the structure of the training set in its entirety, and is therefore unsuitable for carrying out the classification properly. This is a case of underfitting. However, if the records are split more than necessary, conclusions may be drawn along decision paths that no longer hold, and this excess of splitting rules can lead to inappropriate decisions, for example on irregular records. This is considered a case of overfitting.
Zoo [UCI MLR]
In this process, operators building similar decision trees on the same training set are used; they differ only in the stopping condition that defines the maximal depth of the tree. The values of the maximal depth are 3, 6, and 9, respectively.
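The effect of the depth limit can also be reproduced outside RapidMiner. The sketch below uses scikit-learn rather than the operators of the original process, and a synthetic 7-class dataset as a stand-in for the Zoo data (the real table is available from the UCI Machine Learning Repository); the dataset shape and all parameter values here are illustrative assumptions, not taken from the process itself.

```python
# Hedged sketch: decision trees of maximal depth 3, 6, and 9 fitted to the
# same training set, mirroring the three operators of the process.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic 7-class data standing in for the Zoo dataset (an assumption;
# the real data has 101 animals described by 16 attributes).
X, y = make_classification(n_samples=300, n_features=16, n_informative=8,
                           n_classes=7, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scores = {}
for depth in (3, 6, 9):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    # Training accuracy grows with depth; test accuracy may stall or drop
    # once the extra splitting rules start fitting noise (overfitting).
    scores[depth] = (tree.score(X_train, y_train),
                     tree.score(X_test, y_test))
    print(f"max_depth={depth}: train={scores[depth][0]:.3f}, "
          f"test={scores[depth][1]:.3f}")
```

Comparing the train and test columns of the printout is the usual way to spot both failure modes: a shallow tree scores poorly on both sets (underfitting), while a deep tree scores well on the training set but worse on the test set (overfitting).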
Accordingly, decision trees of different depths are created, which therefore contain different numbers of splitting conditions; based on these, the records of the test set are classified differently by the different models. If the value of the maximal depth is 3, the following decision tree is obtained:
It can be seen here that, using only 2 rules, the model cannot separate the 7 possible classes, so this is a clear case of underfitting. If the value of the maximal depth is 6, the following decision tree is obtained:
Figure 5.10. Graphic representation of the classification of the records based on the decision tree with increased maximal depth
In this case, only 3 of the records are classified differently from their original labels. However, if the threshold for the maximal depth is increased further, the result will not improve but rather worsen, as the additional rules lead to inappropriate conclusions, i.e. this is a case of overfitting. If the value of the maximal depth is 9, the following decision tree is obtained:
Figure 5.11. Graphic representation of the decision tree created with the further increased maximal depth
Figure 5.12. Graphic representation of the classification of the records based on the decision tree with further increased maximal depth