The process demonstrates how to classify with the Decision Tree operator in the case when the target is a nominal attribute. Here the Wine dataset is used, and the target variable has three values. In order to build a decision tree classifier, it is worth dividing the dataset into training and validation sets. The algorithm then finds the current best splitting rule on the training set, and the growth of the tree is stopped, using the validation set, when the algorithm does not find a significant split.
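The split-then-stop idea above can be sketched in code. This is a minimal illustration using scikit-learn (not the tool described in the text): trees of increasing depth are grown on the training set, and the depth whose accuracy on the held-out validation set is best is kept, a simple stand-in for stopping growth when no further significant split is found.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)  # 3 wine classes, 13 attributes

# Divide the dataset into training and validation parts.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# Grow trees of increasing depth; keep the depth that scores best
# on the validation set.
best_depth, best_score = None, 0.0
for depth in range(1, 11):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    score = tree.score(X_val, y_val)
    if score > best_score:
        best_depth, best_score = depth, score

print(best_depth, round(best_score, 3))
```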
In the partitioning step, a test dataset can also be separated in order to measure the generalization ability of the resulting tree, but here this is not recommended due to the limited size of the dataset. The decision tree resulting from the process can be displayed, showing the decisions made at the splits of the model. Using the principle of majority voting, the algorithm decides which class label should be assigned to each leaf (terminal node).
Dataset: Wine [UCI MLR]
In the case of a nominal target variable, we can decide about the execution of each split on the basis of various impurity measures such as chi-square, the Gini index, or entropy. For these, and for the reliability of the splitting, a parameter value can be specified depending on the chosen measure. In addition, the stopping condition of the splitting can be determined by giving the minimum size of a set of records that can still be divided further, or the maximal depth of the tree. We may also set the maximum number of branches of a node; the default is 2, that is, the algorithm builds a binary tree.
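These choices correspond to hyperparameters in most implementations. The sketch below, again using scikit-learn as an illustrative assumption, sets the impurity measure and the two stopping conditions mentioned above; note that scikit-learn's CART implementation offers Gini and entropy (but not chi-square) and always builds binary trees.

```python
from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)

tree = DecisionTreeClassifier(
    criterion="entropy",   # impurity measure used to rank candidate splits
    min_samples_split=10,  # smallest set of records still divided further
    max_depth=4,           # maximal depth of the tree
    random_state=0,
)
tree.fit(X, y)
print(tree.get_depth())
```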
It is also possible to decide whether missing values should be used as a possible value in the splits. We can also decide whether the input attributes may be used only once or several times when the decision tree is built.
In the partitioning of the dataset, different sampling methods can be chosen, and the proportions among the training, validation, and test datasets can be determined. This partitioning can be carried out simply by considering the order of the records, randomly, or by stratifying with respect to the target variable. Stratified sampling ensures the same proportion of each class in the training, validation, and test sets.
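The effect of stratification can be checked directly. In this sketch (scikit-learn assumed, as before), the class proportions of a stratified training sample are compared with those of the full Wine dataset:

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)

# Stratified sampling: class proportions in the parts match the
# proportions of the full dataset.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

full = np.bincount(y) / len(y)
train = np.bincount(y_tr) / len(y_tr)
print(np.round(full, 2), np.round(train, 2))
```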
The results of the classification can be seen in the decision tree for both the training and validation datasets, including the number of records of each class in every vertex of the tree. On the edges between vertices, the variables that define the splits and their splitting values are presented. The thickness of the lines is proportional to the number of records concerned.
The evaluation of the resulting decision tree is supported by numerous statistical indicators and graphical tools. The most important of them are displayed in multiple windows at a time, where comparisons can be made; these windows can also be opened one by one from the View menu. With the help of these tools, wrong decisions can be filtered out and the modeling process can be tuned using further background information or domain knowledge. An interactive tree-building process also helps here.
The response curve above shows, for the training and validation datasets, what percentage of the records are classified correctly when the records are ranked according to their goodness. The curve is generally monotonically decreasing.
In the Fit Statistics table, different indicators of the fit of the decision tree classifier produced by the algorithm can be seen. The simplest and most important among them is the misclassification rate (in the red circle), which shows the proportion of wrongly classified records.
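The misclassification rate is simply one minus the accuracy. A minimal sketch of computing it on a validation set (scikit-learn assumed):

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1)

tree = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X_tr, y_tr)

# Misclassification rate = 1 - accuracy, i.e. the proportion of
# validation records assigned to the wrong class.
misclass_rate = 1.0 - tree.score(X_va, y_va)
print(round(misclass_rate, 3))
```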
On the classification bar chart, we can examine in detail for which classes the model works well or poorly.
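The same per-class picture can be obtained numerically from a confusion matrix; the per-class recall (diagonal divided by row sums) shows which classes the model handles well or poorly. Again a scikit-learn sketch, not the chart described in the text:

```python
from sklearn.datasets import load_wine
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=2)

tree = DecisionTreeClassifier(max_depth=4, random_state=2).fit(X_tr, y_tr)
cm = confusion_matrix(y_va, tree.predict(X_va))

# Per-class recall: fraction of each true class classified correctly.
recall = cm.diagonal() / cm.sum(axis=1)
print(recall.round(2))
```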
From the figure, it can be seen how the resulting decision tree relates to the best possible model, based on the cumulative lift value.
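Cumulative lift at depth p is the share of all target records captured in the top p fraction of records, ranked by the model's score, divided by p; a random model has lift 1 everywhere. The sketch below computes it for a hypothetical two-class setup in which class 0 of the Wine data is treated as the "target" event (an assumption made here purely for illustration):

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
t = (y == 0).astype(int)  # treat class 0 as the "target" event

X_tr, X_va, t_tr, t_va = train_test_split(
    X, t, test_size=0.3, stratify=t, random_state=3)

tree = DecisionTreeClassifier(max_depth=3, random_state=3).fit(X_tr, t_tr)
score = tree.predict_proba(X_va)[:, 1]  # P(target) for each record

# Rank records by score, then compute cumulative lift at every depth.
order = np.argsort(-score)
hits = np.cumsum(t_va[order])
n = len(t_va)
depth = np.arange(1, n + 1) / n
lift = (hits / t_va.sum()) / depth
print(round(lift[n // 5], 2))  # lift at roughly 20% depth
```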
The variable importance table shows which variables are involved in the decisions of the decision tree, and with what importance. This is a useful tool for users who possess domain knowledge.
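In code, a fitted tree exposes such a measure directly. In scikit-learn (used here as an illustrative assumption) these are impurity-based importances that sum to 1:

```python
from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier

data = load_wine()
tree = DecisionTreeClassifier(max_depth=4, random_state=0)
tree.fit(data.data, data.target)

# Impurity-based importances: how much each attribute contributes
# to the splits of the fitted tree.
imp = tree.feature_importances_
for name, value in sorted(zip(data.feature_names, imp),
                          key=lambda p: -p[1])[:3]:
    print(f"{name}: {value:.3f}")
```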