The influence of the number of base classifiers on the performance of the random forest

Description

The process demonstrates the influence of the number of base classifiers on the classification error rate of the random forest on the Heart Disease data set. The number of base classifiers (i.e., decision trees) is increased from 1 to 20 in the experiment, and the average classification error rate of the random forest from 10-fold cross-validation is determined in each step. The impurity measure used for the decision trees is the gain ratio.
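The experiment can also be reproduced outside RapidMiner, for instance with scikit-learn. The following is a minimal sketch, not the original workflow: the file name heart_disease.csv and the label column target are hypothetical, and since scikit-learn does not offer the gain ratio as an impurity measure, entropy is used as the closest stand-in.

    # Minimal sketch of the experiment in scikit-learn (not the original
    # RapidMiner workflow). "heart_disease.csv" and the "target" column
    # are assumed names for illustration.
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    data = pd.read_csv("heart_disease.csv")  # hypothetical file name
    X, y = data.drop(columns="target"), data["target"]

    for n_trees in range(1, 21):  # 1 to 20 base classifiers
        forest = RandomForestClassifier(
            n_estimators=n_trees,
            criterion="entropy",  # gain ratio is unavailable; entropy is
                                  # the closest impurity measure here
            random_state=0,
        )
        # 10-fold cross-validated accuracy; error rate = 1 - mean accuracy
        accuracy = cross_val_score(forest, X, y, cv=10).mean()
        print(f"{n_trees:2d} trees: error rate = {1 - accuracy:.3f}")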

Note

The experiment is the same as the previous two; the only difference is that the Random Forest operator is used here instead of the Bagging and AdaBoost operators.

Input

Heart Disease [UCI MLR]

Note

The data set was donated to the UCI Machine Learning Repository by R. Detrano [Detrano et al.].

Output

Figure 9.5. The average error rate of the random forest obtained from 10-fold cross-validation against the number of base classifiers.

Interpretation of the results

The figure shows that the best average classification error rate (19.1%) is achieved when the number of base classifiers is 10.

Note that the best performance obtained is slightly better than that of AdaBoost (22.7%), but it requires more base classifiers. Moreover, the performance of AdaBoost behaves more predictably than that of the random forest.
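To check this comparison side by side, the sketch above can be repeated with boosting. Again this is only an illustration, with scikit-learn's AdaBoostClassifier standing in for the RapidMiner AdaBoost operator; the data-loading assumptions are the same as before.

    # Companion sketch: the same loop with AdaBoost instead of a random
    # forest, so the two error-rate curves can be compared directly.
    import pandas as pd
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.model_selection import cross_val_score

    data = pd.read_csv("heart_disease.csv")  # same hypothetical file name
    X, y = data.drop(columns="target"), data["target"]

    for n_learners in range(1, 21):  # 1 to 20 base classifiers
        # The default base estimator is a depth-1 decision tree (stump).
        booster = AdaBoostClassifier(n_estimators=n_learners, random_state=0)
        accuracy = cross_val_score(booster, X, y, cv=10).mean()
        print(f"{n_learners:2d} learners: error rate = {1 - accuracy:.3f}")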

Video

Workflow

ensemble_exp4.rmp

Keywords

random forest
ensemble methods
supervised learning
error rate
cross-validation
classification

Operators

Apply Model
Log
Loop Parameters
Map
Performance (Classification)
Random Forest
Read CSV
X-Validation