Using the Titanic dataset, this process shows how the usability and efficiency of association rules extracted from a dataset can be improved by first partitioning the dataset into subsets according to the relationships among its variables, and then extracting association rules separately for each subset. Once the rules have been extracted, their support can be evaluated and, as in classification tasks, one can check how well the original values of the dataset can be predicted from the rules. If these measures fall short of the expected levels, one possible reason is the so-called Simpson's paradox: because of some hidden factor, a relationship between variables that holds within the individual groups can weaken, disappear, or even reverse direction when the groups are pooled. If such a factor is discovered, splitting the dataset along it can improve the performance of the association rules.
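The reversal described by Simpson's paradox can be illustrated with a few lines of Python. This is only a minimal sketch: the counts are made-up toy numbers, not Titanic figures, and the names `g1`, `g2`, `a`, and `b` are hypothetical labels for a hidden grouping factor and for the values of an explanatory variable.

```python
# Made-up toy counts (NOT Titanic figures), chosen only to show the effect.
# Key: (hidden subgroup, value of x); value: (positive outcomes, total records).
counts = {
    ("g1", "a"): (9, 10),
    ("g1", "b"): (80, 100),
    ("g2", "a"): (30, 100),
    ("g2", "b"): (2, 10),
}


def rate(pairs):
    """Pooled positive rate over an iterable of (positives, total) pairs."""
    pairs = list(pairs)
    return sum(p for p, _ in pairs) / sum(t for _, t in pairs)


# Outcome rate for each value of x inside each subgroup.
within = {key: rate([value]) for key, value in counts.items()}

# Outcome rate for each value of x once the subgroups are pooled,
# that is, when the hidden factor is ignored.
pooled = {
    x: rate(v for (_, x2), v in counts.items() if x2 == x)
    for x in ("a", "b")
}

for group in ("g1", "g2"):
    print(group, {x: round(within[(group, x)], 3) for x in ("a", "b")})
print("pooled", {x: round(r, 3) for x, r in pooled.items()})
```

In these toy counts, `a` has the higher rate inside both subgroups, yet `b` has the higher rate in the pooled data, because the two subgroups contribute very different numbers of records to each value of the variable.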
With this dataset, one can examine whether the age, sex, and class of the Titanic's passengers influenced their chances of survival. After the variables have been converted appropriately, the dataset is split into a training set and a test set; the association rules derived from the training set are then applied to the test set to determine how usable they are. If this is done on the whole dataset, however, the results are relatively poor for support and, consequently, for performance as well.
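The quantities involved in this evaluation can be sketched in plain Python. The sketch assumes records stored as dictionaries; the attribute names (`sex`, `class`, `survived`), the toy records, and the helper functions are all hypothetical, and the simple "first matching rule, otherwise default" prediction scheme is an assumption for illustration, not necessarily how the process itself applies the rules.

```python
def matches(record, items):
    """True if the record contains every attribute=value pair in items."""
    return all(record.get(k) == v for k, v in items.items())


def support(records, items):
    """Fraction of all records that contain the given items."""
    return sum(matches(r, items) for r in records) / len(records)


def confidence(records, antecedent, consequent):
    """Among records matching the antecedent, the share matching the consequent."""
    covered = [r for r in records if matches(r, antecedent)]
    if not covered:
        return 0.0
    return sum(matches(r, consequent) for r in covered) / len(covered)


def accuracy(records, rules, target, default):
    """Predict target with the first rule whose antecedent matches, else default."""
    hits = 0
    for r in records:
        predicted = default
        for antecedent, value in rules:
            if matches(r, antecedent):
                predicted = value
                break
        hits += predicted == r[target]
    return hits / len(records)


# Hypothetical toy records, invented for illustration only.
train = [
    {"sex": "female", "class": "1st", "survived": "yes"},
    {"sex": "female", "class": "3rd", "survived": "yes"},
    {"sex": "male", "class": "1st", "survived": "no"},
    {"sex": "male", "class": "3rd", "survived": "no"},
    {"sex": "male", "class": "3rd", "survived": "yes"},
]
test = [
    {"sex": "female", "class": "3rd", "survived": "yes"},
    {"sex": "male", "class": "1st", "survived": "no"},
    {"sex": "male", "class": "3rd", "survived": "yes"},
    {"sex": "female", "class": "1st", "survived": "no"},
]

# One candidate rule: sex=female -> survived=yes.
rule = ({"sex": "female"}, "yes")
rule_support = support(train, {"sex": "female", "survived": "yes"})
rule_confidence = confidence(train, {"sex": "female"}, {"survived": "yes"})
test_accuracy = accuracy(test, [rule], target="survived", default="no")
print(rule_support, rule_confidence, test_accuracy)
```

Support and confidence are measured on the training records, while the accuracy is measured on the held-out test records, which mirrors the training and test split described above.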
The contingency table of the dataset, broken down for example by the age and the class of the passengers, shows however that the individual variables influence the variable of interest, survival, so strongly that the effects of the individual groups can cancel each other out in the dataset as a whole. It can therefore be more advantageous to split the dataset along these variables and to extract the association rules separately within each subset.
To do this, for example when the dataset is split by the age of the passengers, the appropriate records are filtered first. The variable used as the filtering condition can then be removed as well, since within a subset it takes a single value and therefore carries only redundant information.
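The filtering step and the removal of the now-constant variable can be sketched as follows. This is an illustrative sketch only: the helper `split_and_drop`, the attribute names, and the toy records are hypothetical and are not taken from the process itself.

```python
from collections import defaultdict


def split_and_drop(records, key):
    """Group records by the value of `key` and drop `key` from each copy."""
    subsets = defaultdict(list)
    for r in records:
        # Copy the record without the splitting attribute; the original
        # record is left unchanged.
        reduced = {k: v for k, v in r.items() if k != key}
        subsets[r[key]].append(reduced)
    return dict(subsets)


# Hypothetical toy records, invented for illustration only.
passengers = [
    {"age": "adult", "sex": "male", "class": "3rd", "survived": "no"},
    {"age": "adult", "sex": "female", "class": "1st", "survived": "yes"},
    {"age": "child", "sex": "male", "class": "2nd", "survived": "yes"},
]

subsets = split_and_drop(passengers, "age")
for value, rows in subsets.items():
    print(value, len(rows), sorted(rows[0]))
```

Each resulting subset can then be split into its own training and test sets, so that rules are extracted and evaluated within one group at a time.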
Next, training and test sets are created for the subsets of adults and children separately, association rules are extracted from each, and their performance is evaluated. The subset of adults yielded the following results:
Figure 10.26. Performance vector for the application of association rules generated regarding survival for the subset of adults
The subset of children yielded the following results:
Figure 10.28. Performance vector for the application of association rules generated regarding survival for the subset of children
It can be seen that such splits of the dataset can increase performance remarkably, because they remove the interference between the effects of the groups. For the group of children the improvement is much smaller, but this can be explained by the much smaller number of records in that subset.