Performance of association rules - Simpson's paradox

Description

Using the Titanic dataset, the process shows how the usability and efficiency of association rules can be improved by splitting a dataset into subsets along the relationships between its variables and extracting association rules separately for each subset. Once the rules have been extracted, their support can be evaluated, and, as in classification tasks, it can be checked how well the original values of the dataset can be predicted from the rules. If these values fall short of the expected levels, one possible reason is the so-called Simpson's paradox: because of some hidden factor, the relationship between variables can weaken, disappear, or even reverse. If such factors are found, splitting the dataset along them can improve the performance of the association rules.
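The reversal described above can be reproduced with a few numbers. The following is a minimal sketch in Python, not part of the RapidMiner workflow; the counts, the item name X, and the stratum names are hypothetical and are not taken from the Titanic data.

```python
# Hypothetical counts, NOT taken from the Titanic data: (survivors, total)
# for records with and without some item X, inside two strata of a hidden factor.
counts = {
    "stratum_1": {"x_present": (81, 87), "x_absent": (234, 270)},
    "stratum_2": {"x_present": (192, 263), "x_absent": (55, 80)},
}


def rate(survivors, total):
    """Share of survivors, i.e. the confidence of 'group -> survived'."""
    return survivors / total


def pooled(counts, key):
    """Add up the counts of one group over all strata."""
    survivors = sum(stratum[key][0] for stratum in counts.values())
    total = sum(stratum[key][1] for stratum in counts.values())
    return survivors, total


# Survival rate with and without X inside each stratum ...
for name, stratum in counts.items():
    print(name, rate(*stratum["x_present"]), rate(*stratum["x_absent"]))

# ... and in the pooled data, where the hidden factor is ignored.
print("pooled", rate(*pooled(counts, "x_present")), rate(*pooled(counts, "x_absent")))
```

With these hypothetical counts, records containing X have the higher survival rate inside each stratum, but the lower one in the pooled data, because most records with X fall into the stratum whose survival rate is lower overall.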

Input

Titanic [Titanic]

Output

Using this dataset, it can be examined whether the age, sex, and class of the Titanic's passengers influenced their chances of survival. After the variables have been converted appropriately, the dataset can be split into a training set and a test set; applying the association rules derived from the training set to the test set then shows how usable the rules are. However, if this is done on the whole dataset, the results are relatively poor for support and, as a consequence, for performance as well:

Figure 10.20. List of the association rules generated regarding survival

Figure 10.21. Performance vector for the application of association rules generated
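The procedure described above (encode the records as items, derive rules from a training set, apply them to a test set) can be sketched in plain Python. This is only an illustration under stated assumptions: it enumerates antecedents by brute force instead of using FP-Growth, the function names and thresholds are invented for the example, and the records are hypothetical rather than taken from the Titanic data.

```python
from itertools import combinations


def to_items(record, exclude=()):
    """Encode a record as a set of 'attribute=value' items."""
    return frozenset(f"{k}={v}" for k, v in record.items() if k not in exclude)


def mine_rules(records, target, min_support=0.1, min_confidence=0.6):
    """Brute-force rules 'antecedent -> target=value' (not FP-Growth)."""
    n = len(records)
    antecedents = set()
    for r in records:
        items = sorted(to_items(r, exclude=(target,)))
        for size in range(1, len(items) + 1):
            antecedents.update(frozenset(c) for c in combinations(items, size))
    rules = []
    for ante in antecedents:
        covered = [r for r in records if ante <= to_items(r)]
        for value in {r[target] for r in covered}:
            both = sum(1 for r in covered if r[target] == value)
            support = both / n                # share of all records matching the rule
            confidence = both / len(covered)  # share among records matching the antecedent
            if support >= min_support and confidence >= min_confidence:
                rules.append((ante, value, support, confidence))
    return rules


def predict(rules, record, target):
    """Use the matching rule with the highest confidence (ties: higher support)."""
    items = to_items(record, exclude=(target,))
    matching = [rule for rule in rules if rule[0] <= items]
    if not matching:
        return None
    return max(matching, key=lambda rule: (rule[3], rule[2]))[1]


def accuracy(rules, records, target):
    """Share of records whose target value the rules predict correctly."""
    hits = sum(1 for r in records if predict(rules, r, target) == r[target])
    return hits / len(records)


# Hypothetical records, for illustration only.
train = [
    {"sex": "female", "class": "first", "survived": "yes"},
    {"sex": "female", "class": "third", "survived": "yes"},
    {"sex": "female", "class": "first", "survived": "yes"},
    {"sex": "male", "class": "first", "survived": "no"},
    {"sex": "male", "class": "third", "survived": "no"},
    {"sex": "male", "class": "third", "survived": "no"},
]
test = [
    {"sex": "female", "class": "third", "survived": "yes"},
    {"sex": "male", "class": "first", "survived": "no"},
]

rules = mine_rules(train, target="survived")
print(accuracy(rules, test, target="survived"))
```

Brute-force enumeration is only practical for a handful of attributes; it is used here so that the definitions of support and confidence stay visible.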

However, the contingency table of the dataset, split for example by the age and the class of the passengers, suggests that the individual variables influence the variable of interest (survival) so strongly that the effects of the individual groups can cancel each other out in the dataset as a whole. It can therefore be more advantageous to split the dataset along these variables and to extract the association rules separately in each subset:

Figure 10.22. Contingency table of the dataset
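A contingency table of this kind can be computed with a few lines of Python. The sketch below is an assumption-laden illustration: the helper names are invented, and the records are hypothetical, not the actual Titanic counts.

```python
from collections import Counter


def contingency(records, row_attrs, col_attr):
    """Count records for every (row values, column value) combination."""
    return Counter(
        (tuple(r[a] for a in row_attrs), r[col_attr]) for r in records
    )


def row_rates(table, positive):
    """Share of the 'positive' column value within each row."""
    totals = Counter()
    for (row, _col), n in table.items():
        totals[row] += n
    return {row: table[(row, positive)] / total for row, total in totals.items()}


# Hypothetical records, for illustration only.
records = [
    {"age": "adult", "class": "third", "survived": "no"},
    {"age": "adult", "class": "third", "survived": "no"},
    {"age": "adult", "class": "third", "survived": "no"},
    {"age": "adult", "class": "third", "survived": "yes"},
    {"age": "child", "class": "first", "survived": "yes"},
    {"age": "child", "class": "first", "survived": "yes"},
    {"age": "child", "class": "third", "survived": "no"},
]

table = contingency(records, row_attrs=["age", "class"], col_attr="survived")
for row, share in sorted(row_rates(table, positive="yes").items()):
    print(row, share)
```

Rows whose survival shares differ strongly from one another are candidates for the split described above.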

To do this, for example when the dataset is split by the age of the passengers, the appropriate records first have to be filtered out. The variables used as filter conditions can then be removed as well, since within each subset they take a single value and therefore carry only redundant information:

Figure 10.23. Record filter usage

Figure 10.24. Removal of attributes that become redundant after filtering
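Outside RapidMiner, the two steps above (filtering the records, then removing the attribute used as the filter condition) can be combined in one small helper. This is a minimal sketch; the function name `split_by` and the records are hypothetical.

```python
def split_by(records, attr):
    """Group records by the value of attr and drop attr from each record."""
    subsets = {}
    for r in records:
        # Inside a subset, attr is constant, so it is left out.
        reduced = {k: v for k, v in r.items() if k != attr}
        subsets.setdefault(r[attr], []).append(reduced)
    return subsets


# Hypothetical records, for illustration only.
records = [
    {"age": "adult", "sex": "male", "survived": "no"},
    {"age": "adult", "sex": "female", "survived": "yes"},
    {"age": "child", "sex": "male", "survived": "yes"},
]

subsets = split_by(records, "age")
print(subsets["adult"])
print(subsets["child"])
```

Each subset can then be divided into a training set and a test set and processed independently of the others.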

Interpretation of the results

After this, training and test sets are created for the separate datasets of adults and children, the association rules are extracted from them, and their efficiency is evaluated. The subset of adults yielded the following results:

Figure 10.25. List of the association rules generated for the subset of adults

Figure 10.26. Performance vector for the application of association rules generated regarding survival for the subset of adults

The subset of children yielded the following results:

Figure 10.27. List of the association rules generated for the subset of children

Figure 10.28. Performance vector for the application of association rules generated regarding survival for the subset of children

It can be seen that performance can be improved considerably by splitting the dataset in this way, because the split removes the interference between the effects of the groups. For the group of children, the improvement in performance is much smaller, but this can be explained by the much smaller number of records in the subset.
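One way to see why a small subset gives less reliable results is the standard error of an accuracy estimate, which grows as the number of test records shrinks. The sketch below uses the usual binomial approximation; the accuracy value and the record counts are hypothetical and are not the figures of the Titanic subsets.

```python
import math


def accuracy_standard_error(accuracy, n):
    """Binomial standard error of an accuracy measured on n test records."""
    return math.sqrt(accuracy * (1 - accuracy) / n)


# Hypothetical test set sizes for a large and a small subset.
for n in (600, 30):
    print(n, accuracy_standard_error(0.8, n))
```

The same measured accuracy is therefore considerably more uncertain on the small subset than on the large one.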

Workflow

assoc_exp4.rmp

Keywords

frequent item sets
association rules
performance
support
Simpson's paradox

Operators

Apply Association Rules
Create Association Rules
Discretize by User Specification
Filter Examples
FP-Growth
Multiply
Nominal to Binominal
Performance
Read AML
Select Attributes
Set Role
Split Data