Case Studies in Data Mining

András Fülöp

e-Ventures Ltd.

László Gonda

University of Debrecen
Faculty of Informatics

Dr. Márton Ispány

University of Debrecen
Faculty of Informatics

Dr. Péter Jeszenszky

University of Debrecen
Faculty of Informatics

Dr. László Szathmáry

University of Debrecen
Faculty of Informatics

This curriculum was produced within the framework of the project with identifier TÁMOP-4.1.2.A/1-11/1-2011-0103.

2014


Table of Contents

Preface
How to use this material
I. Data mining tools
1. Commercial data mining software
2. Free and shareware data mining software
II. RapidMiner
3. Data Sources
Importing data from a CSV file
Importing data from an Excel file
Creating an AML file for reading a data file
Importing data from an XML file
Importing data from a database
4. Pre-processing
Managing data with issues - Missing, inconsistent, and duplicate values
Sampling and aggregation
Creating and filtering attributes
Discretizing and weighting attributes
5. Classification Methods 1
Classification using a decision tree
Under- and overfitting of a classification with a decision tree
Evaluation of performance for classification by decision tree
Evaluation of performance for classification by decision tree 2
Comparison of decision tree classifiers
6. Classification Methods 2
Using a rule-based classifier (1)
Using a rule-based classifier (2)
Transforming a decision tree to an equivalent rule set
7. Classification Methods 3
Linear regression
Classification using linear regression
Evaluation of performance for classification by regression model
Evaluation of performance for classification by regression model 2
8. Classification Methods 4
Using a perceptron for solving a linearly separable binary classification problem
Using a feed-forward neural network for solving a classification problem
The influence of the number of hidden neurons on the performance of the feed-forward neural network
Using a linear SVM for solving a linearly separable binary classification problem
The influence of the parameter C on the performance of the linear SVM (1)
The influence of the parameter C on the performance of the linear SVM (2)
The influence of the parameter C on the performance of the linear SVM (3)
The influence of the number of training examples on the performance of the linear SVM
Solving the two spirals problem by a nonlinear SVM
The influence of the kernel width parameter on the performance of the RBF kernel SVM
Search for optimal parameter values of the RBF kernel SVM
Using an SVM for solving a multi-class classification problem
Using an SVM for solving a regression problem
9. Classification Methods 5
Introducing ensemble methods: the bagging algorithm
The influence of the number of base classifiers on the performance of bagging
The influence of the number of base classifiers on the performance of the AdaBoost method
The influence of the number of base classifiers on the performance of the random forest
10. Association rules
Extraction of association rules
Extraction of association rules from a non-transactional data set
Evaluation of performance for association rules
Performance of association rules - Simpson's paradox
11. Clustering 1
K-means method
K-medoids method
The DBSCAN method
Agglomerative methods
Divisive methods
12. Clustering 2
Support vector clustering
Choosing parameters in clustering
Cluster evaluation
Centroid method
Text clustering
13. Anomaly detection
Searching for outliers
Unsupervised search for outliers
Unsupervised statistics-based anomaly detection
III. SAS® Enterprise Miner
14. Data Sources
Reading a SAS dataset
Importing data from a CSV file
Importing data from an Excel file
15. Preprocessing
Constructing metadata and automatic variable selection
Visualizing multidimensional data and dimension reduction by PCA
Replacement and imputation
16. Classification Methods 1
Classification by decision tree
Comparison and evaluation of decision tree classifiers
17. Classification Methods 2
Rule induction for the classification of rare events
18. Classification Methods 3
Logistic regression
Prediction of discrete target by regression models
19. Classification Methods 4
Solution of a linearly separable binary classification task by ANN and SVM
Using artificial neural networks (ANN)
Using support vector machines (SVM)
20. Classification Methods 5
Ensemble methods: Combination of classifiers
Ensemble methods: bagging
Ensemble methods: boosting
21. Association mining
Extracting association rules
22. Clustering 1
K-means method
Agglomerative hierarchical methods
Comparison of clustering methods
23. Clustering 2
Clustering attributes before fitting SVM
Self-organizing maps (SOM) and vector quantization (VQ)
24. Regression for continuous target
Logistic regression
Prediction of discrete target by regression models
Supervised models for continuous target
25. Anomaly detection
Detecting outliers
Bibliography

List of Figures

3.1. Metadata of the resulting ExampleSet.
3.2. A small excerpt of the resulting ExampleSet.
3.3. Metadata of the resulting ExampleSet.
3.4. A small excerpt of the resulting ExampleSet.
3.5. The resulting AML file.
3.6. A small excerpt of The World Bank: Population (Total) data set used in the experiment.
3.7. Metadata of the resulting ExampleSet.
3.8. A small excerpt of the resulting ExampleSet.
3.9. Metadata of the resulting ExampleSet.
3.10. A small excerpt of the resulting ExampleSet.
4.1. Graphic representation of the global and kitchen power consumption over time
4.2. Possible outliers based on the hypothesized habits of the members of the household
4.3. Filtering of the possible values using a record filter
4.4. Selection of aggregate functions for attributes
4.5. Preferences for dataset sampling
4.6. Preferences for dataset filtering
4.7. Resulting dataset after dataset sampling
4.8. Resulting dataset after dataset filtering
4.9. Defining a new attribute based on an expression relying on existing attributes
4.10. Properties of the operator used for removing the attributes made redundant
4.11. Selection of the attributes to remain in the dataset with reduced size
4.12. The appearance of the derived attribute in the altered dataset
4.13. Selection of the appropriate discretization operator
4.14. Setting the properties of the discretization operator
4.15. Selection of the appropriate weighting operator
4.16. Defining the weights of the individual attributes
4.17. Comparison of the weighted and unweighted dataset instances
5.1. Preferences for the building of the decision tree
5.2. Preferences for splitting the dataset into training and test sets
5.3. Setting the relative sizes of the data partitions
5.4. Graphic representation of the decision tree created
5.5. The classification of the records based on the decision tree
5.6. Setting a threshold for the maximal depth of the decision tree
5.7. Graphic representation of the decision tree created
5.8. Graphic representation of the classification of the records based on the decision tree
5.9. Graphic representation of the decision tree created with the increased maximal depth
5.10. Graphic representation of the classification of the records based on the decision tree with increased maximal depth
5.11. Graphic representation of the decision tree created with the further increased maximal depth
5.12. Graphic representation of the classification of the records based on the decision tree with further increased maximal depth
5.13. Preferences for the building of the decision tree
5.14. Graphic representation of the decision tree created
5.15. Graphic representation of the classification of the records based on the decision tree
5.16. Performance vector of the classification based on the decision tree
5.17. The modification of preferences for the building of the decision tree.
5.18. Graphic representation of the decision tree created with the modified preferences
5.19. Performance vector of the classification based on the decision tree created with the modified preferences
5.20. Settings for the sampling done in the validation operator
5.21. Subprocesses of the validation operator
5.22. Graphic representation of the decision tree created
5.23. Performance vector of the classification based on the decision tree
5.24. Settings of the cross-validation operator
5.25. Overall performance vector of the classifications done in the cross-validation operator
5.26. Overall performance vector of the classifications done in the cross-validation operator in the leave-one-out case
5.27. Preferences for the building of the decision tree based on the Gini-index criterion
5.28. Preferences for the building of the decision tree based on the gain ratio criterion
5.29. Graphic representation of the decision tree created based on the gain ratio criterion
5.30. Performance vector of the classification based on the decision tree built using the gain ratio criterion
5.31. Graphic representation of the decision tree created based on the Gini-index criterion
5.32. Performance vector of the classification based on the decision tree built using the Gini-index criterion
5.33. Settings of the operator for the comparison of ROC curves
5.34. Subprocess of the operator for the comparison of ROC curves
5.35. Comparison of the ROC curves of the two decision tree classifiers
6.1. The rule set of the rule-based classifier trained on the data set.
6.2. The classification accuracy of the rule-based classifier on the data set.
6.3. The rule set of the rule-based classifier.
6.4. The classification accuracy of the rule-based classifier on the training set.
6.5. The classification accuracy of the rule-based classifier on the test set.
6.6. The decision tree built on the data set.
6.7. The rule set equivalent of the decision tree.
6.8. The classification accuracy of the rule-based classifier on the data set.
7.1. Properties of the linear regression operator
7.2. The linear regression model yielded as a result
7.3. The class prediction values calculated based on the linear regression model
7.4. The subprocess of the classification by regression operator
7.5. The linear regression model yielded as a result
7.6. The class labels derived from the predictions calculated based on the regression model
7.7. The subprocess of the classification by regression operator
7.8. The linear regression model yielded as a result
7.9. The performance vector of the classification based on the regression model
7.10. The subprocess of the cross-validation by regression operator
7.11. The subprocess of the classification by regression operator
7.12. The linear regression model yielded as a result
7.13. The customizable properties of the cross-validation operator
7.14. The overall performance vector of the classifications done using the regression model defined in the cross-validation operator
7.15. The overall performance vector of the classifications done using the regression model defined in the cross-validation operator when using the leave-one-out method
8.1. A linearly separable subset of the Wine data set [UCI MLR] used in the experiment (2 of the total of 3 classes and 2 of the total of 13 attributes were selected).
8.2. The decision boundary of the perceptron.
8.3. The classification accuracy of the perceptron on the data set.
8.4. The classification accuracy of the neural network on the data set.
8.5. The average classification error rate obtained from 10-fold cross-validation against the number of hidden neurons.
8.6. A linearly separable subset of the Wine data set [UCI MLR] used in the experiment (2 of the total of 3 classes and 2 of the total of 13 attributes were selected).
8.7. The kernel model of the linear SVM.
8.8. The classification accuracy of the linear SVM on the data set.
8.9. A subset of the Wine data set used in the experiment (2 of the total of 3 classes and 2 of the total of 13 attributes were selected). Note that the classes are not linearly separable.
8.10. The classification error rate of the linear SVM against the value of the parameter C.
8.11. The number of support vectors against the value of the parameter C.
8.12. The average classification error rate of the linear SVM obtained from 10-fold cross-validation against the value of the parameter C, where the horizontal axis is scaled logarithmically.
8.13. The classification error rate of the linear SVM on the training and the test sets against the value of the parameter C.
8.14. The number of support vectors against the value of the parameter C.
8.15. The classification error rate of the linear SVM on the training and the test sets against the number of training examples.
8.16. The number of support vectors against the number of training examples.
8.17. CPU execution time needed to train the SVM against the number of training examples.
8.18. The Two Spirals data set
8.19. The R code that produces the data set and is executed by the Execute Script (R) operator of the R Extension.
8.20. The classification accuracy of the nonlinear SVM on the data set.
8.21. The classification error rates of the SVM on the training and the test sets against the value of the RBF kernel width parameter.
8.22. The optimal parameter values for the RBF kernel SVM.
8.23. The classification accuracy of the RBF kernel SVM trained on the entire data set using the optimal parameter values.
8.24. The kernel model of the linear SVM.
8.25. The classification accuracy of the linear SVM on the data set.
8.26. The optimal value of the gamma parameter for the RBF kernel SVM.
8.27. The average RMS error of the RBF kernel SVM obtained from 10-fold cross-validation against the value of the parameter gamma, where the horizontal axis is scaled logarithmically.
8.28. The kernel model of the optimal RBF kernel SVM.
8.29. Predictions provided by the optimal RBF kernel SVM against the observed values of the dependent variable.
9.1. The average classification error rate of a single decision stump obtained from 10-fold cross-validation.
9.2. The average classification error rate of the bagging algorithm obtained from 10-fold cross-validation, where 10 decision stumps were used as base classifiers.
9.3. The average classification error rate obtained from 10-fold cross-validation against the number of base classifiers.
9.4. The average classification error rate obtained from 10-fold cross-validation against the number of base classifiers.
9.5. The average error rate of the random forest obtained from 10-fold cross-validation against the number of base classifiers.
10.1. List of the frequent item sets generated
10.2. List of the association rules generated
10.3. Graphic representation of the association rules generated
10.4. Operator preferences for the necessary data conversion
10.5. Converted version of the dataset
10.6. List of the frequent item sets generated
10.7. List of the association rules generated
10.8. Operator preferences for the appropriate data conversion
10.9. The appropriate converted version of the dataset
10.10. Enhanced list of the frequent item sets generated
10.11. List of the association rules generated
10.12. Graphic representation of the association rules generated
10.13. Operator preferences for the necessary data conversion
10.14. Label role assignment for performance evaluation
10.15. Prediction role assignment for performance evaluation
10.16. Operator preferences for the data conversion necessary for evaluation
10.17. Graphic representation of the association rules generated regarding survival
10.18. List of the association rules generated regarding survival
10.19. Performance vector for the application of association rules generated
10.20. List of the association rules generated regarding survival
10.21. Performance vector for the application of association rules generated
10.22. Contingency table of the dataset
10.23. Record filter usage
10.24. Removal of attributes that become redundant after filtering
10.25. List of the association rules generated for the subset of adults
10.26. Performance vector for the application of association rules generated regarding survival for the subset of adults
10.27. List of the association rules generated for the subset of children
10.28. Performance vector for the application of association rules generated regarding survival for the subset of children
11.1. The 7 separate groups
11.2. Clustering with default values
11.3. Set the distance function.
11.4. Clustering with Mahalanobis distance function
11.5. The dataset
11.6. Setting the parameters of the clustering
11.7. The clusters produced by the analysis
11.8. The groups with varying density
11.9. The results of the method with default parameters
11.10. The 15 groups
11.11. The resulting dendrogram
11.12. The clustering generated from the dendrogram
11.13. The 600 two-dimensional vectors
11.14. The subprocess
11.15. The report generated by the clustering
11.16. The output of the analysis
12.1. The two groups
12.2. Support vector clustering with polynomial kernel and p=0.21 setup
12.3. Unsuccessful clustering
12.4. Clustering with RBF kernel
12.5. More promising results
12.6. The two groups containing 240 vectors
12.7. The subprocess of the optimization node
12.8. The parameters of the optimization
12.9. The report generated by the process
12.10. Clustering generated with the optimal parameters
12.11. The 788 vectors
12.12. The evaluating subprocess
12.13. Setting up the parameters
12.14. Parameters to log
12.15. Cluster density against the number of clusters k
12.16. Item distribution against the number of clusters k
12.17. The vectors forming 31 clusters
12.18. The extracted centroids
12.19. The output of the k nearest neighbour method, using the centroids as prototypes
12.20. The preprocessing subprocess
12.21. The clustering setup
12.22. The confusion matrix of the results
13.1. Graphic representation of the possible outliers
13.2. The number of outliers detected as the distance limit grows
13.3. Nearest neighbour based operators in the Anomaly Detection package
13.4. Settings of LOF.
13.5. Outlier scores assigned to the individual records based on k nearest neighbours
13.6. Outlier scores assigned to the individual records based on LOF
13.7. Filtering the records based on their outlier scores
13.8. The dataset filtered based on the k-NN score
13.9. The dataset filtered based on the LOF score
13.10. Global settings for Histogram-based Outlier Score
13.11. Column-level settings for Histogram-based Outlier Score
13.12. Scores and attribute binning for fixed binwidth and arbitrary number of bins
13.13. Graphic representation of outlier scores
13.14. Scores and attribute binning for dynamic binwidth and arbitrary number of bins
13.15. Graphic representation of the enhanced outlier scores
14.1. The metadata of the dataset
14.2. Setting the Sample operator
14.3. The metadata of the resulting dataset and a part of the dataset
14.4. The list of files in the File Import operator
14.5. The parameters of the File Import operator
14.6. A small portion of the dataset
14.7. The metadata of the resulting dataset
14.8. A small portion of the resulting dataset
15.1. Metadata produced by the DMDB operator
15.2. The settings of Variable Selection operator
15.3. List of variables after the selection
15.4. Sequential R-square plot
15.5. The binary target variable as a function of the two most important input attributes after variable selection
15.6. Displaying the dataset by parallel axes
15.7. Cumulative explained variance plot of the PCA
15.8. Scatterplot of the Iris dataset using the first two principal components
15.9. The replacement wizard
15.10. The output of imputation
15.11. The relationship of an input and the target variable before imputation
15.12. The relationship of an input and the target variable after imputation
16.1. The settings of dataset partitioning
16.2. The decision tree
16.3. The response curve of the decision tree
16.4. Fitting statistics of the decision tree
16.5. The classification chart of the decision tree
16.6. The cumulative lift curve of the decision tree
16.7. The importance of attributes
16.8. The settings of parameters in the partitioning step
16.9. The decision tree using the chi-square impurity measure
16.10. The decision tree using the entropy impurity measure
16.11. The decision tree using the Gini-index
16.12. The cumulative response curve of decision trees
16.13. The classification plot
16.14. Response curve of decision trees
16.15. The score distribution of decision trees
16.16. The main statistics of decision trees
17.1. The misclassification rate of rule induction
17.2. The classification matrix of rule induction
17.3. The classification chart of rule induction
17.4. The ROC curves of rule induction and the decision tree
17.5. The output of the rule induction operator
18.1. Classification matrix of the logistic regression
18.2. Effects plot of the logistic regression
18.3. Classification matrix of the stepwise logistic regression
18.4. Effects plot of the stepwise logistic regression
18.5. Fitting statistics for logistic regression models
18.6. Classification charts of the logistic regression models
18.7. Cumulative lift curves of the logistic regression models
18.8. ROC curves of the logistic regression models
18.9. Classification matrix of the logistic regression
18.10. The classification chart of the logistic regression
18.11. The effects plot of the logistic regression
19.1. A linearly separable subset of the Wine dataset
19.2. Fitting statistics for perceptron
19.3. The classification matrix of the perceptron
19.4. The cumulative lift curve of the perceptron
19.5. Fitting statistics for SVM
19.6. The classification matrix of SVM
19.7. The cumulative lift curve of SVM
19.8. List of the support vectors
19.9. Fitting statistics of the multilayer perceptron
19.10. The classification matrix of the multilayer perceptron
19.11. The cumulative lift curve of the multilayer perceptron
19.12. Weights of the multilayer perceptron
19.13. Training curve of the multilayer perceptron
19.14. Stepwise optimization statistics for the DMNeural operator
19.15. Weights of the neurons of the neural network obtained with the AutoNeural operator
19.16. Fitting statistics of neural networks
19.17. Classification charts of neural networks
19.18. Cumulative lift curves of neural networks
19.19. ROC curves of neural networks
19.20. Fitting statistics for linear kernel SVM
19.21. The classification matrix of linear kernel SVM
19.22. Support vectors (partly) of linear kernel SVM
19.23. The distribution of Lagrange multipliers for linear kernel SVM
19.24. The parameters of polynomial kernel SVM
19.25. Fitting statistics for polynomial kernel SVM
19.26. Classification matrix of polynomial kernel SVM
19.27. Support vectors (partly) of polynomial kernel SVM
19.28. Fitting statistics for SVMs
19.29. The classification chart of SVMs
19.30. Cumulative lift curves of SVMs
19.31. Comparison of cumulative lift curves to the baseline and the optimal one
19.32. ROC curves of SVMs
20.1. Fitting statistics of the ensemble classifier
20.2. The classification matrix of the ensemble classifier
20.3. The cumulative lift curve of the ensemble classifier
20.4. Misclassification rates of the ensemble classifier and the SVM
20.5. Classification matrices of the ensemble classifier and the SVM
20.6. Cumulative lift curves of the ensemble classifier and the SVM
20.7. Cumulative lift curves of the ensemble classifier, the SVM and the best theoretical model
20.8. ROC curves of the ensemble classifier and the SVM
20.9. The classification matrix of the bagging classifier
20.10. The error curves of the bagging classifier
20.11. Misclassification rates of the bagging classifier and the decision tree
20.12. Classification matrices of the bagging classifier and the decision tree
20.13. Response curves of the bagging classifier and the decision tree
20.14. Response curves of the bagging classifier and the decision tree compared with the baseline and the optimal classifiers
20.15. ROC curves of the bagging classifier and the decision tree
20.16. The classification matrix of the boosting classifier
20.17. The error curve of the boosting classifier
20.18. Misclassification rates of the boosting classifier and the SVM
20.19. Classification matrices for the boosting classifier and the SVM
20.20. Cumulative response curves of the boosting classifier and the SVM
20.21. Response curves of the boosting classifier and the SVM compared with the baseline and the optimal classifiers
20.22. ROC curves of the boosting classifier and the SVM
21.1. List of items
21.2. The association rules as a function of the support and the confidence
21.3. Graph of lift values
21.4. List of association rules
22.1. The Aggregation dataset.
22.2. The setting of the Cluster operator.
22.3. The result of K-means clustering when K=7
22.4. The setting of the MacQueen clustering
22.5. The result of the MacQueen clustering
22.6. The result of the clustering with 8 clusters
22.7. The result display of the Cluster operator
22.8. Scatterplot of the cluster means
22.9. The decision tree of the clustering
22.10. Scatterplot of the Maximum Variance (R15) dataset
22.11. The result of the average linkage hierarchical clustering
22.12. Evaluation of the clustering by a 3D bar chart
22.13. The result of Ward clustering
22.14. CCC plot of automatic clustering
22.15. Proximity graph of the automatic clustering
22.16. The Maximum Variance (D31) dataset
22.17. The result of automatic clustering
22.18. The CCC plot of automatic clustering
22.19. The proximity graph of the automatic clustering
22.20. The result of K-means clustering
22.21. The proximity graph of K-means clustering
22.22. The profile of the segments (clusters)
23.1. The dendrogram of attribute clustering
23.2. The graph of clusters and attributes
23.3. The cluster membership
23.4. The correlation plot of the attributes
23.5. The correlation between clusters and an attribute
23.6. Classification charts of SVM models
23.7. The response curve of SVM models
23.8. Cumulative lift curves of SVM models
23.9. The ROC curves of SVM models
23.10. The scatterplot of the Maximum Variance (R15) dataset
23.11. The result of Kohonen's vector quantization
23.12. The pie chart of cluster size
23.13. Statistics of clusters
23.14. Graphical representation of the SOM
23.15. Scatterplot of the result of SOM
24.1. Classification matrix of the logistic regression
24.2. Effects plot of the logistic regression
24.3. Classification matrix of the stepwise logistic regression
24.4. Effects plot of the stepwise logistic regression
24.5. Fitting statistics for logistic regression models
24.6. Classification charts of the logistic regression models
24.7. Cumulative lift curves of the logistic regression models
24.8. ROC curves of the logistic regression models
24.9. Classification matrix of the logistic regression
24.10. The classification chart of the logistic regression
24.11. The effects plot of the logistic regression
24.12. Statistics of the fitted models on the test dataset
24.13. Comparison of the fitted models by means of predictions
24.14. The observed and predicted means plot
24.15. The model scores
24.16. The decision tree for continuous target
24.17. The weights of the neural network after training
25.1. Statistics before and after filtering outliers
25.2. The predicted mean based on the two decision trees
25.3. The tree map of the best model
25.4. Comparison of the two fitted decision trees