Copyright © 2014 Faculty of Informatics, University of Debrecen


**Table of Contents**

- Preface
- I. Data mining tools
- II. RapidMiner
- 3. Data Sources
- 4. Pre-processing
- 5. Classification Methods 1
- 6. Classification Methods 2
- 7. Classification Methods 3
- 8. Classification Methods 4
- Using a perceptron for solving a linearly separable binary classification problem
- Using a feed-forward neural network for solving a classification problem
- The influence of the number of hidden neurons on the performance of the feed-forward neural network
- Using a linear SVM for solving a linearly separable binary classification problem
- The influence of the parameter C on the performance of the linear SVM (1)
- The influence of the parameter C on the performance of the linear SVM (2)
- The influence of the parameter C on the performance of the linear SVM (3)
- The influence of the number of training examples on the performance of the linear SVM
- Solving the two spirals problem by a nonlinear SVM
- The influence of the kernel width parameter on the performance of the RBF kernel SVM
- Search for optimal parameter values of the RBF kernel SVM
- Using an SVM for solving a multi-class classification problem
- Using an SVM for solving a regression problem

- 9. Classification Methods 5
- Introducing ensemble methods: the bagging algorithm
- The influence of the number of base classifiers on the performance of bagging
- The influence of the number of base classifiers on the performance of the AdaBoost method
- The influence of the number of base classifiers on the performance of the random forest

- 10. Association rules
- 11. Clustering 1
- 12. Clustering 2
- 13. Anomaly detection

- III. SAS® Enterprise Miner™
- Bibliography

**List of Figures**

- 3.1. Metadata of the resulting ExampleSet.
- 3.2. A small excerpt of the resulting ExampleSet.
- 3.3. Metadata of the resulting ExampleSet.
- 3.4. A small excerpt of the resulting ExampleSet.
- 3.5. The resulting AML file.
- 3.6. A small excerpt of *The World Bank: Population (Total)* data set used in the experiment.
- 3.7. Metadata of the resulting ExampleSet.
- 3.8. A small excerpt of the resulting ExampleSet.
- 3.9. Metadata of the resulting ExampleSet.
- 3.10. A small excerpt of the resulting ExampleSet.
- 4.1. Graphic representation of the global and kitchen power consumption over time
- 4.2. Possible outliers based on the hypothesized habits of the members of the household
- 4.3. Filtering of the possible values using a record filter
- 4.4. Selection of aggregate functions for attributes
- 4.5. Preferences for dataset sampling
- 4.6. Preferences for dataset filtering
- 4.7. Resulting dataset after dataset sampling
- 4.8. Resulting dataset after dataset filtering
- 4.9. Defining a new attribute based on an expression relying on existing attributes
- 4.10. Properties of the operator used for removing the attributes made redundant
- 4.11. Selection of the attributes to remain in the dataset with reduced size
- 4.12. The appearance of the derived attribute in the altered dataset
- 4.13. Selection of the appropriate discretization operator
- 4.14. Setting the properties of the discretization operator
- 4.15. Selection of the appropriate weighting operator
- 4.16. Defining the weights of the individual attributes
- 4.17. Comparison of the weighted and unweighted dataset instances
- 5.1. Preferences for the building of the decision tree
- 5.2. Preferences for splitting the dataset into training and test sets
- 5.3. Setting the relative sizes of the data partitions
- 5.4. Graphic representation of the decision tree created
- 5.5. The classification of the records based on the decision tree
- 5.6. Setting a threshold for the maximal depth of the decision tree
- 5.7. Graphic representation of the decision tree created
- 5.8. Graphic representation of the classification of the records based on the decision tree
- 5.9. Graphic representation of the decision tree created with the increased maximal depth
- 5.10. Graphic representation of the classification of the records based on the decision tree with increased maximal depth
- 5.11. Graphic representation of the decision tree created with the further increased maximal depth
- 5.12. Graphic representation of the classification of the records based on the decision tree with further increased maximal depth
- 5.13. Preferences for the building of the decision tree
- 5.14. Graphic representation of the decision tree created
- 5.15. Graphic representation of the classification of the records based on the decision tree
- 5.16. Performance vector of the classification based on the decision tree
- 5.17. The modification of preferences for the building of the decision tree.
- 5.18. Graphic representation of the decision tree created with the modified preferences
- 5.19. Performance vector of the classification based on the decision tree created with the modified preferences
- 5.20. Settings for the sampling done in the validation operator
- 5.21. Subprocesses of the validation operator
- 5.22. Graphic representation of the decision tree created
- 5.23. Performance vector of the classification based on the decision tree
- 5.24. Settings of the cross-validation operator
- 5.25. Overall performance vector of the classifications done in the cross-validation operator
- 5.26. Overall performance vector of the classifications done in the cross-validation operator in the leave-one-out case
- 5.27. Preferences for the building of the decision tree based on the Gini-index criterion
- 5.28. Preferences for the building of the decision tree based on the gain ratio criterion
- 5.29. Graphic representation of the decision tree created based on the gain ratio criterion
- 5.30. Performance vector of the classification based on the decision tree built using the gain ratio criterion
- 5.31. Graphic representation of the decision tree created based on the Gini-index criterion
- 5.32. Performance vector of the classification based on the decision tree built using the Gini-index criterion
- 5.33. Settings of the operator for the comparison of ROC curves
- 5.34. Subprocess of the operator for the comparison of ROC curves
- 5.35. Comparison of the ROC curves of the two decision tree classifiers
- 6.1. The rule set of the rule-based classifier trained on the data set.
- 6.2. The classification accuracy of the rule-based classifier on the data set.
- 6.3. The rule set of the rule-based classifier.
- 6.4. The classification accuracy of the rule-based classifier on the training set.
- 6.5. The classification accuracy of the rule-based classifier on the test set.
- 6.6. The decision tree built on the data set.
- 6.7. The rule set equivalent of the decision tree.
- 6.8. The classification accuracy of the rule-based classifier on the data set.
- 7.1. Properties of the linear regression operator
- 7.2. The linear regression model yielded as a result
- 7.3. The class prediction values calculated based on the linear regression model
- 7.4. The subprocess of the classification by regression operator
- 7.5. The linear regression model yielded as a result
- 7.6. The class labels derived from the predictions calculated based on the regression model
- 7.7. The subprocess of the classification by regression operator
- 7.8. The linear regression model yielded as a result
- 7.9. The performance vector of the classification based on the regression model
- 7.10. The subprocess of the cross-validation by regression operator
- 7.11. The subprocess of the classification by regression operator
- 7.12. The linear regression model yielded as a result
- 7.13. The customizable properties of the cross-validation operator
- 7.14. The overall performance vector of the classifications done using the regression model defined in the cross-validation operator
- 7.15. The overall performance vector of the classifications done using the regression model defined in the cross-validation operator for the case of using the leave-one-out method
- 8.1. A linearly separable subset of the *Wine* data set [UCI MLR] used in the experiment (2 of the total of 3 classes and 2 of the total of 13 attributes were selected).
- 8.2. The decision boundary of the perceptron.
- 8.3. The classification accuracy of the perceptron on the data set.
- 8.4. The classification accuracy of the neural network on the data set.
- 8.5. The average classification error rate obtained from 10-fold cross-validation against the number of hidden neurons.
- 8.6. A linearly separable subset of the *Wine* data set [UCI MLR] used in the experiment (2 of the total of 3 classes and 2 of the total of 13 attributes were selected).
- 8.7. The kernel model of the linear SVM.
- 8.8. The classification accuracy of the linear SVM on the data set.
- 8.9. A subset of the *Wine* data set used in the experiment (2 of the total of 3 classes and 2 of the total of 13 attributes were selected). Note that the classes are not linearly separable.
- 8.10. The classification error rate of the linear SVM against the value of the parameter `C`.

- 8.11. The number of support vectors against the value of the parameter `C`.

- 8.12. The average classification error rate of the linear SVM obtained from 10-fold cross-validation against the value of the parameter `C`, where the horizontal axis is scaled logarithmically.

- 8.13. The classification error rate of the linear SVM on the training and the test sets against the value of the parameter `C`.

- 8.14. The number of support vectors against the value of the parameter `C`.

- 8.15. The classification error rate of the linear SVM on the training and the test sets against the number of training examples.
- 8.16. The number of support vectors against the number of training examples.
- 8.17. CPU execution time needed to train the SVM against the number of training examples.
- 8.18. The *Two Spirals* data set
- 8.19. The R code that produces the data set and is executed by the `Execute Script (R)` operator of the R Extension.
- 8.20. The classification accuracy of the nonlinear SVM on the data set.
- 8.21. The classification error rates of the SVM on the training and the test sets against the value of RBF kernel width parameter.
- 8.22. The optimal parameter values for the RBF kernel SVM.
- 8.23. The classification accuracy of the RBF kernel SVM trained on the entire data set using the optimal parameter values.
- 8.24. The kernel model of the linear SVM.
- 8.25. The classification accuracy of the linear SVM on the data set.
- 8.26. The optimal value of the `gamma` parameter for the RBF kernel SVM.

- 8.27. The average RMS error of the RBF kernel SVM obtained from 10-fold cross-validation against the value of the parameter `gamma`, where the horizontal axis is scaled logarithmically.

- 8.28. The kernel model of the optimal RBF kernel SVM.
- 8.29. Predictions provided by the optimal RBF kernel SVM against the observed values of the dependent variable.
- 9.1. The average classification error rate of a single decision stump obtained from 10-fold cross-validation.
- 9.2. The average classification error rate of the bagging algorithm obtained from 10-fold cross-validation, where 10 decision stumps were used as base classifiers.
- 9.3. The average classification error rate obtained from 10-fold cross-validation against the number of base classifiers.
- 9.4. The average classification error rate obtained from 10-fold cross-validation against the number of base classifiers.
- 9.5. The average error rate of the random forest obtained from 10-fold cross-validation against the number of base classifiers.
- 10.1. List of the frequent item sets generated
- 10.2. List of the association rules generated
- 10.3. Graphic representation of the association rules generated
- 10.4. Operator preferences for the necessary data conversion
- 10.5. Converted version of the dataset
- 10.6. List of the frequent item sets generated
- 10.7. List of the association rules generated
- 10.8. Operator preferences for the appropriate data conversion
- 10.9. The appropriate converted version of the dataset
- 10.10. Enhanced list of the frequent item sets generated
- 10.11. List of the association rules generated
- 10.12. Graphic representation of the association rules generated
- 10.13. Operator preferences for the necessary data conversion
- 10.14. Label role assignment for performance evaluation
- 10.15. Prediction role assignment for performance evaluation
- 10.16. Operator preferences for the data conversion necessary for evaluation
- 10.17. Graphic representation of the association rules generated regarding survival
- 10.18. List of the association rules generated regarding survival
- 10.19. Performance vector for the application of association rules generated
- 10.20. List of the association rules generated regarding survival
- 10.21. Performance vector for the application of association rules generated
- 10.22. Contingency table of the dataset
- 10.23. Record filter usage
- 10.24. Removal of attributes that become redundant after filtering
- 10.25. List of the association rules generated for the subset of adults
- 10.26. Performance vector for the application of association rules generated regarding survival for the subset of adults
- 10.27. List of the association rules generated for the subset of children
- 10.28. Performance vector for the application of association rules generated regarding survival for the subset of children
- 11.1. The 7 separate groups
- 11.2. Clustering with default values
- 11.3. Set the distance function.
- 11.4. Clustering with Mahalanobis distance function
- 11.5. The dataset
- 11.6. Setting the parameters of the clustering
- 11.7. The clusters produced by the analysis
- 11.8. The groups with varying density
- 11.9. The results of the method with default parameters
- 11.10. The 15 groups
- 11.11. The resulting dendrogram
- 11.12. The clustering generated from the dendrogram
- 11.13. The 600 two-dimensional vectors
- 11.14. The subprocess
- 11.15. The report generated by the clustering
- 11.16. The output of the analysis
- 12.1. The two groups
- 12.2. Support vector clustering with a polynomial kernel and `p`=0.21

- 12.3. Unsuccessful clustering
- 12.4. Clustering with RBF kernel
- 12.5. More promising results
- 12.6. The two groups containing 240 vectors
- 12.7. The subprocess of the optimization node
- 12.8. The parameters of the optimization
- 12.9. The report generated by the process
- 12.10. Clustering generated with the optimal parameters
- 12.11. The 788 vectors
- 12.12. The evaluating subprocess
- 12.13. Setting up the parameters
- 12.14. Parameters to log
- 12.15. Cluster density against the number of clusters `k`

- 12.16. Item distribution against the number of clusters `k`

- 12.17. The vectors forming 31 clusters
- 12.18. The extracted centroids
- 12.19. The output of the k nearest neighbour method, using the centroids as prototypes
- 12.20. The preprocessing subprocess
- 12.21. The clustering setup
- 12.22. The confusion matrix of the results
- 13.1. Graphic representation of the possible outliers
- 13.2. The number of outliers detected as the distance limit grows
- 13.3. Nearest neighbour based operators in the Anomaly Detection package
- 13.4. Settings of LOF.
- 13.5. Outlier scores assigned to the individual records based on k nearest neighbours
- 13.6. Outlier scores assigned to the individual records based on LOF
- 13.7. Filtering the records based on their outlier scores
- 13.8. The dataset filtered based on the k-NN score
- 13.9. The dataset filtered based on the LOF score
- 13.10. Global settings for Histogram-based Outlier Score
- 13.11. Column-level settings for Histogram-based Outlier Score
- 13.12. Scores and attribute binning for fixed bin width and an arbitrary number of bins
- 13.13. Graphic representation of outlier scores
- 13.14. Scores and attribute binning for dynamic bin width and an arbitrary number of bins
- 13.15. Graphic representation of the enhanced outlier scores
- 14.1. The metadata of the dataset
- 14.2. Setting the `Sample` operator
- 14.3. The metadata of the resulting dataset and a part of the dataset
- 14.4. The list of files in the `File Import` operator
- 14.5. The parameters of the `File Import` operator
- 14.6. A small portion of the dataset
- 14.7. The metadata of the resulting dataset
- 14.8. A small portion of the resulting dataset
- 15.1. Metadata produced by the `DMDB` operator
- 15.2. The settings of the `Variable Selection` operator
- 15.3. List of variables after the selection
- 15.4. Sequential R-square plot
- 15.5. The binary target variable as a function of the two most important input attributes after the variable selection
- 15.6. Displaying the dataset by parallel axis
- 15.7. Cumulative explained variance plot of the PCA
- 15.8. Scatterplot of the Iris dataset using the first two principal components
- 15.9. The replacement wizard
- 15.10. The output of imputation
- 15.11. The relationship of an input and the target variable before imputation
- 15.12. The relationship of an input and the target variable after imputation
- 16.1. The settings of dataset partitioning
- 16.2. The decision tree
- 16.3. The response curve of the decision tree
- 16.4. Fitting statistics of the decision tree
- 16.5. The classification chart of the decision tree
- 16.6. The cumulative lift curve of the decision tree
- 16.7. The importance of attributes
- 16.8. The settings of parameters in the partitioning step
- 16.9. The decision tree using the chi-square impurity measure
- 16.10. The decision tree using the entropy impurity measure
- 16.11. The decision tree using the Gini-index
- 16.12. The cumulative response curve of decision trees
- 16.13. The classification plot
- 16.14. Response curve of decision trees
- 16.15. The score distribution of decision trees
- 16.16. The main statistics of decision trees
- 17.1. The misclassification rate of rule induction
- 17.2. The classification matrix of rule induction
- 17.3. The classification chart of rule induction
- 17.4. The ROC curves of rule inductions and decision tree
- 17.5. The output of the rule induction operator
- 18.1. Classification matrix of the logistic regression
- 18.2. Effects plot of the logistic regression
- 18.3. Classification matrix of the stepwise logistic regression
- 18.4. Effects plot of the stepwise logistic regression
- 18.5. Fitting statistics for logistic regression models
- 18.6. Classification charts of the logistic regression models
- 18.7. Cumulative lift curve of the logistic regression models
- 18.8. ROC curves of the logistic regression models
- 18.9. Classification matrix of the logistic regression
- 18.10. The classification chart of the logistic regression
- 18.11. The effects plot of the logistic regression
- 19.1. A linearly separable subset of the *Wine* dataset
- 19.2. Fitting statistics for the perceptron
- 19.3. The classification matrix of the perceptron
- 19.4. The cumulative lift curve of the perceptron
- 19.5. Fitting statistics for SVM
- 19.6. The classification matrix of SVM
- 19.7. The cumulative lift curve of SVM
- 19.8. List of the support vectors
- 19.9. Fitting statistics of the multilayer perceptron
- 19.10. The classification matrix of the multilayer perceptron
- 19.11. The cumulative lift curve of the multilayer perceptron
- 19.12. Weights of the multilayer perceptron
- 19.13. Training curve of the multilayer perceptron
- 19.14. Stepwise optimization statistics for the `DMNeural` operator
- 19.15. Weights of the neurons of the network obtained with the `AutoNeural` operator
- 19.16. Fitting statistics of neural networks
- 19.17. Classification charts of neural networks
- 19.18. Cumulative lift curves of neural networks
- 19.19. ROC curves of neural networks
- 19.20. Fitting statistics for linear kernel SVM
- 19.21. The classification matrix of linear kernel SVM
- 19.22. Support vectors (partly) of linear kernel SVM
- 19.23. The distribution of Lagrange multipliers for linear kernel SVM
- 19.24. The parameters of polynomial kernel SVM
- 19.25. Fitting statistics for polynomial kernel SVM
- 19.26. Classification matrix of polynomial kernel SVM
- 19.27. Support vectors (partly) of polynomial kernel SVM
- 19.28. Fitting statistics for SVMs
- 19.29. The classification chart of SVMs
- 19.30. Cumulative lift curves of SVMs
- 19.31. Comparison of cumulative lift curves to the baseline and the optimal one
- 19.32. ROC curves of SVMs
- 20.1. Fitting statistics of the ensemble classifier
- 20.2. The classification matrix of the ensemble classifier
- 20.3. The cumulative lift curve of the ensemble classifier
- 20.4. Misclassification rates of the ensemble classifier and the SVM
- 20.5. Classification matrices of the ensemble classifier and the SVM
- 20.6. Cumulative lift curves of the ensemble classifier and the SVM
- 20.7. Cumulative lift curves of the ensemble classifier, the SVM and the best theoretical model
- 20.8. ROC curves of the ensemble classifier and the SVM
- 20.9. The classification matrix of the bagging classifier
- 20.10. The error curves of the bagging classifier
- 20.11. Misclassification rates of the bagging classifier and the decision tree
- 20.12. Classification matrices of the bagging classifier and the decision tree
- 20.13. Response curves of the bagging classifier and the decision tree
- 20.14. Response curves of the bagging classifier and the decision tree comparing the baseline and the optimal classifiers
- 20.15. ROC curves of the bagging classifier and the decision tree
- 20.16. The classification matrix of the boosting classifier
- 20.17. The error curve of the boosting classifier
- 20.18. Misclassification rates of the boosting classifier and the SVM
- 20.19. Classification matrices for the boosting classifier and the SVM
- 20.20. Cumulative response curves of the boosting classifier and the SVM
- 20.21. Response curves of the boosting classifier and the SVM comparing the baseline and the optimal classifiers
- 20.22. ROC curves of the boosting classifier and the SVM
- 21.1. List of items
- 21.2. The association rules as a function of the support and the reliability
- 21.3. Graph of lift values
- 21.4. List of association rules
- 22.1. The *Aggregation* dataset.
- 22.2. The setting of the `Cluster` operator.
- 22.3. The result of K-means clustering when K=7
- 22.4. The setting of the MacQueen clustering
- 22.5. The result of the MacQueen clustering
- 22.6. The result of the clustering with 8 clusters
- 22.7. The result display of the `Cluster` operator
- 22.8. Scatterplot of the cluster means
- 22.9. The decision tree of the clustering
- 22.10. Scatterplot of the *Maximum Variance (R15)* dataset
- 22.11. The result of the average linkage hierarchical clustering
- 22.12. Evaluation of the clustering using a 3D bar chart
- 22.13. The result of Ward clustering
- 22.14. CCC plot of automatic clustering
- 22.15. Proximity graph of the automatic clustering
- 22.16. The *Maximum Variance (D31)* dataset
- 22.17. The result of automatic clustering
- 22.18. The CCC plot of automatic clustering
- 22.19. The proximity graph of the automatic clustering
- 22.20. The result of K-means clustering
- 22.21. The proximity graph of K-means clustering
- 22.22. The profile of the segments (clusters)
- 23.1. The dendrogram of attribute clustering
- 23.2. The graph of clusters and attributes
- 23.3. The cluster membership
- 23.4. The correlation plot of the attributes
- 23.5. The correlation between clusters and an attribute
- 23.6. Classification charts of SVM models
- 23.7. The response curve of SVM models
- 23.8. The cumulative lift curves of the SVM models
- 23.9. The ROC curves of SVM models
- 23.10. The scatterplot of the *Maximum Variance (R15)* dataset
- 23.11. The result of Kohonen's vector quantization
- 23.12. The pie chart of cluster size
- 23.13. Statistics of clusters
- 23.14. Graphical representation of the SOM
- 23.15. Scatterplot of the result of SOM
- 24.1. Classification matrix of the logistic regression
- 24.2. Effects plot of the logistic regression
- 24.3. Classification matrix of the stepwise logistic regression
- 24.4. Effects plot of the stepwise logistic regression
- 24.5. Fitting statistics for logistic regression models
- 24.6. Classification charts of the logistic regression models
- 24.7. Cumulative lift curve of the logistic regression models
- 24.8. ROC curves of the logistic regression models
- 24.9. Classification matrix of the logistic regression
- 24.10. The classification chart of the logistic regression
- 24.11. The effects plot of the logistic regression
- 24.12. Statistics of the fitted models on the test dataset
- 24.13. Comparison of the fitted models by means of predictions
- 24.14. The observed and predicted means plot
- 24.15. The model scores
- 24.16. The decision tree for continuous target
- 24.17. The weights of the neural network after training
- 25.1. Statistics before and after filtering outliers
- 25.2. The predicted mean based on the two decision trees
- 25.3. The tree map of the best model
- 25.4. Comparison of the two fitted decision trees