Chapter 15. Preprocessing

Table of Contents

Constructing metadata and automatic variable selection
Visualizing multidimensional data and dimension reduction by PCA
Replacement and imputation

Constructing metadata and automatic variable selection

Description

This process uses the Spambase dataset to illustrate how the DMDB operator generates the metadata of a dataset and how the Variable Selection operator then performs automatic variable selection. The Spambase dataset contains 58 attributes, one of which is the binary target. To visualize such a dataset, it may be necessary to first determine the most important input attributes, which can then be used in the graphical representation.

Input

Spambase [UCI MLR]

Output

The DMDB operator produces metadata (descriptive statistics) such as the mean, variance, minimum, maximum, skewness, and kurtosis. For discrete attributes, these are complemented by the mode.

Figure 15.1. Metadata produced by the DMDB operator
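
To make the listed statistics concrete, the following Python sketch computes them for a single column. It is not the DMDB implementation: the function names describe and mode, as well as the toy values, are assumptions made for illustration, and the skewness and kurtosis use the plain moment-based definitions, which may differ from the bias-corrected estimators a statistical package reports.

```python
from collections import Counter


def describe(values):
    """Summary statistics for a numeric column (needs n > 1 and non-constant values)."""
    n = len(values)
    mean = sum(values) / n
    # Central moments in population form (dividing by n).
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return {
        "mean": mean,
        "variance": m2 * n / (n - 1),    # sample variance (divides by n - 1)
        "minimum": min(values),
        "maximum": max(values),
        "skewness": m3 / m2 ** 1.5,      # moment-based; 0 for a symmetric column
        "kurtosis": m4 / m2 ** 2 - 3.0,  # excess kurtosis; 0 for a normal distribution
    }


def mode(values):
    """Most frequent level of a discrete column (the first level seen wins ties)."""
    return Counter(values).most_common(1)[0][0]


# Toy values, invented for illustration only (not taken from Spambase).
interval_column = [0.0, 0.0, 0.5, 1.0, 3.5]
binary_column = [1, 0, 0, 1, 0]

stats = describe(interval_column)
most_common_level = mode(binary_column)
```

The positive skewness of the toy column reflects its long right tail, which is the kind of information that helps decide whether an attribute should be transformed before plotting.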


The default settings of the Variable Selection operator are applied, except that the minimum R-square is increased in order to filter out unnecessary attributes.

Figure 15.2. The settings of the Variable Selection operator


The result consists of two parts: a list that records the decision for each variable, i.e., whether or not it remains in the data mining process, and a few graphs showing the importance of the variables.

Figure 15.3. List of variables after the selection
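
The effect of raising the minimum R-square can be illustrated with a simplified Python sketch. It assumes that the filter compares each input's squared correlation with the target against the threshold; the actual Variable Selection operator does more than this. The names r_square, select_variables, and min_r_square, the column names, and the toy values are all hypothetical.

```python
def r_square(x, y):
    """Squared Pearson correlation between an input column x and a target y."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column explains nothing
    return sxy * sxy / (sxx * syy)


def select_variables(columns, target, min_r_square):
    """Return (name, R-square, role) rows; role is 'input' or 'rejected'."""
    rows = []
    for name, values in columns.items():
        r2 = r_square(values, target)
        role = "input" if r2 >= min_r_square else "rejected"
        rows.append((name, r2, role))
    # Most informative variables first, as in an importance listing.
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Toy data, invented for illustration only (not taken from Spambase).
target = [0, 0, 0, 1, 1, 1]
columns = {
    "strong_input": [0.1, 0.2, 0.1, 0.9, 0.8, 1.0],
    "weak_input": [1, 2, 1, 2, 1, 2],
    "constant_input": [3, 3, 3, 3, 3, 3],
}

# The threshold of 0.5 is arbitrary and only chosen to suit the toy data.
decisions = select_variables(columns, target, min_r_square=0.5)
roles = {name: role for name, _, role in decisions}
```

Raising min_r_square moves more rows from 'input' to 'rejected', which is the behavior the increased threshold is used for in this experiment.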


Figure 15.4. Sequential R-square plot
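
A plot of this kind can be read as the R-square gained each time one more input enters the model. The Python sketch below reproduces that idea with a greedy forward selection on a linear model, assuming this reading of the plot; it is not the operator's own algorithm. The function sequential_r_square, its stop_r_square parameter, the column names, and the toy values are hypothetical.

```python
def _center(v):
    m = sum(v) / len(v)
    return [a - m for a in v]


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sequential_r_square(columns, target, stop_r_square):
    """Greedy forward selection; returns [(name, R-square gained), ...] in entry order."""
    residual = _center(target)
    total = _dot(residual, residual)
    if total == 0:
        return []  # a constant target cannot be explained
    remaining = {name: _center(v) for name, v in columns.items()}
    steps = []
    while remaining:
        # R-square gained by adding each candidate to the current model.
        gains = {}
        for name, x in remaining.items():
            xx = _dot(x, x)
            gains[name] = 0.0 if xx < 1e-12 else _dot(x, residual) ** 2 / (xx * total)
        best = max(gains, key=gains.get)
        if gains[best] < stop_r_square:
            break
        steps.append((best, gains[best]))
        q = remaining.pop(best)
        qq = _dot(q, q)
        # Remove the chosen direction from the residual and from the other candidates,
        # so that later gains measure only what the new input adds.
        coef = _dot(residual, q) / qq
        residual = [r - coef * b for r, b in zip(residual, q)]
        remaining = {
            name: [a - (_dot(x, q) / qq) * b for a, b in zip(x, q)]
            for name, x in remaining.items()
        }
    return steps


# Toy data, invented for illustration only (not taken from Spambase).
target = [0, 0, 0, 1, 1, 1]
columns = {
    "strong_input": [0.1, 0.2, 0.1, 0.9, 0.8, 1.0],
    "weak_input": [1, 2, 1, 2, 1, 2],
}

# The stopping value is arbitrary and only chosen to suit the toy data.
steps = sequential_r_square(columns, target, stop_r_square=0.001)
cumulative_r_square = sum(gain for _, gain in steps)
```

Plotting the gains in entry order gives a bar chart that falls off quickly once the most informative inputs are in the model, which is what makes such a plot useful for choosing the attributes to visualize.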


Once the important variables are known, a number of graphical tools of Enterprise Miner™ can be used to display the records.

Figure 15.5. The binary target variable as a function of the two most important input attributes after the variable selection


Interpretation of the results

The experiment shows how metadata can be extracted from SAS datasets and then passed on to other operators. It also demonstrates how variable selection can be performed when there is a large number of attributes, and how to continue working with only the important attributes.

Workflow

sas_preproc_exp1.xml

Keywords

variable selection
metadata

Operators

Data Source
Data Mining DataBase
Graph Explore
Variable Selection