Discretizing and weighting attributes

Description

Using a sample of the Individual household electric power consumption dataset, this process shows how an attribute that takes its values from an interval of real numbers can be discretized, i.e. converted to discrete values, each of which represents a defined subinterval of the original interval. The process also shows how weights can be assigned to the individual data columns. This is useful when a data mining procedure has to distinguish between the attributes according to their importance, rather than letting all of them contribute with equal weight to the algorithm and to the conclusions drawn from it.

Input

Individual household electric power consumption [UCI MLR]

Output

In the dataset, discretization is demonstrated on the variable Global_active_power. This variable represents the total power consumption of the whole household at a given moment, so its values change in a cyclic fashion, following the time of day. If a given method requires the total consumption as discrete values rather than real numbers, this column can be discretized. Discretization can be done using different operators: by defining the size of the categories (the number of elements in them), or by defining the number of categories. In the latter case, either categories of equal width or categories containing an equal number of elements can be created, for example as follows:

Figure 4.13. Selection of the appropriate discretization operator

Figure 4.14. Setting the properties of the discretization operator
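
The difference between the two binning strategies mentioned above can also be sketched outside the workflow. The following Python fragment is only an illustration and not the implementation of the discretization operators: the function names, the label format and the sample readings are assumptions made for this example, and the readings are not taken from the dataset.

```python
from bisect import bisect_right


def equal_width_edges(values, n_bins):
    """Inner cut points that split [min, max] into n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(1, n_bins)]


def equal_frequency_edges(values, n_bins):
    """Inner cut points chosen so that each bin holds roughly the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // n_bins] for i in range(1, n_bins)]


def discretize(values, edges):
    """Map each value to a 1-based bin index; a value equal to a cut point goes to the upper bin."""
    return [bisect_right(edges, v) + 1 for v in values]


def bin_label(index, edges):
    """Build a nominal label that also shows the interval covered by the bin (illustrative format)."""
    lo = "-inf" if index == 1 else f"{edges[index - 2]:.3f}"
    hi = "inf" if index == len(edges) + 1 else f"{edges[index - 1]:.3f}"
    return f"range{index} [{lo} - {hi}]"


# Hypothetical Global_active_power readings in kilowatts (not taken from the dataset).
power = [0.25, 0.5, 0.625, 0.75, 1.25, 1.5, 2.75, 4.25]

width_edges = equal_width_edges(power, 4)
freq_edges = equal_frequency_edges(power, 4)

width_bins = discretize(power, width_edges)  # bins cover intervals of equal width
freq_bins = discretize(power, freq_edges)    # bins contain equal numbers of values

for value, w_bin, f_bin in zip(power, width_bins, freq_bins):
    print(value, bin_label(w_bin, width_edges), bin_label(f_bin, freq_edges))
```

With skewed data such as these sample readings, equal-width binning places most of the values into the first bin, while equal-frequency binning spreads them evenly over the bins at the cost of unequal interval widths.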

Furthermore, for a given method to produce a result or decision that meets the requirements, it has to be defined how important each attribute is; the simplest way to do this is weighting. To weight the attributes, the weights themselves have to be created first, and then they have to be applied to the dataset. For example, weights can be set manually for this dataset to indicate that the globally measured values are the most important, the submeterings are less important, and the date and time values are the least important, as follows:

Figure 4.15. Selection of the appropriate weighting operator

Figure 4.16. Defining the weights of the individual attributes
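
The idea of manually specified weights can be sketched in a few lines of Python. This is a minimal illustration and not the behavior of the weighting operator itself: the weight values 4, 2 and 1 are hypothetical and are not the ones set in the figure, only a subset of the dataset's columns is listed, and the helper `normalize_weights` is an assumption introduced for this example.

```python
def normalize_weights(weights):
    """Rescale the weights so that the largest one becomes 1.0."""
    largest = max(weights.values())
    if largest <= 0:
        raise ValueError("at least one weight must be positive")
    return {name: weight / largest for name, weight in weights.items()}


# Hypothetical user-specified weights: global measurements matter most,
# submeterings less, date and time the least.
user_weights = {
    "Global_active_power": 4.0,
    "Global_reactive_power": 4.0,
    "Global_intensity": 4.0,
    "Sub_metering_1": 2.0,
    "Sub_metering_2": 2.0,
    "Sub_metering_3": 2.0,
    "Date": 1.0,
    "Time": 1.0,
}

normalized = normalize_weights(user_weights)

for name, weight in normalized.items():
    print(f"{name}: {weight}")
```

After normalization only the ratios between the weights matter: the most important attributes receive the weight 1, and all the other weights are expressed relative to them.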

Interpretation of the results

After executing these steps, the value of the variable Global_active_power is modified in all records of the dataset. The division into intervals has been carried out, and next to each discrete value, the interval that the value represents is displayed as well. In addition, comparing the weighted and the unweighted dataset (the weighted dataset is shown on the left and the unweighted one on the right, both in their state after the discretization) shows that the numeric values in the individual columns have been altered according to the weighting. As the normalize weights option is turned on, the greatest weight is considered to be 1, so the values of the columns to which this weight has been assigned do not change, and the values of the columns to which smaller weights have been assigned decrease proportionally:

Figure 4.17. Comparison of the weighted and unweighted dataset instances
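
The effect seen in the comparison can be reproduced with a small Python sketch. It assumes that scaling means multiplying each numeric value by the normalized weight of its column and that nominal values, such as the discretized Global_active_power, are left untouched; the function `scale_by_weights`, the weights and the sample record are hypothetical and serve only as an illustration.

```python
def scale_by_weights(rows, weights):
    """Multiply every numeric attribute by its weight; leave other values unchanged."""
    scaled = []
    for row in rows:
        new_row = {}
        for name, value in row.items():
            if isinstance(value, (int, float)) and name in weights:
                new_row[name] = value * weights[name]
            else:
                # Nominal values and columns without a weight are copied as they are.
                new_row[name] = value
        scaled.append(new_row)
    return scaled


# Hypothetical normalized weights: the greatest weight is 1.0.
weights = {"Global_intensity": 1.0, "Sub_metering_1": 0.5, "Sub_metering_3": 0.5}

# Hypothetical record after discretization (values are not taken from the dataset).
unweighted = [
    {
        "Global_active_power": "range2 [1.250 - 2.250]",
        "Global_intensity": 18.0,
        "Sub_metering_1": 2.0,
        "Sub_metering_3": 16.0,
    }
]

weighted = scale_by_weights(unweighted, weights)

print(unweighted[0])
print(weighted[0])
```

The column with weight 1.0 keeps its original value, while the submetering columns with weight 0.5 are halved, which corresponds to the proportional decrease described above.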

Workflow

preproc_exp4.rmp

Keywords

attribute discretization
attribute weighting
weighting
discretization

Operators

Discretize by Binning
Multiply
Read CSV
Scale by Weights
Weight by User Specification