Creating and filtering attributes

Description

This process uses the Insurance Company Benchmark (COIL 2000) dataset to show how new, computed attributes can be created from existing data when the attributes are not suitable in their original form, or when values derived from them are needed. It also shows how individual attributes can be removed from the dataset: if the raw data underlying a computation are not needed later on, the corresponding columns can be dropped. Other columns can be filtered out in the same way if they are not required for the task at hand, or if they would distort the results.

Input

The Insurance Company Benchmark (COIL 2000) [CoIL Challenge 2000]

Output

The attributes of the dataset whose names begin with the letter m contain demographic data for the region identified by the zip code of the given potential client, including the distribution of the individual income groups in that region. If, for some reason, the original representation needs to be compressed, a derived field can be created from these income attributes using a heuristic formula, for example as follows:

Figure 4.9. Defining a new attribute based on an expression relying on existing attributes

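The same idea can be sketched outside RapidMiner in a few lines of plain Python. This is only an illustration under stated assumptions: the attribute names MINKM30, MINK3045, MINK4575, MINK7512 and MINK123M are assumed to be the income-group columns of the dataset, the weights and the name income_score are hypothetical, and the three records are invented toy values rather than rows of COIL 2000. The expression actually used in the process is the one shown in the figure and may differ.

```python
# Toy records invented for illustration; they are not taken from the COIL 2000 data.
records = [
    {"MINKM30": 2, "MINK3045": 4, "MINK4575": 3, "MINK7512": 0, "MINK123M": 0},
    {"MINKM30": 5, "MINK3045": 3, "MINK4575": 1, "MINK7512": 0, "MINK123M": 0},
    {"MINKM30": 0, "MINK3045": 1, "MINK4575": 5, "MINK7512": 3, "MINK123M": 0},
]

# Hypothetical heuristic: higher income groups receive a larger weight.
WEIGHTS = {
    "MINKM30": 1,
    "MINK3045": 2,
    "MINK4575": 3,
    "MINK7512": 4,
    "MINK123M": 5,
}


def add_income_score(record):
    """Return a copy of the record extended with the derived attribute."""
    extended = dict(record)  # leave the original record untouched
    extended["income_score"] = sum(
        record[name] * weight for name, weight in WEIGHTS.items()
    )
    return extended


with_score = [add_income_score(record) for record in records]
```

Each record keeps its original attributes and gains one additional attribute that compresses the five income columns into a single number.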

Once the computed field has been created, it can be decided case by case whether the original fields used in the computation are still needed. It has to be considered whether these original data could be important for building models later on, or whether they could have a distorting effect. The raw attributes used in the computation, or any other arbitrarily selected attributes, can be removed from the original dataset as follows:

Figure 4.10. Properties of the operator used for removing the attributes made redundant


Figure 4.11. Selection of the attributes to remain in the dataset with reduced size

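Attribute removal can be sketched in the same way. The snippet below is a minimal illustration, not the configuration of the Select Attributes operator: the records are invented toy values, income_score is the hypothetical derived attribute, and MINKM30, MINK3045 and CARAVAN are assumed column names of the dataset.

```python
# Toy records invented for illustration; they are not taken from the COIL 2000 data.
records = [
    {"MINKM30": 2, "MINK3045": 4, "income_score": 10, "CARAVAN": 0},
    {"MINKM30": 5, "MINK3045": 3, "income_score": 11, "CARAVAN": 1},
]

# Raw attributes that were only needed to compute the derived field.
TO_REMOVE = {"MINKM30", "MINK3045"}


def drop_attributes(record, to_remove):
    """Return a copy of the record without the listed attributes."""
    return {
        name: value for name, value in record.items() if name not in to_remove
    }


reduced = [drop_attributes(record, TO_REMOVE) for record in records]
```

The number of records is unchanged; only the attribute set of each record becomes smaller.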

Interpretation of the results

After these steps have been executed, the modified dataset still contains every record, but with a changed attribute set: the new computed attribute appears in every record, while the attributes that were filtered out are no longer present:

Figure 4.12. The appearance of the derived attribute in the altered dataset


Workflow

preproc_exp3.rmp

Keywords

derived attribute
attribute creation
attribute removal
attribute subset

Operators

Generate Attributes
Read AML
Select Attributes