Using the Insurance Company Benchmark (COIL 2000) dataset, this process shows how new, computed attributes can be created from existing data when the attributes are not suitable in their original form or when a value derived from them is needed. The process also shows how attributes can be removed from the dataset: once a computed attribute has been created, the raw columns it was calculated from may no longer be needed and can be dropped. Other columns can be filtered out in the same way if they are not required for the given task or if they would distort the results.
The Insurance Company Benchmark (COIL 2000) [CoIL Challenge 2000]
The attributes of the dataset whose names begin with the letter m hold demographic data for the region belonging to the zip code of the given potential client, including the distribution of income groups in that region. If the original representation needs to be condensed, a derived field can be created from these income attributes using a heuristic formula, for example as follows:
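The derivation described above can be sketched in plain Python. This is only an illustration under stated assumptions: the income attribute names (`MINKM30` to `MINK123M`) are assumed to follow the dataset's data dictionary, while the bracket weights, the `income_score` field name, and the toy records are hypothetical and do not reproduce the formula used in the original process.

```python
# Heuristic weights per income-bracket attribute. Both the weights and the
# weighted-average formula are assumptions made for this sketch.
INCOME_WEIGHTS = {
    "MINKM30": 1,   # lowest income bracket
    "MINK3045": 2,
    "MINK4575": 3,
    "MINK7512": 4,
    "MINK123M": 5,  # highest income bracket
}


def add_income_score(records, weights=INCOME_WEIGHTS, name="income_score"):
    """Return copies of the records, each extended with a computed attribute.

    The computed value is the weighted average of the bracket levels, so one
    number summarizes the income distribution of the region.
    """
    result = []
    for rec in records:
        total = sum(rec[col] for col in weights)
        weighted = sum(w * rec[col] for col, w in weights.items())
        new_rec = dict(rec)  # leave the input record untouched
        # Guard against regions where every bracket level is 0.
        new_rec[name] = weighted / total if total else 0.0
        result.append(new_rec)
    return result


# Toy records with made-up values, not rows from the real dataset.
records = [
    {"MINKM30": 0, "MINK3045": 4, "MINK4575": 5, "MINK7512": 0,
     "MINK123M": 0, "CARAVAN": 0},
    {"MINKM30": 9, "MINK3045": 0, "MINK4575": 0, "MINK7512": 0,
     "MINK123M": 0, "CARAVAN": 1},
]
scored = add_income_score(records)
```

Every record keeps its original attributes and gains the new one. A different heuristic can be tried by passing another `weights` mapping.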
Once the computed field has been created, it can be decided, case by case, whether the original fields used in the computation are still needed. The question is whether these original data could be important for building models later or whether they could distort them. The raw attributes used for the computation, or any other selected attributes, can be removed from the original dataset as follows:
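The removal step can be illustrated with a minimal plain-Python sketch. The helper `drop_attributes`, the `income_score` field, and the toy record are hypothetical names and values chosen for this example; they are not part of the original process.

```python
def drop_attributes(records, to_remove):
    """Return copies of the records without the named attributes.

    Names in `to_remove` that a record does not contain are ignored.
    """
    to_remove = set(to_remove)
    return [
        {key: value for key, value in rec.items() if key not in to_remove}
        for rec in records
    ]


# Toy record that already carries a computed attribute (made-up values).
dataset = [
    {"MINKM30": 9, "MINK3045": 0, "MINK4575": 0, "MINK7512": 0,
     "MINK123M": 0, "income_score": 1.0, "CARAVAN": 1},
]

# Drop the raw income attributes that were used to compute the score.
raw_income = ["MINKM30", "MINK3045", "MINK4575", "MINK7512", "MINK123M"]
reduced = drop_attributes(dataset, raw_income)
```

The number of records is unchanged. Only the attribute set shrinks, and the input list is left untouched, so the raw columns remain available if they turn out to be needed for a later model.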
After these steps have been executed, the modified dataset still contains every record, but with a changed attribute set. The new computed attribute appears in every record, while the attributes that were filtered out no longer appear: