Sampling and aggregation


Using a sample of the Individual household electric power consumption dataset, this process shows how data can be summarized by aggregation, or sampled when not all of the individual records are needed. Aggregation is appropriate when the individual records are not required but values computed over the whole dataset are. Sampling is appropriate when only a fraction of the dataset is needed and conclusions are to be drawn from that subset.


Individual household electric power consumption [UCI MLR]


Any of the aggregate functions available in SQL can be used for aggregation, which makes it easy to compute basic statistics for the attributes of the dataset.

Figure 4.4. Selection of aggregate functions for attributes

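The aggregation step can also be sketched outside the graphical process. The following is a minimal illustration in plain Python, not the process shown in the figure: the column names follow the UCI dataset, the four rows are made-up values used only for demonstration, and the helper `aggregate` is a hypothetical name.

```python
import statistics

# Illustrative rows only (not real measurements); column names follow the UCI dataset.
rows = [
    {"Date": "16/12/2006", "Time": "17:24:00", "Global_active_power": 4.0, "Voltage": 234.0},
    {"Date": "16/12/2006", "Time": "17:25:00", "Global_active_power": 5.0, "Voltage": 233.0},
    {"Date": "17/12/2006", "Time": "17:24:00", "Global_active_power": 3.0, "Voltage": 235.0},
    {"Date": "17/12/2006", "Time": "17:25:00", "Global_active_power": 2.0, "Voltage": 238.0},
]


def aggregate(records, attribute):
    """Compute SQL-style aggregates (COUNT, SUM, MIN, MAX, AVG) for one attribute."""
    values = [record[attribute] for record in records]
    return {
        "count": len(values),
        "sum": sum(values),
        "minimum": min(values),
        "maximum": max(values),
        "average": statistics.fmean(values),
    }


# One aggregate record per attribute replaces the individual rows.
power_stats = aggregate(rows, "Global_active_power")
voltage_stats = aggregate(rows, "Voltage")
```

The result holds one set of summary values per attribute, so the individual records are no longer needed once the aggregates have been computed.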

The dataset can be sampled either by explicitly specifying the size of the sample or on the basis of a probability. A filter can be used instead when the parts of the dataset do not need to be represented proportionally in the sample and a specific subset of the original dataset is required. For example, filtering for the records belonging to a given time on every day can be done as follows:

Figure 4.5. Preferences for dataset sampling

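The two sampling modes mentioned above, a fixed sample size and a per-record probability, can be illustrated with a short Python sketch. It is an assumption-based illustration and not the configuration shown in the figure: the function names `sample_absolute` and `sample_by_probability` are hypothetical, and a list of integers stands in for the dataset records.

```python
import random


def sample_absolute(records, size, seed=None):
    """Draw exactly `size` distinct records without replacement."""
    rng = random.Random(seed)
    return rng.sample(records, size)


def sample_by_probability(records, probability, seed=None):
    """Keep each record independently with the given probability.

    The size of the result varies from run to run; only its expected
    value equals probability * len(records).
    """
    rng = random.Random(seed)
    return [record for record in records if rng.random() < probability]


# Stand-in for the dataset: 1000 record identifiers.
records = list(range(1000))

fixed_sample = sample_absolute(records, 50, seed=1)
probability_sample = sample_by_probability(records, 0.1, seed=1)
```

With a fixed size the number of resulting records is known in advance, whereas with a probability only the expected proportion is fixed.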

Figure 4.6. Preferences for dataset filtering

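Filtering for a given time on every day can be sketched in the same way. The example below is a minimal illustration under stated assumptions: it assumes the `Time` attribute is stored as an `HH:MM:SS` string, the rows are made-up values, and `at_time_of_day` is a hypothetical helper rather than part of the process shown in the figure.

```python
from datetime import datetime, time

# Illustrative rows only (not real measurements); column names follow the UCI dataset.
rows = [
    {"Date": "16/12/2006", "Time": "17:24:00", "Global_active_power": 4.0},
    {"Date": "16/12/2006", "Time": "17:25:00", "Global_active_power": 5.0},
    {"Date": "17/12/2006", "Time": "17:24:00", "Global_active_power": 3.0},
    {"Date": "17/12/2006", "Time": "17:25:00", "Global_active_power": 2.0},
]


def at_time_of_day(records, wanted):
    """Keep only the records whose Time attribute equals the wanted time of day."""
    return [
        record
        for record in records
        if datetime.strptime(record["Time"], "%H:%M:%S").time() == wanted
    ]


# One record per day remains: the measurement taken at 17:24.
selected = at_time_of_day(rows, time(17, 24))
```

Unlike random sampling, the filter is deterministic: the same condition always selects the same records.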

Interpretation of the results

After aggregation, the resulting dataset consists only of the aggregate values produced by the specified operations. After sampling or filtering, it consists only of the records that were selected or that fulfil the specified conditions:

Figure 4.7. Resulting dataset after dataset sampling


Figure 4.8. Resulting dataset after dataset filtering






data filtering


Filter Examples
Read CSV