Using a sample of the Individual household electric power consumption dataset, this process shows how the data can be summarized by aggregation, or sampled when not all of the individual records are needed. Aggregation is appropriate when the individual records are not needed but values computed over the whole dataset are. Sampling is appropriate when only a fraction of the dataset is needed and conclusions are to be drawn from that subset.
Individual household electric power consumption [UCI MLR]
When aggregating, any of the aggregate functions available in SQL can be used, which makes it easy to compute basic statistics for the dataset.
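The kind of aggregation described above can be sketched in Python with the standard-library `sqlite3` module. This is only an illustration under stated assumptions, not the process itself: the table name `power` is hypothetical, the four rows are made-up values, and only the column names (`Date`, `Time`, `Global_active_power`, `Voltage`) follow the UCI dataset.

```python
import sqlite3

# Synthetic stand-in rows (made-up values); only the column names
# follow the UCI dataset. The table name "power" is an assumption.
rows = [
    ("16/12/2006", "17:24:00", 4.0, 234.0),
    ("16/12/2006", "17:25:00", 5.0, 233.0),
    ("17/12/2006", "17:24:00", 3.0, 235.0),
    ("17/12/2006", "17:25:00", 2.0, 236.0),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE power ("
    "Date TEXT, Time TEXT, Global_active_power REAL, Voltage REAL)"
)
conn.executemany("INSERT INTO power VALUES (?, ?, ?, ?)", rows)

# Basic statistics over the whole table using SQL aggregate functions.
summary = conn.execute(
    "SELECT COUNT(*), MIN(Global_active_power), MAX(Global_active_power), "
    "AVG(Global_active_power), SUM(Global_active_power) FROM power"
).fetchone()

# The same idea per group: one aggregate row per day.
per_day = conn.execute(
    "SELECT Date, AVG(Global_active_power) FROM power "
    "GROUP BY Date ORDER BY Date"
).fetchall()

print(summary)
print(per_day)
```

After the query, only the aggregate values remain: a single row for the whole table, or one row per group when `GROUP BY` is used.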
Sampling can be done by explicitly specifying the sample size or by specifying a probability. A filter can also be used when the parts of the dataset should not be represented proportionally in the sample and a specific subset of the original dataset is needed instead. For example, filtering for the records that belong to a given time on every day can be done as follows:
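As a rough Python sketch of these three options, using only the standard library: the records below are synthetic (three days of hourly readings with made-up values), and the sample size of 10, the probability of 0.25, and the time `18:00:00` are arbitrary choices for illustration.

```python
import random

# Synthetic records: 3 days x 24 hourly readings with made-up values.
# Field names follow the UCI dataset; everything else is an assumption.
records = [
    {
        "Date": f"{day:02d}/12/2006",
        "Time": f"{hour:02d}:00:00",
        "Global_active_power": round(0.5 + 0.1 * hour, 1),
    }
    for day in (16, 17, 18)
    for hour in range(24)
]

rng = random.Random(42)  # fixed seed so repeated runs give the same sample

# 1. Explicit sample size: exactly 10 records, drawn without replacement.
absolute_sample = rng.sample(records, k=10)

# 2. Probability-based: each record is kept independently with
#    probability 0.25, so the resulting size varies from run to run.
probability_sample = [r for r in records if rng.random() < 0.25]

# 3. Filter: keep only the records measured at a given time on every day.
filtered = [r for r in records if r["Time"] == "18:00:00"]

print(len(absolute_sample), len(probability_sample), len(filtered))
```

The first two options draw from all records without regard to their content, while the filter deliberately selects a non-proportional subset, here one record per day.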
After the aggregation or sampling, the resulting dataset consists only of the aggregate values produced by the specified operations, or of the records that fulfil the specified conditions, respectively: