Table of Contents
The process shows, using a sample of the Individual household electric power consumption data set, how to manage such data sets in which duplicate, inconsistent, and/or duplicate values are present. The missing values can be substituted with a default value, or one computed based on the other instances of the field, or if it is necessary, the records belonging to them can be deleted. After defining the inconsistent values, the records belonging to these can also be filtered out, but however, to define these, it is usually necessary to have some background knowledge or knowledge in the given field. On the contrary, filtering duplicate values is a rather automated task, the data identical to each other can be filtered out easily.
Individual household electric power consumption [UCI MLR]
The dataset used here is a sample from the original dataset that encompasses a longer time period; it only contains the energy consumption data from January 2007. Normally the dataset contains a measurement for every minute, but if a given measurement was not executed for some reason, the timestamp is present without data. Such missing values can be substituted with defined values, e.g. with the average of the values present in the given attribute, or a decision to leave out the records connected to them can be made. However, dealing with inconsistent values is a more complex issue. In given fields, intervals can be defined into which the values of the attributes can fall, but in other cases, one must relay on other kinds of background information. For example, let us suppose that the members of the household, in which the measurements took place, do not fly with the owls. Based on this, consider the following representation of the data:
On the figure, colors are assigned based on the values of the variable Sub_metering_1, which represents the energy consumption of the tools in the kitchen, and as the x axis is time, it can be seen that some of the outstanding kitchen consumption values have been measured late at night. This can also be seen in the data view, if the data are ordered according to the kitchen consumption values:
If the members of the household are indeed not flying with the owls, these data can be considered inconsistent based on our background knowledge, and they can be filtered out as follows. Formulating our condition, let us assume that if the values of a measurement in the kitchen exceeds 50 Wh at a point of time after 10 p.m., this is considered a piece of inconsistent data. Based on this, the filtering condition can easily be defined, but first, the time attribute has to be converted, as by default, it is stored in a nominal variable, in the format hh:mm:ss, which can only be compared for equality. Using the appropriate operators, it can be splitted into the components hour, minute, and second, interpreted as numbers. The filtering condition can be defined using the Time_1 variable from among these, which contains the hour component:
Using such filters, the records belonging to inconsistent data can be filtered out, and also, by using the appropriate operator, duplicate records can be filtered out as well - it can also be defined based on the equality of which attribute set they are to be considered duplicate - from the dataset, and after this, the actual processing of the filtered and/or refined records can begin.