Preface

Contents

How to use this material

Data mining is an interdisciplinary area of information technology and one of the most important parts of the so-called KDD (Knowledge Discovery in Databases) process. It consists of computationally intensive algorithms and methods that are capable of exploring patterns in relatively large datasets, patterns that represent well-interpretable information for further use. The applied algorithms originate from a number of fields, namely artificial intelligence, machine learning, statistics, and database systems. Moreover, data mining combines the results of these areas and still evolves in interaction with them today. In contrast to approaches that focus merely on data analysis, such as statistics, data mining covers a number of additional elements, including data management and data preprocessing, as well as such post-processing issues as interestingness measures or the suitable visualization of the explored knowledge.

The use of the term data mining has become very fashionable, and many people mistakenly use it for all sorts of information processing involving large amounts of data (e.g., simple information extraction or data warehouse building); it also appears in the context of decision support systems. In fact, the most important feature of data mining is exploration or discovery, that is, producing new, previously unknown, and useful information for the user. The term data mining emerged in the 1960s, when statisticians used it in a negative sense for analyzing data without any presupposition. In information technology it first appeared in the database community in the 1990s, in the context of describing sophisticated information extraction. Although the term data mining is the one that has spread in business, several synonyms exist, for example, knowledge discovery. It is important to distinguish data mining from the Big Data problems that are so challenging nowadays. The solution of Big Data problems usually does not require the development of new theoretical models or methods; the problem is rather that the well-working algorithms of data mining software slow down hopelessly when one wants to process a really large volume of data as a whole instead of a sample of reasonable size. This obviously requires a special attitude and IT infrastructure that is outside the scope of the present curriculum.

Data mining activity, in an automatic or semi-automatic way, is integrated into the IT infrastructure of the organization that applies it. This means that data mining tools can provide newer and newer information for the users from the ever-changing data sources, typically from data warehouses, with relatively limited human intervention. The reason is that the (business) environment is constantly changing, and the data warehouse that collects data from this environment changes with it. Hence, previously fitted data mining models lose their validity, and new models may be needed to describe the altered data. Data mining software increasingly supports this approach by being able to operate in very heterogeneous environments. The collaboration between information services and the supporting analytics nowadays allows the development of online systems based on real-time analytics, see, e.g., recommendation systems.

Data mining is organized around the so-called data mining process, which is followed by the majority of data mining software. In general, this is a five-step process where the steps are as follows:

1. sampling,
2. exploring,
3. modifying,
4. modeling,
5. assessing the results.

Data mining software provides operators for these steps, with which certain operations can be carried out, for example, reading an external file, filtering outliers, or fitting a neural network model. When the data mining process is represented by a diagram in a graphical interface, the nodes of the diagram correspond to these operators. Examples of this process are the SEMMA methodology of SAS Institute Inc.®, which is known for its information delivery software, and the widely used Cross Industry Standard Process for Data Mining (CRISP-DM) methodology, which has evolved through the cooperation of many branches of industry, e.g., finance, automotive, and information technology.

During sampling, the target database of the data mining process is formed. In most cases the source of the data is an enterprise (organizational) data warehouse or its subject-oriented part, a so-called data mart. The data obtained from there have therefore, in general, already gone through a preprocessing phase when moving from the operational systems into the data warehouse, and thus they can be considered reliable. If this is not the case, the data mining software provides tools for data cleaning, which can then be regarded as the second step of the process. Sampling can generally be done using an appropriate statistical method, for example, simple random or stratified sampling. In this step the dataset is also partitioned into training, validation, and test sets. The data mining model is fitted on the training dataset, where its parameters are estimated. The validation dataset is used to stop the convergence of the training process or to compare different models; by using a dataset independent of the training data, we obtain a reliable decision on where to stop the training. Finally, the generalization ability of the model can be measured on the test dataset, that is, how the model is expected to behave on new records.
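
The course material carries out this partitioning with RapidMiner and SAS® Enterprise Miner™ operators; purely as an illustration, the following Python sketch shows the same idea using pandas and scikit-learn (the file name dataset.csv and the column name target are hypothetical placeholders, not part of the course material):

    import pandas as pd
    from sklearn.model_selection import train_test_split

    # Hypothetical input file and target attribute; replace them with the
    # dataset of the given exercise.
    data = pd.read_csv("dataset.csv")

    # Split off a 20% test set first, stratified on the target so that the
    # class proportions are preserved in every partition.
    train_valid, test = train_test_split(
        data, test_size=0.20, stratify=data["target"], random_state=42)

    # Split the remainder into training (60% of all records) and
    # validation (20% of all records) sets.
    train, valid = train_test_split(
        train_valid, test_size=0.25, stratify=train_valid["target"], random_state=42)

    print(len(train), len(valid), len(test))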

The exploring step means getting acquainted with the data, without any preconception if possible. The objective of this step is to form hypotheses that help establish which procedures are applicable. The main tools are descriptive statistics and graphical visualization. Data mining software offers a number of graphical tools that exceed those of standard statistical software. Another objective of exploring is to identify existing errors (noise) and to find where data are missing.
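
In the course material the exploration itself is done with the graphical tools of the data mining software; the following Python fragment is only a minimal, hypothetical illustration of the same activity with pandas and matplotlib (dataset.csv is again a placeholder file name):

    import matplotlib.pyplot as plt
    import pandas as pd

    data = pd.read_csv("dataset.csv")   # hypothetical input file

    # Descriptive statistics of the numeric attributes.
    print(data.describe())

    # Number of missing values per attribute: where are the gaps?
    print(data.isna().sum())

    # A simple graphical tool: a histogram of every numeric attribute.
    data.hist(figsize=(10, 8))
    plt.show()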

The purpose of the modifying step is the preparation of the data for fitting a data mining model. There may be several reasons for this. One of them is that many methods directly require the modification of the data; for example, in the case of neural networks the attributes have to be standardized before the training of the network. Another is that even if a method does not require the modification of the data, a better fitting model may be obtained after a suitable modification. An example is the normalization of the attributes by suitable transformations before fitting a regression model, so that the input attributes become close to normally distributed. The modification can be carried out at multiple levels: at the level of entire attributes by transforming whole attributes, at the level of records, e.g., by standardizing some records, or at the level of fields by modifying individual data values. The modifying step also includes the handling of noisy data and the replacement of missing data, the so-called imputation.
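
As an illustration of attribute-level modification, the following sketch standardizes the numeric attributes and imputes their missing values with scikit-learn; it is only a hypothetical Python counterpart of the corresponding RapidMiner and SAS® Enterprise Miner™ operators, and dataset.csv is again a placeholder:

    import pandas as pd
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import StandardScaler

    data = pd.read_csv("dataset.csv")          # hypothetical input file
    numeric = data.select_dtypes("number")     # attribute-level modification below

    # Replace missing values with the column mean (imputation), then
    # standardize each attribute to zero mean and unit variance, as
    # required, e.g., before training a neural network.
    imputed = SimpleImputer(strategy="mean").fit_transform(numeric)
    standardized = StandardScaler().fit_transform(imputed)

    prepared = pd.DataFrame(standardized, columns=numeric.columns)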

Modeling is the most complex step of the data mining process, and it also requires the most knowledge. In essence, this is where, after suitable preparation, we solve the data mining task. Typical data mining tasks can be divided into two groups. The first group is known as supervised data mining or supervised learning. In this case there is an attribute with a special role in the dataset, the so-called target. The target variable has to be indicated in the data mining software being used, and our task is to describe this target variable by means of the other variables as well as we can. The second group is known as unsupervised data mining or unsupervised learning. In this case there is no special attribute in the analyzed dataset, in which we want to explore hidden patterns. Within data mining, six task types can be defined: classification and regression are supervised tasks, while segmentation, association, sequence analysis, and anomaly detection are unsupervised tasks.
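
The difference between the two groups can be illustrated with a short, hedged Python sketch: a classification (supervised) and a segmentation (unsupervised) model are fitted on the same publicly available iris dataset, which serves only as an example and does not appear in the course material:

    from sklearn.cluster import KMeans
    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)      # illustrative public dataset

    # Supervised task (classification): y is the designated target attribute.
    classifier = DecisionTreeClassifier(random_state=42).fit(X, y)
    print(classifier.predict(X[:5]))

    # Unsupervised task (segmentation): no target attribute is used at all.
    segments = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
    print(segments[:5])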

Assessing the results is the last step of the data mining process; its objective is to decide whether truly relevant and useful knowledge has been obtained. It often happens that the improper use of data mining produces a model with weak generalization ability that works very poorly on new data. This is the so-called overfitting. In order to avoid overfitting, we should rely on the training, validation, and test datasets. At this step we can also compare our fitted models if there is more than one. For the comparison, various measures, e.g., the misclassification rate and the mean squared error, as well as graphical tools, e.g., the lift curve and the ROC curve, can be used.
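
As a hypothetical illustration of such an assessment, the following Python sketch computes the misclassification rate and the ROC curve of a classifier on an independent test set; the breast cancer dataset of scikit-learn is used only as an example and is not part of the course material:

    from sklearn.datasets import load_breast_cancer
    from sklearn.metrics import accuracy_score, roc_auc_score, roc_curve
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)     # illustrative public dataset
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=42)

    model = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)

    # Misclassification rate on the independent test set.
    print("misclassification rate:", 1 - accuracy_score(y_test, model.predict(X_test)))

    # ROC curve and the area under it, based on predicted class probabilities.
    scores = model.predict_proba(X_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, scores)
    print("AUC:", roc_auc_score(y_test, scores))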

This electronic curriculum aims to provide an introduction to data mining applications by showing these applications in practice through the use of data mining software. Problems requiring data mining can be encountered in many fields of life. Some of these are listed below; the datasets used in the course material also come from these areas.

How to use this material

The RapidMiner and SAS® Enterprise Miner™ workflows presented in this course material are contained in the file resources/workflows.zip.

Important

The data files used in the experiments must be downloaded by the user from the locations specified in the text. After importing a workflow, the file paths must be set to point to the local copies of the data files (absolute paths are required).