Data mining is an interdisciplinary area of information technology and one of the most important parts of the so-called KDD (Knowledge Discovery in Databases) process. It comprises computationally intensive algorithms and methods capable of extracting patterns from relatively large datasets, patterns that represent well-interpretable information for further use. The applied algorithms originate from a number of fields, namely artificial intelligence, machine learning, statistics, and database systems. Data mining combines the results of these areas and still evolves in interaction with them today. In contrast to fields that focus merely on data analysis, such as statistics, data mining includes a number of additional elements: data management and data preprocessing, as well as such post-processing issues as interestingness metrics and the suitable visualization of the discovered knowledge.
The term data mining has become very fashionable, and many people mistakenly use it for all sorts of information processing involving large amounts of data (e.g., simple information extraction or data warehouse building); it also appears in the context of decision support systems. In fact, the most important feature of data mining is exploration or discovery, that is, producing new, previously unknown, and useful information for the user. The term emerged in the 1960s, when statisticians used it in a negative sense for analyzing data without any presupposition. In information technology it first appeared in the database community in the 1990s, in the context of describing sophisticated information extraction. Although the term data mining is the one that has spread in business, several synonyms exist, for example, knowledge discovery. It is also important to distinguish data mining from today's challenging Big Data problems. Solving Big Data problems usually does not require the development of new theoretical models or methods; the problem is rather that the well-working algorithms of data mining software slow down hopelessly when one wants to process a really large volume of data as a whole instead of a sample of reasonable size. This obviously requires a special attitude and IT infrastructure that are outside the scope of the present curriculum.
Data mining activity is integrated, in an automatic or semi-automatic way, into the IT infrastructure of the organization that applies it. This means that data mining tools can provide newer and newer information for the users from ever-changing data sources, typically data warehouses, with relatively limited human intervention. The reason is that the (business) environment is constantly changing, and the data warehouse that collects data from the environment follows these changes. Hence, previously fitted data mining models lose their validity, and new models may be needed to describe the altered data. Data mining software increasingly supports this approach by being able to operate in very heterogeneous environments. The collaboration between information services and the supporting analytics nowadays allows the development of online systems based on real-time analytics, see, e.g., recommendation systems.
Data mining is organized around the so-called data mining process, which is followed by the majority of data mining software packages. In general, this is a five-step process, where the steps are as follows:
Sampling, data selection;
Exploring the data;
Modifying the data;
Modeling;
Assessing the results.
Data mining software provides operators for these steps, with which we can carry out certain operations, for example, reading an external file, filtering outliers, or fitting a neural network model. When the data mining process is represented by a diagram in a graphical interface, the nodes of the diagram correspond to these operators. Examples of this process are the SEMMA methodology of SAS Institute Inc.®, which is known for its information delivery software, and the widely used Cross Industry Standard Process for Data Mining (CRISP-DM) methodology, which has evolved through the cooperation of many branches of industry, e.g., finance, automotive, and information technology.
During sampling, the target database of the data mining process is formed. In most cases the source of the data is an enterprise (organizational) data warehouse or a subject-oriented part of it, a so-called data mart. Therefore, the data obtained from there have generally gone through a preprocessing phase when they moved from the operational systems into the data warehouse, and thus they can be considered reliable. If this is not the case, the data mining software provides tools for data cleaning, which can then be considered the second step of the process. Sampling can generally be done using an appropriate statistical method, for example, simple random or stratified sampling. In this step the dataset is also partitioned into training, validation, and test sets. The data mining model is fitted on the training dataset, where its parameters are estimated. The validation dataset is used to stop the convergence of the training process or to compare different models. Since this dataset is independent of the training dataset, we obtain a reliable decision about where to stop the training. Finally, the generalization ability of the model can be measured on the test dataset, that is, how the model is expected to behave on new records.
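The three-way partitioning described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the 60/20/20 split ratio and the fixed random seed are illustrative choices, not something prescribed by the curriculum, where such partitioning is done by an operator of the data mining software.

```python
import random

def partition(records, train=0.6, valid=0.2, seed=42):
    """Shuffle the records and split them into three disjoint subsets
    by simple random sampling."""
    rng = random.Random(seed)          # fixed seed for reproducibility
    shuffled = records[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * train)
    n_valid = int(n * valid)
    return (shuffled[:n_train],                    # training: model fitting
            shuffled[n_train:n_train + n_valid],   # validation: stopping / comparison
            shuffled[n_train + n_valid:])          # test: generalization estimate

train_set, valid_set, test_set = partition(list(range(100)))
```

Because the three subsets are disjoint, the validation and test errors are computed on records the model has never seen during fitting, which is exactly what makes them reliable estimates.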
The exploring step means becoming acquainted with the data, without any preconception if possible. Its objective is to form hypotheses about the applicable procedures. The main tools are descriptive statistics and graphical visualization. A data mining software package has a number of graphical tools that exceed those of standard statistical software. Another objective of exploring is to identify existing errors (noise) and to find the locations of missing data.
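As a minimal sketch of what exploring means in practice, the following standard-library Python snippet computes descriptive statistics for a single attribute and counts its missing values (here represented by None); the toy age values are invented for illustration.

```python
import statistics

ages = [23, 45, None, 31, 45, None, 52, 38]   # toy attribute with missing values

present = [x for x in ages if x is not None]  # keep only the observed values
summary = {
    "n": len(present),
    "missing": ages.count(None),              # how many fields are empty
    "mean": statistics.mean(present),
    "median": statistics.median(present),
    "stdev": statistics.stdev(present),
    "min": min(present),
    "max": max(present),
}
```

A data mining package would present the same information graphically, e.g., as histograms and missing-value reports, for every attribute at once.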
The purpose of modifying is to prepare the data for fitting a data mining model. There may be several reasons for this. One is that many methods directly require the modification of the data; for example, in the case of neural networks the attributes have to be standardized before training the network. Another is that even if a method does not require modification of the data, a better-fitting model may be obtained after suitable modification. An example is the normalization of the attributes by suitable transformations before fitting a regression model, so that the input attributes become closer to the normal distribution. The modification can be carried out at multiple levels: at the level of entire attributes, by transforming whole attributes; at the level of records, e.g., by standardizing some records; or at the level of fields, by modifying individual data values. The modifying step also includes the handling of noisy and missing data, the latter by so-called imputation.
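The standardization mentioned above (required, e.g., before training a neural network) transforms an attribute to zero mean and unit standard deviation. A minimal sketch, using only the Python standard library and invented sample values:

```python
import statistics

def standardize(values):
    """Z-score standardization: subtract the mean, divide by the
    standard deviation, so the result has mean 0 and std. dev. 1."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)   # population standard deviation
    return [(x - mu) / sigma for x in values]

raw = [10.0, 20.0, 30.0, 40.0]
scaled = standardize(raw)
```

In a data mining package this transformation is an attribute-level operator applied to the training data, and the same fitted transformation must then be applied to the validation and test data.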
Modeling is the most complex step of the data mining process and requires the most knowledge as well. In essence, after suitable preparation, this is where we solve the data mining task. The typical data mining tasks can be divided into two groups. The first group is known as supervised data mining or supervised learning. In this case there is an attribute with a special role in the dataset, called the target. The target variable has to be indicated in the data mining software used. Our task then is to describe this target variable by using the other variables as well as we can. The second group is known as unsupervised data mining or unsupervised learning. In this case there is no special attribute in the analyzed dataset in which we want to explore hidden patterns. Within data mining, six task types can be defined, of which classification and regression are supervised, while segmentation, association, sequential analysis, and anomaly detection are unsupervised.
Classification: modeling known classes (groups) for generalization purposes, in order to apply the built model to new records. Example: filtering emails by classifying them into spam and non-spam classes.
Regression: building a model that approximates a continuous target by a function of the input attributes such that the error of this approximation is as small as possible. Example: estimating customer value from current demographic and historical data.
Segmentation, clustering: finding groups of similar records in the data without taking into account any known existing structure. A typical example is customer segmentation, when a bank or insurance company looks for groups of clients behaving similarly.
Association: searching for relationships between attributes. A typical example is market basket analysis, when we look at which goods are bought together by customers in stores and supermarkets.
Anomaly detection: identifying records that may be interesting in themselves or that require further investigation because they may stem from a mistake. An example is searching for clients or users with extreme behavior.
Sequential analysis: searching for temporal and spatial relationships between attributes. Examples are the order in which customers take up services, or the examination of gene sequences.
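To make one of the unsupervised tasks concrete, the association task of market basket analysis boils down to computing the support of itemsets and the confidence of rules. The following is a standard-library sketch on invented toy baskets, not the algorithm of any particular data mining package:

```python
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset):
    """Fraction of baskets containing every item of the itemset."""
    hits = sum(1 for b in baskets if itemset <= b)
    return hits / len(baskets)

def confidence(antecedent, consequent):
    """Estimated conditional probability of the consequent
    given the antecedent, i.e., the strength of the rule."""
    return support(antecedent | consequent) / support(antecedent)

s = support({"bread", "milk"})        # {bread, milk} occurs in 2 of 4 baskets
c = confidence({"bread"}, {"milk"})   # among bread buyers, how many buy milk
```

Real association mining algorithms (e.g., Apriori) compute these quantities efficiently for all frequent itemsets instead of one rule at a time.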
Assessing the results is the last step of the data mining process; its objective is to decide whether truly relevant and useful knowledge has been reached by the data mining process. It often happens that improper use of data mining produces a model that has weak generalization ability and works very poorly on new data. This is the so-called overfitting. In order to avoid overfitting, we should rely on the training, validation, and test datasets. At this step we can also compare our fitted models if there is more than one. In the comparison, various measures, e.g., misclassification rate and mean squared error, and graphical tools, e.g., the lift curve and the ROC curve, can be used.
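The misclassification rate mentioned above, and the way it reveals overfitting, can be sketched as follows; the label vectors are toy data invented for illustration, with a model that is perfect on the training set but much weaker on the test set.

```python
def misclassification_rate(actual, predicted):
    """Fraction of records where the predicted class differs from the actual one."""
    errors = sum(1 for a, p in zip(actual, predicted) if a != p)
    return errors / len(actual)

train_actual  = [0, 1, 1, 0, 1, 0, 0, 1]
train_predict = [0, 1, 1, 0, 1, 0, 0, 1]   # perfect fit on the training data
test_actual   = [1, 0, 1, 0, 1]
test_predict  = [0, 0, 1, 1, 1]            # two errors on unseen records

train_err = misclassification_rate(train_actual, train_predict)
test_err  = misclassification_rate(test_actual, test_predict)
```

A large gap between the training and test error, as here (0.0 versus 0.4), is the typical signature of an overfitted model.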
This electronic curriculum aims at providing an introduction to data mining applications, showing these applications in practice through the use of data mining software. Problems requiring data mining can be encountered in many fields of life. Some of these are listed below; the datasets used in the course material also come from these areas.
Commercial data mining. One of the main driving forces behind the development and application of data mining. Its objective is to analyze the static, historical business data stored in data warehouses in order to explore hidden patterns and trends. Besides the standard ways of collecting data, companies have found several other ways to build more reliable data mining models; this is also one of the main reasons behind the spread of loyalty cards. Among the many specific application areas we emphasize customer relationship management (CRM): who are our customers and how should we deal with them; churn analysis: which customers are planning to leave us; and cross-selling: which products should be offered together. The algorithms of market basket analysis were born in solving a business problem.
Scientific data mining. The other main driving force behind the development of data mining. Many data mining methods, for example, neural networks and the self-organizing map, were developed for solving a scientific problem and became data mining methods only years later. The application areas range from astronomy (galaxy classification and processing various kinds of radiation detected in space), chemistry (forecasting the properties of artificial molecules), and the engineering sciences (materials science, traffic management) to biology (bioinformatics, drug discovery, and genetics). Data mining can help in areas where the problem of data deluge appears, i.e., far more data is generated than the scientist is able to process.
Mining medical data. The development of health information technology makes it possible for doctors to share their diagnostic results with each other, and thus an examination does not have to be repeated several times. Moreover, by collecting the diagnoses resulting from examinations in a common data warehouse, it becomes possible to develop new medical procedures by means of data mining techniques. Data mining is also likely to play an important role in personalized medicine.
Spatial data mining. The analysis of spatial data with data mining methods; the extension of traditional geographic information systems (GIS) with data mining tools. Application areas: climate research, the spread of epidemics, and customer analysis for large multinational companies taking the spatial dimension into account. An important area in the future will be the processing of data generated in sensor networks, e.g., pollution monitoring over an area.
Multimedia data mining. The analysis of audio, image, and video files by data mining tools. Data mining can help to find similarities between songs in order to decide copyright issues more objectively. Another application is finding content that infringes copyright or is otherwise illegal in file-sharing systems and at multimedia service providers.
Web mining. The analysis of web data. Three types of web mining problems are distinguished: web structure mining, web content mining, and web usage mining. Web structure mining means examining the structure of the Web, i.e., the web graph, where the set of vertices consists of the sites and the set of edges consists of the links between them. Web content mining means the retrieval of useful information from the contents of the web; the well-known web search engines (Google, AltaVista, etc.) also perform this task. Web usage mining deals with examining what users are searching for on the Internet, using the data gathered by web servers and application servers. These areas are strongly related to Big Data problems because they often need to operate over an Internet-scale infrastructure.
Text mining. The mining of unstructured or semi-structured data. By unstructured data we mean continuous texts (sequences of strings), which may be connected by a theme (e.g., scientific), by a field (e.g., sport), or which can be customer sentiments recorded at a customer service. Semi-structured data are typically produced by computers, or are files produced for computers, for example, in XML or JSON format. Some specific applications: data mining for security reasons, e.g., searching for terrorists; analytical CRM; sentiment analysis; and academic applications (plagiarism investigation).
The RapidMiner and SAS® Enterprise Miner™ workflows presented in this course material are contained in the file resources/workflows.zip.
The data files used in the experiments must be downloaded by the user from the locations specified in the text. After importing a workflow, the file paths must be set to point to the local copies of the data files (absolute paths are required).