Data Mining

Definition

Data mining is an interdisciplinary branch of computer science that uses statistical and mathematical methods to extract patterns, correlations, and trends from a data set. The extracted information is then transformed into an understandable structure for further use. Data mining is the analysis step within a process called "Knowledge Discovery in Databases" (KDD).

Application areas of data mining

Text mining is an application of data mining that uses statistical and linguistic methods to extract information from unstructured natural-language sources and present it visually.
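As a rough illustration of the statistical side of text mining, the following sketch counts the most frequent terms in a piece of text. The function name `top_terms`, the stop-word list, and the sample sentence are assumptions made for this example; real text mining typically adds linguistic steps such as stemming or part-of-speech tagging.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list for this example.
STOP_WORDS = {"the", "a", "an", "is", "was", "and", "of", "to", "in", "but"}

def top_terms(text, n=3):
    """Return the n most frequent non-stop-word terms with their counts."""
    # Lowercase the text and split it into simple word tokens.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(t for t in tokens if t not in STOP_WORDS)
    return counts.most_common(n)

review = "The delivery was late. Late delivery is a problem, but the support was helpful."
print(top_terms(review, 2))
```

Even this simple count surfaces "delivery" and "late" as the dominant terms of the sample review.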

Another application of data mining is financial data analysis, which can predict loan repayment probability, analyze creditworthiness, classify and cluster customers for targeted marketing, and detect money laundering and other financial crimes.

In marketing and commerce, data mining is used to evaluate large amounts of data in the fields of sales, purchase history, goods transport, consumption, and services. Data mining helps to identify customer buying patterns and trends through multidimensional analysis of sales, customers, products, time, and region. This information can be used to improve customer service and strengthen customer loyalty and satisfaction.
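A multidimensional analysis of this kind can be sketched as an aggregation over freely chosen dimensions. The records, field names, and the helper `total_by` below are hypothetical and serve only to illustrate the idea.

```python
from collections import defaultdict

# Hypothetical sales records; the field names are assumptions for this example.
SALES = [
    {"region": "north", "product": "tea", "month": "2024-01", "revenue": 120.0},
    {"region": "north", "product": "coffee", "month": "2024-01", "revenue": 180.0},
    {"region": "south", "product": "tea", "month": "2024-01", "revenue": 80.0},
    {"region": "south", "product": "tea", "month": "2024-02", "revenue": 40.0},
]

def total_by(records, *dimensions):
    """Sum revenue for every combination of values in the given dimensions."""
    totals = defaultdict(float)
    for record in records:
        key = tuple(record[d] for d in dimensions)
        totals[key] += record["revenue"]
    return dict(totals)

print(total_by(SALES, "region"))
print(total_by(SALES, "product", "month"))
```

The same function answers questions along different dimensions, for example revenue per region or revenue per product and month.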

Data mining is also used in intrusion detection systems for networks. Intrusion refers to any action that threatens the integrity, confidentiality, or availability of network resources. With the increased use of the internet and availability of tools to invade and attack networks, intrusion detection through the analysis of large amounts of data has become an important part of network administration.

Methods of data mining

Data mining uses different methods for data analysis, depending on the data available and the information sought.

Tracking patterns

Recognizing patterns in data sets is one of the most basic techniques in data mining. Pattern recognition can reveal repetitions, regularities, and, in particular, conspicuous deviations in data sets. This helps with detecting fraudulent activity or, as in crime analytics, predicting where the next crime is likely to occur.

Classification

Classification assigns the items in a data collection to predefined categories. This is useful, for instance, for assigning low, medium, or high credit risk to bank customers. Based on this information, a lender might, for example, calculate the interest rate on a loan.
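The credit-risk example can be sketched with a k-nearest-neighbors classifier, which is only one of many possible classification techniques. The training data, the two features, and the function name `classify` are invented for illustration. In practice the features would be scaled first, since here the income value dominates the distance.

```python
import math
from collections import Counter

# Hypothetical labeled customers: (annual income in thousands, debt ratio) -> risk class.
TRAINING = [
    ((85.0, 0.10), "low"),
    ((78.0, 0.15), "low"),
    ((52.0, 0.35), "medium"),
    ((48.0, 0.40), "medium"),
    ((25.0, 0.70), "high"),
    ((22.0, 0.65), "high"),
]

def classify(customer, training=TRAINING, k=3):
    """Return the majority risk class among the k nearest labeled customers."""
    by_distance = sorted(training, key=lambda item: math.dist(customer, item[0]))
    labels = [label for _, label in by_distance[:k]]
    return Counter(labels).most_common(1)[0][0]

print(classify((80.0, 0.12)))  # low
print(classify((24.0, 0.68)))  # high
```

A new customer receives the class that is most common among the most similar known customers.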

Association

Association analysis uncovers hidden relationships by searching the data for events that tend to occur together with another event. Examples include the joint purchase of different products by a customer or increased sales of certain products before public holidays or during major sporting events.
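For the joint-purchase example, a minimal sketch could compute two common association measures: support (how often two products appear together) and confidence (how often one product is bought given that another is). The baskets and function names are assumptions for this example, and full algorithms such as Apriori go beyond pairs of items.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets.
BASKETS = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "cereal"},
    {"bread", "butter", "jam"},
]

def pair_support(baskets):
    """Return the share of baskets that contain each pair of products."""
    counts = Counter()
    for basket in baskets:
        # Sort so that each pair is always counted under the same key.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n / len(baskets) for pair, n in counts.items()}

def confidence(baskets, antecedent, consequent):
    """Return the share of baskets with the antecedent that also hold the consequent."""
    with_antecedent = [b for b in baskets if antecedent in b]
    if not with_antecedent:
        return 0.0
    return sum(consequent in b for b in with_antecedent) / len(with_antecedent)

print(pair_support(BASKETS)[("bread", "butter")])  # 0.6
print(confidence(BASKETS, "butter", "bread"))      # 1.0
```

In this sample, every basket that contains butter also contains bread, whereas only three of the four bread baskets contain butter.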

Outlier detection

Outlier detection is used to detect anomalies in data sets. This can be used, for example, to identify the specific weekdays or occasions on which individual products are in unusually high demand, as a starting point for finding out why.
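One simple way to flag anomalies is the z-score, which measures how many standard deviations a value lies from the mean. The sketch below assumes hypothetical daily sales figures and a threshold of 2.0. Both are illustrative choices, and more robust methods exist for skewed data.

```python
from statistics import mean, stdev

# Hypothetical units sold per day for one product.
DAILY_UNITS = [100, 102, 98, 101, 99, 250, 100]

def z_score_outliers(values, threshold=2.0):
    """Return (index, value) pairs lying more than `threshold` standard deviations from the mean.

    Requires at least two values, because the sample standard deviation is used.
    """
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return []  # All values are identical, so nothing stands out.
    return [(i, v) for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]

print(z_score_outliers(DAILY_UNITS))  # [(5, 250)]
```

The day with 250 units is flagged, and an analyst can then check whether it coincided with a holiday or a promotion.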

Clustering

Clustering in data mining refers to a process that groups the objects in a database into classes of similar objects. Unlike classification, the classes are not defined in advance but emerge from the data. With clustering you can, for example, group customers with similar purchasing behavior.
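Grouping customers by purchasing behavior can be sketched with k-means, one widely used clustering algorithm. The customer data, the choice of two features, and the fixed starting centroids are assumptions made to keep the example small and deterministic.

```python
import math

# Hypothetical customers: (orders per month, average basket value).
CUSTOMERS = [(1, 20), (2, 25), (1, 22), (10, 200), (12, 210), (11, 190)]

def k_means(points, centroids, iterations=10):
    """Run a fixed number of k-means iterations from the given starting centroids.

    Returns the final centroids and the clusters assigned in the last iteration.
    """
    clusters = []
    for _ in range(iterations):
        # Assignment step: put every point into the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                new_centroids.append(tuple(sum(axis) / len(cluster) for axis in zip(*cluster)))
            else:
                new_centroids.append(old)  # Keep the centroid of an empty cluster.
        centroids = new_centroids
    return centroids, clusters

centroids, clusters = k_means(CUSTOMERS, [(1, 20), (10, 200)])
print(clusters)
```

The occasional low-value buyers and the frequent high-value buyers end up in separate groups without any labels being supplied.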

Regression

Regression analysis is a statistical method for analyzing how one variable depends on changes in other variables. For example, regression analysis can show how a product's price depends on the availability of the product or on a change in the competitive situation.
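The price-versus-availability example can be illustrated with a simple linear regression fitted by ordinary least squares. The figures below are invented and perfectly linear so that the result is easy to verify, and `fit_line` is a hypothetical helper rather than a library function.

```python
# Hypothetical observations: units in stock and the price charged at that stock level.
STOCK = [10, 20, 30, 40, 50]
PRICE = [9.0, 8.0, 7.0, 6.0, 5.0]

def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    if variance == 0:
        raise ValueError("all x values are identical, so no line can be fitted")
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_line(STOCK, PRICE)
print(round(slope, 3), round(intercept, 3))  # -0.1 10.0
```

The negative slope indicates that, in this sample, the price falls by about 0.1 for every additional unit in stock.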

Predictive analytics

Predictive analytics builds models from historical data in order to forecast future outcomes. Such models could be used, for example, to predict future purchasing behavior or the development of a business unit.
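As a small stand-in for a predictive model, the following sketch applies simple exponential smoothing to a short sales history and uses the smoothed level as the forecast for the next period. The history, the smoothing factor `alpha`, and the function name are assumptions, and production models are usually far richer.

```python
# Hypothetical monthly sales history, oldest value first.
HISTORY = [100.0, 110.0, 120.0, 130.0]

def forecast_next(history, alpha=0.5):
    """Forecast the next value with simple exponential smoothing.

    alpha close to 1 weights recent observations heavily,
    alpha close to 0 weights older observations heavily.
    """
    if not history:
        raise ValueError("history must contain at least one value")
    level = history[0]
    for value in history[1:]:
        level = alpha * value + (1 - alpha) * level
    return level

print(forecast_next(HISTORY))  # 121.25
```

Because simple exponential smoothing does not model trends, the forecast lags behind the steadily rising series, which is one reason to choose the model type to match the data.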

Data mining and big data

Big data refers to very large volumes of structured, semi-structured, and unstructured data. There is no fixed size threshold, although data volumes in the terabyte range and above are commonly described as big data. In general, three basic characteristics (volume, velocity, and variety) are used to identify big data. Volume describes the amount of data, velocity is the speed at which the data is generated, and variety describes the diversity of data types and sources that make up big data. As with the analysis of smaller data sets, data mining allows you to extract useful information from big data.

Possible problems with data mining

The information obtained through data mining is only as reliable as the underlying data. Poor data quality, such as noisy data, missing, inaccurate, or false values, or an insufficient amount of data, can lead to misinterpretations. Integrating contradictory or redundant data from different sources such as multimedia files, geodata, texts, or social media can also cause problems during evaluation.
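A basic quality check before mining might count rows with missing values and redundant rows. The record layout, the field names, and the helper `quality_report` are hypothetical and cover only two of the problems named above.

```python
# Hypothetical records merged from several sources.
RECORDS = [
    {"id": 1, "amount": 10.0},
    {"id": 2, "amount": None},
    {"id": 1, "amount": 10.0},
    {"id": 3, "amount": ""},
]

def quality_report(records, required_fields):
    """Count rows with missing required values and rows that repeat an earlier row."""
    rows_with_missing = sum(
        1 for r in records if any(r.get(f) in (None, "") for f in required_fields)
    )
    seen = set()
    duplicate_rows = 0
    for r in records:
        key = tuple(r.get(f) for f in required_fields)
        if key in seen:
            duplicate_rows += 1
        else:
            seen.add(key)
    return {
        "rows": len(records),
        "rows_with_missing": rows_with_missing,
        "duplicate_rows": duplicate_rows,
    }

print(quality_report(RECORDS, ["id", "amount"]))
```

A report like this makes it easier to decide whether the data needs cleaning before any conclusions are drawn from it.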

Data protection and data security are also well-known problem areas. If legal requirements are not met, data mining can lead to serious problems regarding data security, data protection, and governance. You also need to ensure that your customers' data is protected from unauthorized third-party access.