"Coming together is a beginning. Keeping together is progress. Working together is success."
Data Mining – Discovering Patterns in Large Data Sets
What is Data Mining?
Every organization runs its business on a daily basis, which means it continuously produces data to support its business processes. A systematic approach to defining, developing and deploying an organization’s data strategy includes consolidating data from multiple sources into an enterprise or domain-specific repository: the data warehouse. Over time data warehouses become large, and relationships between the data sets inside them are not always easy to discover. With standard data management techniques these relationships could stay hidden from data consumers. The solution is to apply a data management technique that can “mine” these relationships: data mining.
Data mining is the process of discovering actionable information from large sets of data. It uses mathematical analysis to derive patterns and trends that exist in data. Typically, these patterns cannot be discovered by traditional data exploration because the relationships are too complex or because there is too much data. Data mining draws on various disciplines, such as database technology, statistics, information science, machine learning and visualization.
Typical data mining scenarios include:
- Forecasting sales.
- Targeting mailings toward specific customers.
- Determining which products are likely to be sold together.
- Finding sequences in the order that customers add products to a shopping cart.
The data mining process consists of an iterative sequence of the following steps:
- Data cleaning (to remove noise and inconsistent data)
- Data integration (where multiple data sources may be combined); the data resulting from cleaning and integration is stored in a data warehouse
- Data selection (selection of data relevant to the analysis)
- Data transformation (transformation of data into forms suitable for mining)
- Data mining (an essential process where intelligent methods are applied in order to discover data patterns)
- Pattern evaluation (to identify the truly interesting patterns representing knowledge, based on some interestingness measures)
- Knowledge presentation (where visualization and knowledge representation techniques are used to present the mined knowledge to the user)
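The steps above can be sketched as a chain of small functions. This is only an illustration: the records, field names and the trivial "mining" step (ranking products by quantity sold) are assumptions made up for the example, not a real data mining method.

```python
from collections import Counter

# Hypothetical raw records from two sources (all values invented for illustration).
store_sales = [
    {"customer": "c1", "product": "bread", "qty": 2},
    {"customer": "c2", "product": "milk", "qty": None},  # noisy: missing quantity
    {"customer": "c1", "product": "milk", "qty": 1},
]
web_sales = [
    {"customer": "c3", "product": "Bread ", "qty": 1},   # inconsistent spelling
    {"customer": "c3", "product": "milk", "qty": 3},
]

def clean(records):
    """Data cleaning: drop records with a missing quantity, normalize product names."""
    return [
        {**r, "product": r["product"].strip().lower()}
        for r in records
        if r["qty"] is not None
    ]

def integrate(*sources):
    """Data integration: combine several cleaned sources into one collection."""
    return [r for source in sources for r in source]

def select(records, fields):
    """Data selection: keep only the fields relevant to the analysis."""
    return [{f: r[f] for f in fields} for r in records]

def transform(records):
    """Data transformation: aggregate quantities per product."""
    totals = Counter()
    for r in records:
        totals[r["product"]] += r["qty"]
    return totals

def mine(totals):
    """Data mining (trivial stand-in): rank products by total quantity."""
    return totals.most_common()

def evaluate(patterns, min_qty):
    """Pattern evaluation: keep only patterns at or above a threshold."""
    return [(product, qty) for product, qty in patterns if qty >= min_qty]

def present(patterns):
    """Knowledge presentation: render the results as readable text lines."""
    return [f"{product}: {qty} units" for product, qty in patterns]

warehouse = integrate(clean(store_sales), clean(web_sales))
selected = select(warehouse, ["product", "qty"])
report = present(evaluate(mine(transform(selected)), min_qty=3))
print("\n".join(report))
```

In practice the sequence is repeated: the evaluation results often send the analyst back to selection or transformation with a refined question.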
Data Mining Challenges
Mining different kinds of knowledge in databases
Different users are interested in different kinds of knowledge. That means that data mining should cover a wide spectrum of data analysis tasks. These tasks may use the same database in different ways and require the development of numerous data mining techniques.
Interactive mining of knowledge at multiple levels of abstraction
Because it is difficult to know exactly what can be discovered within the database, the data mining process should be interactive. Interactive mining allows users to focus the search for patterns, providing and refining data mining requests based on returned results.
Incorporation of background knowledge
Background knowledge could be considered in two ways:
- Knowledge of the domain under study, which helps users understand results faster
- Knowledge of the databases, which helps speed up the process
Data mining query languages
Relational query languages (such as SQL) allow users to pose ad hoc queries for data retrieval. In a similar way, data mining query languages need to be developed to allow users to describe ad hoc data mining tasks. Such a description includes the data sets to analyze, the domain knowledge, the kinds of knowledge to be mined, and the conditions and constraints to be enforced on the discovered patterns.
Presentation and visualization of data mining results
Results of data mining should be expressed in pseudo languages, visual representations or other expressive forms, so that the discovered knowledge is easily understood by information consumers. This is especially crucial if the data mining system is to be interactive.
Pattern evaluation
A data mining system can discover thousands of patterns. Many of the patterns discovered may be uninteresting to the given user. The challenge is to develop techniques able to assess the interestingness of discovered patterns.
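One way to make interestingness concrete, assuming the patterns are co-occurrence rules of the form "customers who buy A also buy B", is to score each rule with support, confidence and lift and keep only the rules above some thresholds. The sketch below is illustrative: the baskets, the thresholds and the function names are assumptions, not part of any particular data mining system.

```python
# Hypothetical shopping baskets (invented for illustration).
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"milk"},
    {"bread", "butter"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Of the baskets containing the antecedent, the fraction that also contain the consequent."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)

def lift(antecedent, consequent, baskets):
    """Confidence relative to how common the consequent is anyway.

    A lift above 1 suggests the items occur together more often than
    they would if they were independent.
    """
    return confidence(antecedent, consequent, baskets) / support(consequent, baskets)

# Candidate patterns, written as (antecedent, consequent) pairs.
candidate_rules = [({"bread"}, {"butter"}), ({"bread"}, {"milk"})]

# Assumed thresholds: a rule must cover at least 40% of baskets and have lift > 1.
interesting = [
    (a, c)
    for a, c in candidate_rules
    if support(a | c, baskets) >= 0.4 and lift(a, c, baskets) > 1.0
]
print(interesting)
```

With this data only the bread-and-butter rule survives: bread and milk do appear together, but no more often than the popularity of milk alone would explain.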
Efficiency and scalability of data mining algorithms
Data mining algorithms must be efficient and scalable in order to process large data sets. In other words, the running time of a data mining algorithm must be predictable and acceptable on large data sets.
Classification and prediction
Classification and prediction are two forms of data analysis that can be used to extract models describing important data classes or to predict future data trends. Classification predicts categorical (discrete, unordered) labels. Prediction, on the other hand, models continuous-valued functions.
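The difference between the two can be shown with a minimal sketch. Both techniques here are stand-ins chosen for brevity (a 1-nearest-neighbour classifier and a least-squares line), and all data values and names are invented for the example.

```python
# Hypothetical labeled examples: monthly income (in thousands) -> loan risk label.
labeled = [(1.0, "risky"), (1.5, "risky"), (4.0, "safe"), (5.0, "safe")]

def classify(x, labeled):
    """Classification: return the categorical label of the closest known example."""
    nearest = min(labeled, key=lambda pair: abs(pair[0] - x))
    return nearest[1]

# Hypothetical numeric observations: advertising spend -> sales.
points = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0)]

def fit_line(points):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

slope, intercept = fit_line(points)

def predict(x):
    """Prediction: return a continuous value from the fitted line."""
    return slope * x + intercept

print(classify(1.2, labeled))  # a label from a fixed set of classes
print(predict(5.0))            # a number on a continuous scale
```

The classifier can only ever answer with one of the labels it has seen, while the fitted line can return any numeric value, including ones outside the observed range.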
Data Mining Algorithms
Algorithms used for data mining can be categorized as follows:
- Classification algorithms - predict one or more discrete variables, based on the other attributes in the dataset (e.g. Decision Trees Algorithm).
- Regression algorithms - predict one or more continuous variables, such as profit or loss, based on other attributes in the dataset (e.g. Time Series Algorithm).
- Segmentation algorithms - divide data into groups, or clusters, of items that have similar properties (e.g. Clustering Algorithm).
- Association algorithms - find correlations between different attributes in a dataset. The most common application of this kind of algorithm is creating association rules, which can be used in market basket analysis (e.g. Association Algorithm).
- Sequence analysis algorithms - summarize frequent sequences or episodes in data, such as a Web path flow (e.g. Sequence Clustering Algorithm).
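As an illustration of the segmentation category, here is a minimal one-dimensional k-means sketch. It is plain k-means with fixed starting centroids, chosen for simplicity; it is not claimed to be the specific Clustering Algorithm named above, and the order values and starting points are invented.

```python
def kmeans_1d(values, initial_centroids, iterations=10):
    """Group numbers into clusters around the nearest centroid.

    Each iteration assigns every value to its closest centroid, then moves
    each centroid to the mean of the values assigned to it.
    """
    centroids = list(initial_centroids)
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # An empty cluster keeps its previous centroid.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical order values for six customers.
order_values = [1, 2, 3, 10, 11, 12]
centroids, clusters = kmeans_1d(order_values, initial_centroids=[1, 12])
print(centroids)
print(clusters)
```

The result depends on the starting centroids, which is why practical implementations usually try several starting points.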
Data mining concepts
- The mining structure is a data structure that defines the data domain from which mining models are built.
- A data mining model applies a mining model algorithm to the data that is represented by a mining structure.
- Columns:
  - A discrete column contains a finite number of values with no continuum between the values. For example, a gender column is a typical discrete attribute column, in that the data represents a specific number of categories.
  - A continuous column contains values that represent numeric data on a scale that allows interim values. Unlike a discrete column, which represents finite, countable data, a continuous column represents scalable measurements, and it is possible for the data to contain an infinite number of fractional values. A column of temperatures is an example of a continuous attribute column.
  - A discretized column contains values that represent groups of values, known as buckets, that are derived from a continuous column. The buckets are treated as ordered and discrete values. Discretization is the process of putting values of a continuous set of data into buckets so that there is a limited number of possible values. You can discretize both numeric and string columns.
  - A key column uniquely identifies a row.
- A cluster is a collection of data objects that are similar to one another within the same cluster and are dissimilar to the objects in other clusters. A cluster of data objects can be treated as one group and so may be considered a form of data compression.
- A decision tree is a flowchart-like tree structure, where each internal node (nonleaf node) denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node holds a class label.
- A time series database consists of sequences of values or events changing with time, typically measured at equal intervals.
- A sequence database consists of sequences of ordered elements or events, recorded with or without a concrete notion of time.
- Trend analysis decomposes time-series data into the trend (long-term) movements, cyclic movements, seasonal movements and irregular movements
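Two of the concepts above, discretization and trend analysis, are easy to sketch in a few lines. The example assumes equal-width buckets and a simple moving average as the trend estimate; both are common choices but only two of several possible ones, and the sample values are invented.

```python
def discretize(values, bucket_count):
    """Equal-width discretization: map each value to a bucket index 0..bucket_count-1."""
    low, high = min(values), max(values)
    width = (high - low) / bucket_count
    if width == 0:
        return [0] * len(values)  # all values identical: a single bucket
    # The maximum value would land one past the end, so clamp it into the last bucket.
    return [min(int((v - low) / width), bucket_count - 1) for v in values]

def moving_average(series, window):
    """Smooth a time series by averaging each run of `window` consecutive values."""
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]

# Hypothetical continuous column of temperatures.
temperatures = [10.0, 12.5, 15.0, 20.0, 27.5, 30.0]
buckets = discretize(temperatures, bucket_count=4)
print(buckets)

# Hypothetical sales measured at equal time intervals.
sales = [10, 14, 12, 16, 14, 18]
trend = moving_average(sales, window=2)
print(trend)
```

The bucket indices are ordered, so they can be treated as discrete values by algorithms that do not accept continuous input. The moving average removes the up-and-down fluctuation in the sales figures and leaves the long-term upward movement.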
Data mining applications across industries
Financial industry
Financial data collected in the banking and financial industry are usually relatively complete, reliable, and of high quality, which facilitates systematic data analysis and data mining. Typical cases of data mining usage in the financial services industry are listed below:
- Loan payment prediction and customer credit policy analysis
- Classification and clustering of customers for targeted marketing
- Detection of money laundering and other financial crimes
Retail industry
The retail industry is a major application area for data mining, since it collects huge amounts of data on sales, customer shopping history, goods transportation, consumption and service. The quantity of data collected continues to expand rapidly. Typical cases of data mining usage in the retail industry are listed below:
- Multidimensional analysis of sales, customers, products, time and region
- Analysis of the effectiveness of marketing campaigns
- Customer retention – analysis of customer loyalty
- Cross-sell and up-sell product analysis
Telecommunication industry
The telecommunication industry has quickly evolved from offering local and long-distance telephone services to providing many other comprehensive communication services, including fax, cellular phone, Internet messaging and email. Typical cases of data mining usage in the telecommunication industry are listed below:
- Multidimensional analysis of telecommunication data
- Mobile telecommunication services
- Use of visualization tools in telecommunication data analysis
References:
J. Han and M. Kamber, Data Mining: Concepts and Techniques, 2nd edition (2006)
