Management for All: DATA MINING

Wednesday, July 23, 2014

DATA MINING

According to Berry and Linoff, Data Mining is the exploration and analysis, by automatic or semiautomatic means, of large quantities of data in order to discover meaningful patterns and rules. This definition, justifiably, raises the question: how does data mining differ from OLAP? OLAP (Online Analytical Processing) is undoubtedly a semiautomatic means of analyzing data, but the main difference lies in the quantities of data that can be handled.

There are other differences as well. Tables 1 and 2 summarize these differences.

Table-1 : OLAP Vs Data Mining – Past Vs Future

OLAP: Report on the past	Data Mining: Predict the future
Who are our top 100 best customers for the last three years?	Which 100 customers offer the best profit potential?
Which customers defaulted on their mortgages in the last two years?	Which customers are likely to be bad credit risks?
What were the sales by territory last quarter compared to the targets?	What are the anticipated sales by territory and region for next year?
Which salespersons sold more than their quota during the last four quarters?	Which salespersons are expected to exceed their quotas next year?
Last year, which stores exceeded the total prior year sales?	For the next two years, which stores are likely to have the best performance?
Last year, which were the top five promotions that performed well?	What is the expected return for next year’s promotions?
Which customers switched to other phone companies last year?	Which customers are likely to switch to the competition next year?

Table-2 : Differences between OLAP and Data Mining

FEATURES	OLAP	DATA MINING
Motivation for Information request	What is happening in the enterprise?	Predict the future based on why this is happening
Data granularity	Summary data	Detailed transaction-level data.
Number of business Dimensions	Limited number of dimensions	Large number of dimensions.
Number of dimension Attributes	Small number of attributes	Many dimension attributes
Sizes of datasets for the dimensions	Not large for each dimension	Usually very large for each dimension
Analysis approach	User-driven, interactive analysis	Data-driven automatic knowledge discovery
Analysis techniques	Multidimensional, drill-down, and slice-and-dice	Prepare data, launch mining tool and sit back
State of the Technology	Mature and widely used	Still emerging; some parts of the technology are more mature

Why Now?

Why is data mining being put to use in more and more businesses? Here are some basic reasons:

• In today’s world, an organization generates more information in a week than most people can read in a lifetime. It is humanly impossible to study, decipher, and interpret all that data to find useful patterns.

• A data warehouse pools all the data after proper transformation and cleansing into well-organized data structures. Nevertheless, the sheer volume of data makes it impossible for anyone to use analysis and query tools to discern useful patterns.

• In recent times, many data mining tools suitable for a wide range of applications have appeared in the market. The tools and products are now mature enough for business use.

• Data mining needs substantial computing power. Parallel hardware, databases, and other powerful components are available and are becoming very affordable.

• Organizations are placing enormous emphasis on building sound customer relationships, and for good reasons. Companies want to know how they can sell more to existing customers. Organizations are interested in determining which of their customers will prove to be of long-term value to them. Companies need to discover any existing natural classifications among their customers so that each such class may be properly targeted with products and services. Data mining enables companies to find answers and discover patterns in their customer data.

• Finally, competitive considerations weigh heavily on organizations to get into data mining. Perhaps competitors are already using data mining.

Data Mining Techniques

Data mining covers a broad range of techniques. Each technique has been heavily researched in recent years, and several mature and efficient algorithms have evolved for each of them. The main techniques are: Cluster detection, Decision trees, Memory-based reasoning, Link analysis, Rule induction, Association rule discovery, Outlier detection and analysis, Neural networks, Genetic algorithms, and Sequential pattern discovery. The algorithms associated with the various techniques are not discussed here for two main reasons: first, they are too mathematical and technical in nature, and second, there are numerous well-written textbooks to serve the needs of those who are especially interested in the subject.

Table 3 below summarizes the important features of some of these techniques. The model structure refers to how the technique is perceived, not how it is actually implemented. For example, a decision tree model may actually be implemented through SQL statements. In the framework, the basic process is the process performed by the particular data mining technique. For example, decision trees perform the process of splitting at decision points. How a technique validates the model is important. In the case of neural networks, the technique does not contain a validation method to determine termination. The model calls for processing the input records through the different layers of nodes and terminating the discovery at the output node.
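To make the idea of splitting at decision points concrete, here is a minimal sketch in Python. It assumes the split is chosen by information gain, which is one common entropy-based criterion (Table 3 names entropy as the basis for splits). The function names and the toy loan data are hypothetical and are not taken from any particular mining tool.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def information_gain(rows, labels, attribute):
    """Entropy reduction obtained by splitting the rows on one attribute."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def best_split(rows, labels):
    """Attribute whose split yields the largest information gain."""
    return max(rows[0].keys(), key=lambda a: information_gain(rows, labels, a))

# Hypothetical toy data: did a customer default on a loan?
rows = [
    {"income": "low", "owns_home": "no"},
    {"income": "low", "owns_home": "yes"},
    {"income": "high", "owns_home": "no"},
    {"income": "high", "owns_home": "yes"},
]
labels = ["default", "default", "ok", "ok"]

print(best_split(rows, labels))
```

In this toy data the income attribute separates the two outcomes completely, so it is selected as the first decision point. A full decision tree would repeat the same choice recursively on each resulting group.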

Table 3 : Summary of Data Mining Techniques

Data Mining Technique	Underlying Structure	Basic Process	Validation Method
Cluster Detection	Distance calculation in n-vector space	Grouping of values in the same neighbourhood	Cross Validation to Verify Accuracy
Decision Trees	n-ary Tree	Splits at decision points based on entropy	Cross Validation
Memory-based Reasoning	Predictive Structure Based on Distance and Combination Functions	Association of unknown instances with known instances	Cross Validation
Link Analysis	Graphs	Discover links among variables by their values	Not Applicable
Neural Networks	Forward Propagation Network	Weighted inputs of predictors at each node	Not Applicable
Genetic Algorithms	Fitness Functions	Survival of the fittest on mutation of derived values	Mostly Cross Validation
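
The memory-based reasoning row can be illustrated with a small sketch. It assumes Euclidean distance as the distance function and a majority vote as the combination function, which amounts to a k-nearest-neighbour classifier. The records, the labels, and the value of k are hypothetical.

```python
import math
from collections import Counter

def distance(a, b):
    """Distance function: Euclidean distance between two numeric records."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def combine(neighbour_labels):
    """Combination function: majority vote among the neighbours' labels."""
    return Counter(neighbour_labels).most_common(1)[0][0]

def classify(known, unknown, k=3):
    """Label an unknown record from its k nearest known records."""
    nearest = sorted(known, key=lambda rec: distance(rec[0], unknown))[:k]
    return combine([label for _, label in nearest])

# Hypothetical known instances: (age, income in thousands) -> responded to offer
known = [
    ((25, 30), "no"),
    ((30, 35), "no"),
    ((28, 32), "no"),
    ((50, 90), "yes"),
    ((55, 95), "yes"),
    ((48, 85), "yes"),
]

print(classify(known, (52, 88)))
```

In practice the attributes would normally be scaled to comparable ranges before distances are computed, so that no single attribute dominates the result.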

Data Mining Applications

Data mining technology encompasses a rich collection of proven techniques that cover a wide range of applications in both the commercial and non-commercial realms. In some cases, multiple techniques are used, back to back, to greater advantage. For instance, a cluster detection technique to identify clusters of customers may be followed by a predictive algorithm applied to some of the identified clusters to discover the expected behaviour of the customers in those clusters.
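
The back-to-back use of techniques might look like the following sketch. It assumes plain k-means as the cluster detection step and uses a per-cluster average outcome as a deliberately naive stand-in for the predictive step. The customer attributes, the churn outcomes, and the choice of the first k points as starting centroids are all hypothetical simplifications.

```python
import math

def distance(a, b):
    """Euclidean distance between two points in n-dimensional space."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, iterations=10):
    """Plain k-means; the first k points serve as the initial centroids."""
    centroids = [tuple(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: distance(p, centroids[c]))
            for p in points
        ]
        # Move each centroid to the mean of its current members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return assignment, centroids

def rate_per_cluster(assignment, outcomes):
    """Average outcome per cluster: a naive stand-in for a predictive model."""
    totals, counts = {}, {}
    for cluster, outcome in zip(assignment, outcomes):
        totals[cluster] = totals.get(cluster, 0) + outcome
        counts[cluster] = counts.get(cluster, 0) + 1
    return {c: totals[c] / counts[c] for c in totals}

# Hypothetical customers: (purchases per year, average order value)
customers = [(1, 1), (9, 9), (1, 2), (2, 1), (8, 9), (9, 8)]
churned = [1, 0, 1, 0, 0, 0]  # hypothetical outcome: 1 = customer left

assignment, centroids = kmeans(customers, k=2)
print(assignment)
print(rate_per_cluster(assignment, churned))
```

The first step groups the customers into two clusters. The second step then estimates, for each cluster separately, how likely its members are to leave.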

Non-commercial use of data mining is strong and pervasive in the research area. In oil exploration and research, data mining techniques discover locations suitable for drilling based on potential mineral and oil deposits. Pattern discovery and matching techniques have military applications in helping to identify targets. Medical research is a field ripe for data mining. The technology helps researchers discover correlations between diseases and patient characteristics. Crime investigation agencies use the technology to connect criminal profiles to crimes. In astronomy and cosmology, data mining helps predict cosmic events.

The scientific community makes use of data mining to a moderate extent, but the technology has widespread applications in the commercial arena. Most of the tools target the commercial sector. Consider the following list of a few major applications of data mining in the business area.

Customer Segmentation: This is one of the most widespread applications. Businesses use data mining to understand their customers. Cluster detection algorithms discover clusters of customers sharing the same characteristics.

Market Basket Analysis: This is a very useful application for the retail industry. Association rule algorithms uncover affinities between products that are bought together. Other businesses, such as upscale auction houses, use these algorithms to find customers to whom they can sell higher-value items.
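
Affinities between products are usually expressed through two measures: support (the share of all baskets that contain both items) and confidence (the share of baskets containing the first item that also contain the second). The following Python sketch computes both for item pairs only. The baskets and the minimum support threshold are hypothetical, and a production algorithm such as Apriori would also handle larger item sets.

```python
from collections import Counter
from itertools import combinations

def pair_rules(baskets, min_support=0.5):
    """Return (antecedent, consequent, support, confidence) for item pairs."""
    n = len(baskets)
    item_counts = Counter(item for b in baskets for item in set(b))
    pair_counts = Counter(
        pair for b in baskets for pair in combinations(sorted(set(b)), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue  # pair is too rare to be of interest
        rules.append((a, b, support, count / item_counts[a]))
        rules.append((b, a, support, count / item_counts[b]))
    return rules

# Hypothetical shopping baskets
baskets = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["bread", "jam"],
    ["butter", "milk"],
]

for antecedent, consequent, support, confidence in pair_rules(baskets):
    print(f"{antecedent} -> {consequent}: "
          f"support={support:.2f}, confidence={confidence:.2f}")
```

In this toy data every basket that contains milk also contains butter, so the rule from milk to butter has a confidence of 1.0 even though the pair appears in only half of the baskets.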

Risk Management: Insurance companies and mortgage businesses use data mining to uncover risks associated with potential customers.

Fraud Detection: Credit card companies use data mining to discover abnormal spending patterns of customers. Such patterns can expose fraudulent use of the cards.
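
One simple way to picture the detection of abnormal spending is a z-score check, sketched below. It assumes that a transaction is abnormal when it lies more than a chosen number of standard deviations from the customer's mean spend. The amounts and the threshold of 2.0 are hypothetical, and real systems rely on far richer models.

```python
import statistics

def flag_outliers(amounts, threshold=2.0):
    """Return the amounts whose z-score magnitude exceeds the threshold."""
    mean = statistics.mean(amounts)
    stdev = statistics.pstdev(amounts)
    if stdev == 0:
        return []  # all amounts identical, so nothing stands out
    return [x for x in amounts if abs(x - mean) / stdev > threshold]

# Hypothetical card transactions for one customer
amounts = [20, 25, 22, 30, 18, 27, 24, 21, 26, 950]

print(flag_outliers(amounts))
```

The threshold is set to 2.0 rather than the more familiar 3.0 because, in a sample this small, a single extreme value inflates the standard deviation so much that its own z-score cannot rise above 3.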

Delinquency Tracking: Loan companies use the technology to track customers who are likely to default on repayments.

Demand Prediction: Retail and other businesses use data mining to match demand and supply trends and to forecast demand for specific products.

Table 4 : Application of Data Mining Techniques

Application Area	Examples of Mining Functions	Mining Processes	Mining Techniques
Fraud Detection	Credit Card Frauds, Internal Audits, Warehouse Pilferage	Determination of Variation from Norms	Data Visualization, Memory-based Reasoning, Outlier Detection and Analysis
Risk Management	Credit Card Upgrades, Mortgage Loans, Customer Retention, Credit Rating	Detection and Analysis of Association, Affinity Grouping	Decision Trees, Memory-based Reasoning, Neural Networks
Market Analysis	Market Basket Analysis, Target Marketing, Cross Selling, Customer Relationship Management	Predictive Modeling, Database Segmentation	Cluster Detection, Decision Trees, Association Rules, Genetic Algorithms
