DATA MINING
According to Berry and Linoff, data mining is the exploration and analysis, by
automatic or semiautomatic means, of large quantities of data in order to
discover meaningful patterns and rules. This definition justifiably raises the
question: how does data mining differ from OLAP? OLAP (Online Analytical
Processing) is undoubtedly a semiautomatic means of analyzing data, but the
main difference lies in the quantities of data that can be handled. There are
other differences as well. Tables 1 and 2 summarize these differences.
Table 1: OLAP vs. Data Mining (Past vs. Future)
OLAP: Report on the past | Data Mining: Predict the future
Who were our top 100 customers for the last three years? | Which 100 customers offer the best profit potential?
Which customers defaulted on their mortgages in the last two years? | Which customers are likely to be bad credit risks?
What were the sales by territory last quarter compared to the targets? | What are the anticipated sales by territory and region for next year?
Which salespersons sold more than their quota during the last four quarters? | Which salespersons are expected to exceed their quotas next year?
Last year, which stores exceeded the total prior year sales? | For the next two years, which stores are likely to have the best performance?
Last year, which were the top five promotions that performed well? | What is the expected return for next year’s promotions?
Which customers switched to other phone companies last year? | Which customers are likely to switch to the competition next year?
Table 2: Differences between OLAP and Data Mining
FEATURES | OLAP | DATA MINING
Motivation for information request | What is happening in the enterprise? | Predict the future based on why this is happening
Data granularity | Summary data | Detailed transaction-level data
Number of business dimensions | Limited number of dimensions | Large number of dimensions
Number of dimension attributes | Small number of attributes | Many dimension attributes
Sizes of datasets for the dimensions | Not large for each dimension | Usually very large for each dimension
Analysis approach | User-driven, interactive analysis | Data-driven automatic knowledge discovery
Analysis techniques | Multidimensional, drill-down, and slice-and-dice | Prepare data, launch mining tool and sit back
State of the technology | Mature and widely used | Still emerging; some parts of the technology are more mature
Why Now?
Why is data mining being
put to use in more and more businesses? Here are some basic reasons:
• In today’s world, an organization generates more information in
a week than most people can read in a lifetime. It is humanly impossible to
study, decipher, and interpret all that data to find useful patterns.
• A data warehouse pools all the data after proper transformation
and cleansing into well-organized data structures. Nevertheless, the sheer
volume of data makes it impossible for anyone to use analysis and query tools
to discern useful patterns.
• In recent times, many data mining tools suitable for a wide range of
applications have appeared in the market. The tools and products are now
mature enough for business use.
• Data mining needs substantial computing power. Parallel
hardware, databases, and other powerful components are available and are
becoming very affordable.
• Organizations are placing enormous emphasis on building sound
customer relationships, and for good reasons. Companies want to know how they
can sell more to existing customers. Organizations are interested in
determining which of their customers will prove to be of long-term value to
them. Companies need to discover any existing natural classifications among
their customers so that each such class may be properly targeted with
products and services. Data mining enables companies to find answers and
discover patterns in their customer data.
• Finally, competitive pressure pushes organizations to get into data
mining. Perhaps their competitors are already using it.
Data Mining Techniques
Data mining covers a broad range of techniques. Each technique has been
heavily researched in recent years, and several mature and efficient algorithms
have evolved for each of them. The main techniques are: cluster detection,
decision trees, memory-based reasoning, link analysis, rule induction,
association rule discovery, outlier detection and analysis, neural networks,
genetic algorithms, and sequential pattern discovery. The algorithms associated
with the various techniques are not discussed here for two main reasons: first,
they are too mathematical and technical in nature, and second, numerous
well-written textbooks serve the needs of those who are especially interested
in the subject. Table 3 below summarizes the important features of some of
these techniques. The model structure refers to how the technique is
perceived, not how it is actually implemented. For example, a decision tree
model may actually be implemented through SQL statements. In the framework, the
basic process is the process performed by the particular data mining technique.
For example, decision trees perform the process of splitting at decision
points. How a technique validates the model is important. In the case of neural
networks, the technique does not contain a validation method to determine
termination. The model calls for processing the input records through the
different layers of nodes and terminates the discovery at the output node.
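To make "splitting at decision points" more concrete, here is a minimal Python sketch of how a decision tree might choose its first split using entropy. The customer records, attribute names, and the helper functions `entropy`, `information_gain`, and `best_split` are all illustrative assumptions, not taken from any particular tool.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def information_gain(rows, labels, attribute):
    """Reduction in entropy obtained by splitting the rows on one attribute."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def best_split(rows, labels):
    """Pick the attribute whose split gives the largest information gain."""
    return max(rows[0].keys(), key=lambda a: information_gain(rows, labels, a))

# Hypothetical customer records and outcomes (made up for illustration).
rows = [
    {"income": "low", "owns_home": "no"},
    {"income": "low", "owns_home": "yes"},
    {"income": "high", "owns_home": "no"},
    {"income": "high", "owns_home": "yes"},
]
labels = ["default", "default", "ok", "ok"]

print(best_split(rows, labels))
```

In this toy data, splitting on `income` separates the two outcomes completely, while `owns_home` tells us nothing, so the sketch picks `income` as the first decision point. A real tree would repeat the same choice recursively on each branch.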
Table 3: Summary of Data Mining Techniques
Data Mining Technique | Underlying Structure | Basic Process | Validation Method
Cluster Detection | Distance calculation in n-vector space | Grouping of values in the same neighbourhood | Cross validation to verify accuracy
Decision Trees | n-ary tree | Splits at decision points based on entropy | Cross validation
Memory-based Reasoning | Predictive structure based on distance and combination functions | Association of unknown instances with known instances | Cross validation
Link Analysis | Graphs | Discover links among variables by their values | Not applicable
Neural Networks | Forward propagation network | Weighted inputs of predictors at each node | Not applicable
Genetic Algorithms | Fitness functions | Survival of the fittest on mutation of derived values | Mostly cross validation
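The memory-based reasoning row can be illustrated with a small nearest-neighbour sketch in Python. It assumes Euclidean distance as the distance function and a majority vote as the combination function, and it uses leave-one-out testing as a simple form of cross validation. The records and the function names `distance`, `classify`, and `leave_one_out_accuracy` are hypothetical.

```python
import math
from collections import Counter

def distance(a, b):
    """Distance function: Euclidean distance between two numeric records."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def classify(known, unknown, k=3):
    """Combination function: majority vote among the k nearest known records."""
    nearest = sorted(known, key=lambda item: distance(item[0], unknown))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def leave_one_out_accuracy(known, k=3):
    """Cross validation: hold out each record in turn and try to predict it."""
    hits = 0
    for i, (record, label) in enumerate(known):
        rest = known[:i] + known[i + 1:]
        if classify(rest, record, k) == label:
            hits += 1
    return hits / len(known)

# Hypothetical known instances: (age, income in thousands) -> credit risk.
known = [
    ((25, 30), "bad"),
    ((30, 35), "bad"),
    ((28, 32), "bad"),
    ((45, 90), "good"),
    ((50, 100), "good"),
    ((55, 110), "good"),
]

print(classify(known, (48, 95)))
print(leave_one_out_accuracy(known))
```

The unknown instance is associated with the known instances closest to it, which is the basic process the table describes. In practice the attributes would usually be scaled first so that no single attribute dominates the distance.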
Data Mining Applications
Data mining technology
encompasses a rich collection of proven techniques that cover a wide range of
applications in both the commercial and non-commercial realms. In some cases,
multiple techniques are used, back to back, to greater advantage. For instance,
a cluster detection technique to identify clusters of customers may be followed
by a predictive algorithm applied to some of the identified clusters to
discover the expected behaviour of the customers in those clusters.
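The back-to-back use of techniques can be sketched in Python as follows. This is only an illustration under stated assumptions: cluster detection is done with plain k-means from fixed starting centroids, and the "predictive algorithm" is reduced to the mean known spend within each cluster. The customer data and all function names are hypothetical.

```python
def nearest_index(point, centroids):
    """Index of the centroid closest to the point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )

def k_means(points, centroids, iterations=10):
    """Step 1: cluster detection with plain k-means from given start centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for point in points:
            clusters[nearest_index(point, centroids)].append(point)
        # Move each centroid to the mean of its members; keep it if it has none.
        centroids = [
            tuple(sum(dim) / len(members) for dim in zip(*members)) if members else old
            for members, old in zip(clusters, centroids)
        ]
    return centroids

def mean_spend_per_cluster(customers, centroids):
    """Step 2: a naive predictor, the mean known spend within each cluster."""
    totals = [0.0] * len(centroids)
    counts = [0] * len(centroids)
    for features, spend in customers:
        i = nearest_index(features, centroids)
        totals[i] += spend
        counts[i] += 1
    return [t / n if n else None for t, n in zip(totals, counts)]

# Hypothetical customers: ((visits per month, items per visit), yearly spend).
customers = [
    ((1, 2), 100.0), ((2, 1), 120.0), ((1, 1), 80.0),
    ((9, 8), 900.0), ((8, 9), 1000.0), ((9, 9), 1100.0),
]
points = [features for features, _ in customers]
centroids = k_means(points, centroids=[points[0], points[3]])
predicted = mean_spend_per_cluster(customers, centroids)

new_customer = (8, 8)
print(predicted[nearest_index(new_customer, centroids)])
```

A new customer is first assigned to the nearest cluster and then given that cluster's expected spend. A real application would replace the second step with a proper predictive model trained on each cluster.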
Non-commercial use of data mining is strong and pervasive in the research
area. In oil exploration and research, data mining techniques help identify
locations suitable for drilling based on potential mineral and oil deposits.
Pattern discovery and matching techniques have military applications in
helping to identify targets. Medical research is a field ripe for data mining.
The technology helps researchers discover correlations between diseases and
patient characteristics. Crime investigation agencies use the technology to
connect criminal profiles to crimes. In astronomy and cosmology, data mining
helps predict cosmic events.
The scientific community
makes use of data mining to a moderate extent, but the technology has
widespread applications in the commercial arena. Most of the tools target the
commercial sector. Consider the following list of a few major applications of
data mining in the business area.
Customer Segmentation: This is one of the most widespread applications. Businesses use
data mining to understand their customers. Cluster detection algorithms
discover clusters of customers sharing the same characteristics.
Market Basket Analysis: This is a very useful application for the retail industry.
Association rule algorithms uncover affinities between products that are bought
together. Other businesses such as upscale auction houses use these algorithms
to find customers to whom they can sell higher-value items.
Risk Management: Insurance companies and mortgage businesses use data mining to
uncover risks associated with potential customers.
Fraud Detection: Credit card companies use data mining to discover abnormal spending
patterns of customers. Such patterns can expose fraudulent use of the cards.
Delinquency Tracking: Loan companies use the technology to track customers who are
likely to default on repayments.
Demand Prediction: Retail and other businesses use data mining to match demand and
supply trends and to forecast demand for specific products.
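As an illustration of the market basket analysis item above, the following Python sketch counts how often pairs of products appear together and derives simple association rules from support and confidence. The baskets, the thresholds, and the function name `pair_rules` are assumptions chosen for the example. Production algorithms such as Apriori handle larger item sets far more efficiently.

```python
from itertools import combinations

def pair_rules(baskets, min_support=0.5, min_confidence=0.6):
    """Return (antecedent, consequent, support, confidence) for item pairs."""
    n = len(baskets)
    item_count = {}
    pair_count = {}
    for basket in baskets:
        items = sorted(set(basket))
        for item in items:
            item_count[item] = item_count.get(item, 0) + 1
        for pair in combinations(items, 2):
            pair_count[pair] = pair_count.get(pair, 0) + 1

    rules = []
    for (a, b), count in pair_count.items():
        support = count / n  # share of baskets containing both items
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_count[lhs]  # P(rhs in basket | lhs in basket)
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return rules

# Hypothetical shopping baskets (made up for illustration).
baskets = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["bread", "jam"],
    ["milk", "butter"],
]

for lhs, rhs, support, confidence in pair_rules(baskets):
    print(f"{lhs} -> {rhs}: support={support:.2f}, confidence={confidence:.2f}")
```

Here every basket containing milk also contains butter, so the rule "milk -> butter" has a confidence of 1.0. This is the kind of affinity a retailer could use for shelf placement or cross selling.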
Table 4: Application of Data Mining Techniques
Application Area | Examples of Mining Functions | Mining Processes | Mining Techniques
Fraud Detection | Credit card frauds; Internal audits; Warehouse pilferage | Determination of variation from norms | Data visualization; Memory-based reasoning; Outlier detection and analysis
Risk Management | Credit card upgrades; Mortgage loans; Customer retention; Credit rating | Detection and analysis of association; Affinity grouping | Decision trees; Memory-based reasoning; Neural networks
Market Analysis | Market basket analysis; Target marketing; Cross selling; Customer relationship management | Predictive modeling; Database segmentation | Cluster detection; Decision trees; Association rules; Genetic algorithms
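The "determination of variation from norms" process in the fraud detection row can be sketched in Python with a very simple outlier test. It assumes that a z-score over one customer's transaction amounts is an acceptable notion of the norm. The amounts, the threshold, and the function name `flag_outliers` are hypothetical.

```python
import statistics

def flag_outliers(amounts, threshold=2.0):
    """Return the amounts lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(amounts)
    stdev = statistics.pstdev(amounts)
    if stdev == 0:
        return []  # all amounts identical, nothing deviates from the norm
    return [x for x in amounts if abs(x - mean) / stdev > threshold]

# Hypothetical card transactions for one customer (made up for illustration).
amounts = [42, 38, 51, 45, 40, 47, 39, 950]

print(flag_outliers(amounts))
```

A single extreme value inflates both the mean and the standard deviation, so real systems often prefer more robust measures such as the median and consider many attributes at once. The sketch only shows the basic idea of comparing each record against the norm.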