DATA WAREHOUSE
To facilitate strategic
decision-making, we need a new breed of information delivery environment,
called a data warehouse. The concept of a data warehouse given by Bill Inmon,
the father of data warehousing, is depicted in Figure-1.
Figure-1: What is Data Warehousing?
Subject-orientation: Data warehouse data are arranged and optimized to provide
answers to questions coming from diverse functional areas within a company.
Therefore, the data warehouse contains data organized and summarized by topic,
such as sales, marketing, finance, distribution, and transportation. For each
one of these topics the data warehouse contains specific subjects of interest -
products, customers, departments, regions, promotions, and so on. Note that
this form of data organization is quite different from the more functional or
process-oriented organization of typical transaction systems.
Time-variancy: We have already noted that the DSS data include a time element
(see Table-1). In contrast to the operational data, which focus on current
transactions, the warehouse data represent the flow of data through time. The
data warehouse can even contain projected data generated through statistical
and other models.
Non-volatility: Once data enter the data warehouse they are never removed. Because
the data in the data warehouse represent the company’s entire history, the
operational data, representing the near-term history, are always added to it. Because
data are never deleted and new data are always added, the data warehouse is
always growing. That is why the DSS DBMS must be able to support multi-gigabyte
and even multi-terabyte databases and multiprocessor hardware.
Integration: The data warehouse is a centralized, consolidated
database that integrates data derived from the entire organization. Thus the
data warehouse consolidates data from multiple and diverse sources with diverse
formats. Data integration implies a well-organized effort to define and
standardize all data elements. This integration effort can be time-consuming
but, once accomplished, it provides a unified view of the overall
organizational situation. Data integration enhances decision-making and helps
managers to better understand the company’s operations. This understanding can
be translated into recognition of strategic business opportunities.
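To make the standardization effort concrete, here is a minimal Python sketch of normalizing the kinds of representation differences mentioned in Table-1 below, such as telephone numbers stored with or without separators and condition flags recorded as T/F, 0/1, or Y/N. The field names and formats are hypothetical examples, not a prescribed method:

```python
# A minimal sketch of data-element standardization during integration.
# The source formats and field names here are hypothetical examples.

def standardize_phone(raw: str) -> str:
    """Strip separators so '033-29-70701' and '0332970701' compare equal."""
    return "".join(ch for ch in raw if ch.isdigit())

def standardize_flag(raw: str) -> bool:
    """Map T/F, 0/1 and Y/N representations onto a single boolean."""
    return str(raw).strip().upper() in {"T", "1", "Y", "TRUE", "YES"}

records = [
    {"phone": "033-29-70701", "active": "T"},  # from source system A
    {"phone": "0332970701", "active": "1"},    # from source system B
]

unified = [
    {"phone": standardize_phone(r["phone"]), "active": standardize_flag(r["active"])}
    for r in records
]
print(unified)  # both records now share one representation
```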
Table-1 summarizes the differences between the data in a data warehouse
and that in an operational database.
Table-1: Operational Data and Data Warehouse Data

| CHARACTERISTIC | OPERATIONAL DATABASE DATA | DATA WAREHOUSE DATA |
|---|---|---|
| Integrated | Similar data can have different representations or meanings. For example, telephone numbers may be stored as 033-29-70701 or as 0332970701, and a given condition may be labelled as T/F or 0/1 or Y/N. A sales value may be shown in thousands or in millions. | Provides a unified view of all data elements with a common definition and representation for all business units. |
| Subject-oriented | Data are stored with a functional, or process, orientation. For example, data may be stored for invoices, payments, credit amounts, and so on. | Data are stored with a subject orientation that facilitates multiple views of the data and facilitates decision making. For example, sales may be recorded by product, by division, by manager, or by region. |
| Time-variant | Data are recorded as current transactions. For example, the sales data may be the sale of a product on a given date, such as Rs. 342.78 on 12-AUG-2012. | Data are recorded with a historical perspective in mind. Therefore, a time dimension is added to facilitate data analysis and various time comparisons. |
| Non-volatile | Data updates are frequent and common. For example, an inventory amount changes with each sale. Therefore, the data environment is fluid. | Data cannot be changed. Data are only added periodically from historical systems. Once the data are properly stored, no changes are allowed. Therefore, the data environment is relatively static. |
THE LOAD MANAGER
Data flows into the data
warehouse through the load manager. The data is mostly extracted
from the operational database(s) and other internal sources (like archived historical
data), and supplemented by data imported from external sources. Externally sourced
data can greatly enhance the value of information generated from a data warehouse.
For example, Transco, the gas pipeline operator in the UK, uses weather forecast
data from the British Met Office on a regular basis to determine demand for gas
(the main source of energy used for heating homes and offices) in various areas
of the country. The weather data is fed into a model that incorporates
several other factors (e.g. day of the week, internal data about customers’
usage patterns, demographic and economic profile data, alternate sources of
energy, types of buildings in the area) to arrive at a demand forecast. Types
of data from external sources that may be included in a data warehouse are:
financial indicators and statistics of the industry, market share data of competitors,
demographic data, weather data, credit worthiness data, readership / viewer
survey data for advertising media, specially commissioned surveys and so on.
External data is usually obtained from commercial database services or
government agencies (e.g. Equifax, Reuters, Met Office, census agency, industry
associations, stock exchanges, local government statistics service). The data
from such diverse sources will obviously be in different incompatible formats
and will be distributed through various media. Some of them may be available in
a downloadable format on the Internet; others may be distributed on CD-ROMs,
while some may only be available on printed media. Some data may be available
for free but most data (particularly when used for commercial purposes) have to
be purchased.
The load manager
primarily performs what is termed an Extract-Transform-Load (ETL) operation.
-Data Extraction
-Data Transformation
-Data Loading
Data Extraction: This function has to deal
with numerous data sources. Appropriate techniques have to be employed for each
data source. Source data may be from different source machines in diverse data
formats. Part of the source data may be in relational database systems. Some
data may be on other legacy network and hierarchical data models. Many data
sources may still be in flat files. There may also be the need to include data
from spreadsheets and local departmental data sets. Data extraction can become
quite a complex operation at times.
Various tools are
available on the market for data extraction. Use of outside tools may be
considered suitable for certain data sources. For the other data sources,
inhouse programs may need to be developed to do the data extraction. Purchasing
outside tools may entail high initial costs. In-house programs, on the other
hand, may mean ongoing costs for development and maintenance.
After extraction, the
data needs to be kept somewhere for further preparation. Sometimes the
extraction function is performed in the legacy platform itself if that approach
suits the designed framework. More frequently, data warehouse implementation
teams extract the source data into a separate physical environment from which
moving the data into the data warehouse would be easier. In the separate environment,
the source data may be extracted into a group of flat files, or an intermediate
relational database, or a combination of both. This physical environment is
called the data-staging area.
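To picture extraction from mixed sources into a staging area, here is a small Python sketch. The database file, table name, legacy file name, and record layout are all assumptions for illustration; a real extraction would target whatever sources the organization actually has:

```python
# A sketch of data extraction into a flat-file staging area.
# 'sales.db', its 'sales' table, and 'legacy_sales.csv' are hypothetical.
import csv
import sqlite3

STAGING_FILE = "staging_sales.csv"

def extract_from_relational(db_path: str) -> list:
    """Pull rows from a relational source (here, SQLite for simplicity)."""
    with sqlite3.connect(db_path) as conn:
        return conn.execute("SELECT product, store, qty, amount FROM sales").fetchall()

def extract_from_flat_file(path: str) -> list:
    """Pull rows from a legacy flat file holding the same logical fields."""
    with open(path, newline="") as f:
        return [tuple(row) for row in csv.reader(f)]

def stage(rows: list) -> None:
    """Append extracted rows to the staging area for later transformation."""
    with open(STAGING_FILE, "a", newline="") as f:
        csv.writer(f).writerows(rows)

# Example usage, once the hypothetical sources exist:
# stage(extract_from_relational("sales.db"))
# stage(extract_from_flat_file("legacy_sales.csv"))
```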
Data Transformation: In every system
implementation, data conversion is an important function. For example, when
implementing an operational system such as a magazine subscription application,
the database has to be initially populated with data from the existing system
records. The conversion may either be from a manual system or from a
file-oriented system to a modern system supported with relational database
tables. In either case, the data will need to be converted from the existing systems.
So, what is so different for a data warehouse? Why is data transformation for a
data warehouse more involved than that for an operational system?
As already discussed,
data for a data warehouse comes from many disparate sources. If data extraction
for a data warehouse poses great challenges, data transformation presents even
greater challenges. Another factor in the data warehouse is that the data feed
is not just an initial one-time load. The ongoing changes will have to continue
to be picked up from the source systems. Any transformation tasks set up
for the initial load will have to be adapted for the ongoing revisions as well.
A number of individual
tasks are performed as part of data transformation. First, the data extracted
from each source is cleaned. Cleaning may be correction of misspellings, or may
include resolution of conflicts between state codes and pin codes in the
source data, or may deal with providing default values for missing data elements,
or elimination of duplicates when the same data is brought in from
multiple source systems.
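The cleaning tasks just listed might look like the following Python sketch; the spelling corrections, default values, and record layout are invented for illustration:

```python
# A sketch of typical data-cleaning steps: spelling correction, default
# values for missing elements, and duplicate elimination.
# The correction map and defaults are hypothetical.
SPELLING_FIXES = {"Kolkatta": "Kolkata", "Bombey": "Mumbai"}
DEFAULTS = {"region": "UNKNOWN"}

def clean(record: dict) -> dict:
    cleaned = dict(record)
    # Correct known misspellings.
    city = cleaned.get("city")
    cleaned["city"] = SPELLING_FIXES.get(city, city)
    # Provide default values for missing data elements.
    for field, default in DEFAULTS.items():
        if not cleaned.get(field):
            cleaned[field] = default
    return cleaned

raw = [
    {"city": "Kolkatta", "region": None},
    {"city": "Kolkata", "region": None},  # duplicate once cleaned
]

# Eliminate duplicates brought in from multiple source systems.
seen, deduped = set(), []
for rec in map(clean, raw):
    key = tuple(sorted(rec.items()))
    if key not in seen:
        seen.add(key)
        deduped.append(rec)
print(deduped)  # a single cleaned record survives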
Standardization of data elements forms a large part of data transformation. The
data types and field lengths for the same data elements retrieved from the various
sources need to be standardized. Semantic standardization is another major
task. Synonyms and homonyms have to be resolved. Resolution of
synonyms is required when two or more terms from different source systems mean
the same thing. On the other hand, when a single term means many different
things in different source systems, resolution of homonyms has to be
performed.
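One simple way to picture semantic standardization is a pair of mapping tables, one folding synonyms onto a standard term and one disambiguating homonyms by their source system. The terms and system names in this sketch are hypothetical:

```python
# A sketch of semantic standardization. The mappings are hypothetical.

# Synonym resolution: different source terms mean the same thing.
SYNONYMS = {
    "client": "customer",   # term used in the billing system
    "account": "customer",  # term used in the CRM system
}

def standard_term(source_term: str) -> str:
    return SYNONYMS.get(source_term.lower(), source_term.lower())

# Homonym resolution works the other way: one term means different
# things depending on which source system it came from.
HOMONYMS = {
    ("sales_system", "unit"): "sales_unit",
    ("hr_system", "unit"): "department",
}

def resolve_homonym(system: str, term: str) -> str:
    return HOMONYMS.get((system, term), term)

print(standard_term("Client"))              # -> customer
print(resolve_homonym("hr_system", "unit")) # -> department
```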
Data transformation
involves many forms of combining pieces of data from the different sources. In
some cases, data from a single source record or related data elements from many
source records are combined. In other situations, data transformation may also
involve purging source data that is not useful and/or separating out source
records into new combinations. During data transformation sorting and merging
of data takes place on a large scale in the data staging area.
In many cases, the keys
chosen for the operational systems are field values with built-in meanings. For
example, the product key value may be a combination of characters indicating
the product category, the code of the warehouse where the product is stored,
and some code to show the production batch. Primary keys in the data warehouse
cannot have built-in meanings. Therefore, data transformation also includes the
assignment of surrogate keys derived from the source system primary keys.
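Surrogate key assignment can be sketched as a lookup table that hands out meaningless sequential keys in place of the meaningful source-system keys; the product key format below is invented:

```python
# A sketch of surrogate key assignment: meaningful operational keys
# (e.g. 'ELEC-WH3-B42': category, warehouse, batch) are replaced by
# meaningless sequential keys in the warehouse. Key formats are hypothetical.
import itertools

_next_key = itertools.count(1)
_key_map = {}

def surrogate_key(source_key: str) -> int:
    """Return a stable surrogate key for a given source-system key."""
    if source_key not in _key_map:
        _key_map[source_key] = next(_next_key)
    return _key_map[source_key]

print(surrogate_key("ELEC-WH3-B42"))  # -> 1
print(surrogate_key("TOYS-WH1-B07"))  # -> 2
print(surrogate_key("ELEC-WH3-B42"))  # -> 1 (same source key, same surrogate)
```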
A grocery chain
point-of-sale operational system keeps the unit sales and revenue amounts by
individual transactions at the checkout counter at each store. But in the data
warehouse, it may not be necessary to keep the data at this detailed level. It may
be more appropriate to summarize the totals by product at each store for a given
day and keep the summary totals of the sale units and revenue in the data warehouse’s
storage. In such cases, the data transformation function would include such
summarization processing.
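The grocery-chain example might be summarized as in the sketch below; the transaction records are made up, and in practice this rollup would run inside the ETL tooling or the staging database:

```python
# A sketch of summarization: point-of-sale transactions are rolled up
# to (store, product, day) totals before loading. The data is invented.
from collections import defaultdict

transactions = [
    {"store": "S1", "product": "milk", "date": "2012-08-12", "units": 2, "revenue": 68.0},
    {"store": "S1", "product": "milk", "date": "2012-08-12", "units": 1, "revenue": 34.0},
    {"store": "S2", "product": "milk", "date": "2012-08-12", "units": 5, "revenue": 170.0},
]

totals = defaultdict(lambda: {"units": 0, "revenue": 0.0})
for t in transactions:
    key = (t["store"], t["product"], t["date"])
    totals[key]["units"] += t["units"]
    totals[key]["revenue"] += t["revenue"]

for key, agg in totals.items():
    print(key, agg)  # one summary row per store/product/day
```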
The end result of the
data transformation function is a collection of integrated data that is
cleaned, standardized, and summarized. Now the data is ready to be loaded into
each data set in the data warehouse.
Data Loading: Two distinct groups of
tasks form the data loading function. After completion of the design and
construction of the data warehouse, when it goes live for the first time, the
initial loading of data is done. The initial load moves large volumes of data
and takes a substantial amount of time, but it is a one-time effort. As the data
warehouse starts functioning, extraction of additions (and changes) to the source
data continues on an ongoing basis, together with the transformation and loading
operations.
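The split between the one-time initial load and the ongoing incremental loads can be sketched as two entry points over one loading routine. The table layout and the change-capture mechanism (a simple "last loaded" date) are simplifying assumptions:

```python
# A sketch of the two data-loading modes: a one-time initial load and
# ongoing incremental loads of additions/changes. The table name and the
# date-based change capture are hypothetical simplifications.
import sqlite3

def load(conn, rows):
    conn.executemany(
        "INSERT INTO sales_fact (product, store, day, units) VALUES (?, ?, ?, ?)",
        rows,
    )
    conn.commit()

def initial_load(conn, all_history):
    """Run once when the warehouse goes live: large volume, one-time effort."""
    load(conn, all_history)

def incremental_load(conn, source_rows, last_loaded_day):
    """Run on an ongoing basis: pick up only rows newer than the last load."""
    load(conn, [r for r in source_rows if r[2] > last_loaded_day])

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_fact (product TEXT, store TEXT, day TEXT, units INT)")
initial_load(conn, [("milk", "S1", "2012-08-10", 3)])
incremental_load(conn, [("milk", "S1", "2012-08-10", 3),
                        ("milk", "S1", "2012-08-12", 2)], "2012-08-10")
print(conn.execute("SELECT * FROM sales_fact").fetchall())
```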
THE QUERY MANAGER
The query manager provides
an interface between the data warehouse and its users. It performs tasks like
directing the queries to the appropriate tables, generating views on an ad-hoc
basis if required, monitoring the effectiveness of indexes and summary data,
and query scheduling.
Data Warehouse Design Considerations
The key considerations
involved in the design of a data warehouse are:
-Time Span
-Granularity
-Dimensionality
-Aggregations
-Partitioning
Time span: Operational data
represent current (atomic) transactions. Such transactions might define a
purchase order, a sales invoice, an inventory movement, and so on. In short,
operational data cover a short time frame. In contrast, data warehouse data
tend to cover a longer time frame. Managers are seldom interested in a specific
sales invoice to customer X; rather they tend to focus on sales generated
during the last month, the last year, or the last five years. Rather than
concern themselves with a single customer purchase, they might be interested in
the buying pattern of such a customer or groups of customers. In short, data
warehouse data tend to be historic in nature. That is, the data warehouse data
represent company transactions up to a given point in time, yesterday, last
week, last month, and the like. The time period for which data is held in the
data warehouse is determined by the data analysis requirements of the users of
the data warehouse. These needs, in turn, arise from the changes in the
business environment that a particular organization needs to monitor, in its
effort to stay ahead of its competitors. Since a data warehouse’s size depends
on the span of time for which data is stored, the time span covered by the data
warehouse is an important design consideration. If, for example, the
environment changes rapidly, the data required for analysis would relate more
often to the recent past, rather than that over several years or decades.
In that case the designers of the data warehouse need to consider whether or
not the cost incurred in holding data for indefinitely long time spans would be
worthwhile.
Granularity: According to Inmon,
the single most important design aspect of a data warehouse is the decision on
granularity. It refers to the level of detail or summarization available in
units of data in the data warehouse. The more detail there is, the lower the
level of granularity. The less detail there is, the higher the level of
granularity.
Operational data
represent specific transactions that occur at a given time, such as customer
purchase of product X in store A. Thus, in operational systems, granularity is
taken for granted to be at the lowest level. Data warehouse data must be presented
at different levels of aggregation, from highly summarized to near atomic. This
requirement is based on the fact that managers at different levels in the organization
require data with different levels of aggregation. It is also possible that a single
problem requires data with different summarization levels. For example, if a manager
must analyze sales by region, (s)he must be able to access data showing the sales
by region, by city within the region, by store within the city within the
region, and so on. In this case, the manager requires summarized data to
compare the regions, but (s)he also needs data in a structure that enables him
or her to decompose (drill down) the data into more atomic components (that is,
data at lower levels of aggregation). For example, it is necessary to be able
to drill down to the stores within the region in order to compare store
performance by region. Granularity level in a data warehouse cannot, therefore,
be assumed.
The decision on
granularity level profoundly affects both the volume of data that resides in
the data warehouse, and the type of query that can be answered. A trade-off
exists between the volume of data in the data warehouse and the level of detail
of queries. Some data warehouses are designed to support dual granularity.
In such environments some data (usually the most recent) is held at a
relatively low level of granularity, while the rest is held in more summarized
form (i.e. at a higher granularity level). This enables detailed analysis while
at the same time allowing a reduction in data volume.
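The region / city / store drill-down described above amounts to aggregating the same atomic data at successively lower levels of granularity. A minimal sketch, with invented sales rows:

```python
# A sketch of granularity levels: the same atomic sales rows aggregated
# by region, then drilled down to city and store. The data is invented.
from collections import defaultdict

sales = [
    {"region": "East", "city": "Kolkata", "store": "S1", "amount": 120.0},
    {"region": "East", "city": "Kolkata", "store": "S2", "amount": 80.0},
    {"region": "West", "city": "Mumbai", "store": "S3", "amount": 200.0},
]

def rollup(rows, *levels):
    """Sum amounts at the chosen level of granularity."""
    totals = defaultdict(float)
    for r in rows:
        totals[tuple(r[l] for l in levels)] += r["amount"]
    return dict(totals)

print(rollup(sales, "region"))                   # highly summarized
print(rollup(sales, "region", "city"))           # drill down one level
print(rollup(sales, "region", "city", "store"))  # near-atomic detail
```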
Dimensionality: This is probably the most
distinguishing characteristic of a data warehouse. From the data analyst’s
point of view, the data is always related in many different ways. For example,
when we analyze product sales by a customer during a given time span, we are
likely to ask how many widgets of type X were sold to customer Y during the
last six months. In fact, the question tends to expand quickly to include many
different data dimensions. For instance, we might want to know how
product X fared relative to product Z during the past six months, by region,
state, city, store, and customer (or sales of various products by quarters by
country). In this case, both place and time are part of the picture. In
general, data analysis tends to include many data dimensions, producing a
multidimensional view of the data.
The data model used for
modelling data warehouses is known as the dimensional model. The
numerical measurements related to the business (like sales volumes) are stored
in fact tables. The descriptions of the dimensions are stored in dimension
tables. The number and types of dimensions and facts that are to be stored
in a data warehouse are a very important design decision, and (much like the
decision on granularity) affects both the data volume and the types of analysis
that can be supported.
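In relational terms, a dimensional model is commonly implemented as a star schema: one fact table holding the numerical measurements and foreign keys into the dimension tables. A minimal sketch, with all table and column names assumed for illustration:

```python
# A sketch of a dimensional (star-schema) model: a fact table of sales
# measurements keyed to product, store and time dimension tables.
# All table and column names are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE store_dim   (store_key   INTEGER PRIMARY KEY, city TEXT, region TEXT);
CREATE TABLE time_dim    (time_key    INTEGER PRIMARY KEY, day TEXT, quarter TEXT);

CREATE TABLE sales_fact (
    product_key INTEGER REFERENCES product_dim(product_key),
    store_key   INTEGER REFERENCES store_dim(store_key),
    time_key    INTEGER REFERENCES time_dim(time_key),
    units       INTEGER,  -- the numerical measurements (facts)
    revenue     REAL
);
""")

# A query along two dimensions: revenue by region and quarter.
query = """
SELECT s.region, t.quarter, SUM(f.revenue)
FROM sales_fact f
JOIN store_dim s ON f.store_key = s.store_key
JOIN time_dim  t ON f.time_key  = t.time_key
GROUP BY s.region, t.quarter;
"""
print(conn.execute(query).fetchall())  # empty until facts are loaded
```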
Aggregations: We have seen how data analysis queries directed at data warehouses
involve dimensions. Another very common type of query directed at data
warehouses involves sums of values along the different dimensions. For example:
what is the total sales volume of LAPTOP during the past 4 quarters? Answering
this query using the fact and dimension tables would involve summing up the
individual sales volume figures over the 4 quarters and across the other
dimensions involved (stores, regions, and so on). In real situations, similar queries
might involve retrieving and summing hundreds or thousands of individual
values. To avoid excessive processing load on the data warehouse arising from
frequently asked queries of this type, it is often decided, at design time, to
store some pre-calculated aggregations, along with the base facts,
in the data warehouse. This decision affects the data volume and performance of
certain types of queries.
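Storing pre-calculated aggregations alongside the base facts can be pictured as materializing the answers to frequent summary queries at load time; the aggregate's grain and record layout below are assumptions:

```python
# A sketch of pre-calculated aggregation: quarterly sales totals are
# computed once at load time and stored alongside the base facts, so a
# frequent query reads one row instead of summing thousands of values.
# The record layout is hypothetical.
from collections import defaultdict

base_facts = [  # (product, quarter, units) at the base granularity
    ("LAPTOP", "2012-Q1", 10), ("LAPTOP", "2012-Q1", 7),
    ("LAPTOP", "2012-Q2", 12), ("LAPTOP", "2012-Q3", 9),
]

# Materialize the aggregate once, during loading.
quarterly_totals = defaultdict(int)
for product, quarter, units in base_facts:
    quarterly_totals[(product, quarter)] += units

# The frequent summary query now becomes a cheap lookup.
print(quarterly_totals[("LAPTOP", "2012-Q1")])  # -> 17
```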
Partitioning: One of the essential requirements of a data warehouse is flexible
data storage, management, and access. When data resides in large physical units,
among other things it cannot be:
-indexed easily
-sequentially scanned, if needed
-restructured easily
-backed up conveniently
-recovered easily
-monitored easily
In short, having a big
mass of data defeats much of the purpose of the data warehouse. The purpose of partitioning
is to break the data into smaller (more manageable) physical units of
storage. The criteria used for dividing the data can be: date, line of business
/ product category, geography / location, organizational / administrative unit,
or any combination of these.
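Physically, partitioning often comes down to routing each record to a separate storage unit chosen by the partitioning criterion. Here is a sketch that partitions by date; the records and file naming are invented:

```python
# A sketch of partitioning: fact rows are routed to separate physical
# files by date, so each partition can be loaded, indexed, backed up or
# dropped independently. File naming and record layout are hypothetical.
from collections import defaultdict

rows = [
    {"day": "2012-08-12", "store": "S1", "amount": 120.0},
    {"day": "2012-09-03", "store": "S2", "amount": 75.0},
    {"day": "2012-09-17", "store": "S1", "amount": 60.0},
]

partitions = defaultdict(list)
for r in rows:
    month = r["day"][:7]                # partition criterion: month
    partitions[f"sales_{month}.csv"].append(r)

for filename, part in partitions.items():
    print(filename, len(part), "rows")  # one manageable unit per month
```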