DATA WAREHOUSE AND THE WEB
Professor Peter Drucker, the senior guru of management practice, has admonished IT
executives to look outside their enterprises for information. He remarked
that the single biggest challenge is to organize outside data because change
occurs from the outside. He predicted that the obsession with internal data
would lead to being blindsided by external forces.
Most data warehousing efforts focus the enterprise inward; however, the enterprise should be keenly alert to its external environment. As markets become turbulent, an enterprise must know more about its customers, suppliers, competitors, government agencies, and many other external factors. Changes in the external environment ultimately get reflected in the internal data (and would be detected by the various data analysis tools discussed in later sections), but by then it may be too late for the enterprise to respond. Proactive action is always better than reacting to external changes after the effects are felt. The conclusion is that information from internal systems must be enhanced with external information; the synergy of the combination creates the greatest business benefit.
External data is thus important, and integrating it with internally sourced data poses its own challenges for the Load Manager. Externally sourced data, particularly time-sensitive data, is often distributed through the Internet.
Reliability of Web Content
Many question the reliability of Web content, as they should. However, few analyze the reliability issue in any depth. The Web is a global bulletin board on which both the wise and the foolish have equal space. The mere fact that content was acquired from the Web should, by itself, reflect neither positively nor negatively on its quality.
Consider the following
situation: If you hear, “Buy IBM stock because it will double over the next
month,” your reaction should depend on who made that statement and in what
context. Was it a random conversation overheard on the subway, a chat with a
friend over dinner, or a phone call from a trusted financial advisor? The context
should also be considered when judging the reliability of Web content.
Think of Web resources in terms of quality and coverage, as shown in Figure 1 below.
Figure 1: Web-based Information Sources
Toward the top are
information resources of high quality (accuracy, currency, and validity), and
resources toward the right have a wide coverage (scope, variety, and diversity).
The interesting aspect of the Web is that information resources occupy all quadrants.
In the upper center, the
commercial online database vendors traditionally have supplied business with
high-quality information about numerous topics. However, the complexity of
using these services and the infrequent update cycles have limited their usefulness.
Farther to the left, governmental databases have become tremendously useful in recent years. Public information that once required many hours of manual labour at libraries or government offices is now readily accessible. Developments like the Electronic Data Gathering, Analysis, and Retrieval (EDGAR) database maintained by the U.S. Securities and Exchange Commission provide valuable and up-to-date data via the Web.
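To make the EDGAR example concrete, here is a minimal sketch of pulling a company's recent filings from the SEC's public submissions endpoint. The URL layout, the field names (filings, recent, form, filingDate), and the User-Agent requirement reflect the publicly documented data.sec.gov API, but verify them against the SEC's current documentation before relying on them; the CIK shown is Apple's.

```python
import json
import urllib.request

CIK = "0000320193"  # Apple Inc.; CIK values are zero-padded to 10 digits
URL = f"https://data.sec.gov/submissions/CIK{CIK}.json"

# The SEC asks automated clients to identify themselves via User-Agent.
req = urllib.request.Request(URL, headers={"User-Agent": "you@example.com"})
with urllib.request.urlopen(req) as resp:
    data = json.load(resp)

# The response holds parallel arrays describing the most recent filings.
recent = data["filings"]["recent"]
for form, date in list(zip(recent["form"], recent["filingDate"]))[:5]:
    print(date, form)
```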
At the left, corporate Web sites often contain vast amounts of useful information in white papers, product demos, and press releases, eliminating the need to attend trade exhibits to learn the “latest and greatest” in a marketplace.
Finally, the “doubtful-quality free” content occupies the lower half of the figure. Its value lies not in the quality of any specific item but in its constantly changing diversity. Combined with the other Web resources, the doubtful-quality free content acts as a wide-angle lens that helps the enterprise avoid tunnel vision of the marketplace.
Web Farming
Like the operational systems, the Web farming system provides input to the data warehouse. The result is that refined information about specific business subjects, sourced from the Web, is disseminated throughout the enterprise.
The primary source of content for the Web farming system is the Web itself, because of its external perspective on the business of the enterprise. As a content source, the Web can be supplemented (but not replaced) by the enterprise's intranet, whose content typically takes the form of internal Web sites, word processing documents, spreadsheets, and e-mail messages. However, intranet content is usually limited to internal information about the enterprise, and so lacks the external perspective that is the point of Web farming.
Most information acquired by the Web farming system will not be in a form suitable for the data warehouse. Also, as discussed above, the source and quality of the content need to be judged. In any case, the information must be refined before it is loaded into the warehouse. However, even in its unrefined state, information obtained from the Web through Web farming can be highly valuable to the enterprise, so the capability to disseminate it directly, via textual message alerts or “What’s New” bulletins, may be required.
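As one illustration of such direct dissemination, the sketch below packages newly acquired items into a “What’s New” bulletin and sends it by e-mail using Python's standard library. The SMTP host, addresses, and item texts are placeholders, not part of any prescribed Web farming interface.

```python
import smtplib
from email.message import EmailMessage

def send_bulletin(items: list[str], recipients: list[str]) -> None:
    """Package refined items as a plain-text bulletin and e-mail it."""
    msg = EmailMessage()
    msg["Subject"] = "What's New: external intelligence bulletin"
    msg["From"] = "webfarm@example.com"      # placeholder sender
    msg["To"] = ", ".join(recipients)
    msg.set_content("\n".join(f"- {item}" for item in items))
    with smtplib.SMTP("mail.example.com") as smtp:  # placeholder SMTP host
        smtp.send_message(msg)

send_bulletin(
    ["Competitor X cut list prices by 8%",
     "Supplier Y filed a new 10-K with EDGAR"],
    ["analyst@example.com"],
)
```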
Refining Information
When a data warehouse is first implemented within an enterprise, a detailed analysis and reengineering of data from operational systems is required (see the section on the Load Manager above). The same is true for Web farming: before Web content can be loaded into the warehouse, the information must be refined.
The process of refining information consists of four steps: discovery, acquisition, structuring, and dissemination.
Discovery is the exploration of available Web resources to find those items that relate to specific topics. It involves considerable detective work, far beyond searching generic directory services such as Google, Yahoo!, Bing, or AltaVista. Further, discovery must be a continuous activity, because data sources are continually appearing on and disappearing from the Web.
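A minimal sketch of this continuous discovery loop, assuming a hand-maintained watch list of candidate URLs and a set of topic keywords (both placeholders), might look like this:

```python
import urllib.request

# Hypothetical watch list and topic keywords; real ones come from analysts.
WATCH_LIST = [
    "https://example.com/industry-news",
    "https://example.org/filings",
]
TOPICS = ["semiconductor", "supply chain"]

def probe(url: str) -> bool:
    """Return True if the page is reachable and mentions any topic."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            text = resp.read().decode("utf-8", errors="replace").lower()
    except OSError:
        return False  # the source has disappeared, or is temporarily down
    return any(topic in text for topic in TOPICS)

# Run on a schedule (e.g. a daily cron job): sources come and go constantly.
for url in WATCH_LIST:
    print(url, "->", "relevant" if probe(url) else "unreachable or off-topic")
```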
Acquisition is the collection and maintenance of content identified by its
source. The main goal of acquisition is to maintain the historical context so
that you can analyze content in the context of its past. A mechanism to
efficiently use human judgement in the validation of content is another key
requirement.
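The sketch below illustrates the historical-context requirement: every fetch is stored as an immutable, timestamped snapshot so that content can later be analyzed against its own past. The directory layout and naming scheme are assumptions for illustration; a real system would also queue new snapshots for the human validation mentioned above, a workflow omitted here.

```python
import datetime
import hashlib
import pathlib

ARCHIVE = pathlib.Path("web_archive")  # placeholder archive location

def archive_snapshot(url: str, content: bytes) -> pathlib.Path:
    """Store one timestamped, content-hashed version per fetch."""
    now = datetime.datetime.now(datetime.timezone.utc)
    stamp = now.strftime("%Y%m%dT%H%M%SZ")
    digest = hashlib.sha256(content).hexdigest()[:12]
    src_dir = ARCHIVE / hashlib.sha256(url.encode()).hexdigest()[:12]
    src_dir.mkdir(parents=True, exist_ok=True)
    path = src_dir / f"{stamp}-{digest}.html"
    path.write_bytes(content)  # append-only: past versions stay untouched
    return path

archive_snapshot("https://example.com/prices", b"<html>snapshot body</html>")
```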
Structuring is the analysis and transformation of content into a more useful
format and into a more meaningful structure. The formats can be Web pages,
spreadsheets, word processing documents, and database tables. As we move toward
loading data into a warehouse, the structures must be compatible with the star-schema
design and with key identifier values.
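As an illustration of structuring, the sketch below turns a scraped price string into a typed fact row whose surrogate keys line up with a star schema. The dimension names (date_key, source_key, competitor_key) are invented for the example, not a prescribed design.

```python
from dataclasses import dataclass

@dataclass
class PriceFact:
    date_key: int        # surrogate key into the date dimension
    source_key: int      # which Web source supplied the figure
    competitor_key: int  # which external company it describes
    list_price: float    # the measure extracted from the page

def structure(raw_price: str, date_key: int, source_key: int,
              competitor_key: int) -> PriceFact:
    """Normalize text like '  $1,299.00 ' into a typed fact row."""
    value = float(raw_price.strip().lstrip("$").replace(",", ""))
    return PriceFact(date_key, source_key, competitor_key, value)

print(structure(" $1,299.00 ", 20240115, 7, 42))
```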
Dissemination is the packaging and delivery of information to the appropriate
consumers, either directly or through a data warehouse. A range of
dissemination mechanisms is required, from predetermined schedules to ad hoc
queries. Newer technologies, such as information brokering and preference
matching, may be desirable.
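Preference matching can be as simple as routing each refined item only to the consumers whose interest profiles it matches. The profile structure and keyword tags in this sketch are illustrative assumptions, not a prescribed mechanism:

```python
# Hypothetical interest profiles keyed by consumer address.
PROFILES = {
    "analyst@example.com": {"pricing", "competitors"},
    "buyer@example.com": {"suppliers", "logistics"},
}

def route(tags: set[str]) -> list[str]:
    """Return the consumers whose profile overlaps the item's tags."""
    return [user for user, interests in PROFILES.items() if interests & tags]

print(route({"pricing", "competitors"}))
# -> ['analyst@example.com']
```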