Saturday, July 19, 2014

DATA WAREHOUSE AND THE WEB


Professor Peter Drucker, the senior guru of management practice, has admonished IT executives to look outside their enterprises for information. He remarked that the single biggest challenge is to organize outside data because change occurs from the outside. He predicted that the obsession with internal data would lead to being blindsided by external forces. 

The majority of data warehousing efforts result in an enterprise focusing inward; however, the enterprise should be keenly alert to its externalities. As markets become turbulent, an enterprise must know more about its customers, suppliers, competitors, government agencies, and many other external factors. The changes that take place in the external environment ultimately get reflected in the internal data (and would be detected by the various data analysis tools discussed in later sections), but by then it may be too late for the enterprise. Proactive action is always better than reacting to external changes after their effects are felt. The conclusion is that information from internal systems must be enhanced with external information; the synergy of the combination creates the greatest business benefit.

External data is important, but integrating it with internally sourced data poses real challenges for the Load Manager. Some externally sourced data (particularly time-sensitive data) is often distributed through the Internet.
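
As a rough sketch, the Python fragment below shows how a time-sensitive external feed might be pulled over the Internet and dropped into a staging area for the Load Manager to pick up. The feed URL and staging directory are hypothetical placeholders, not references to any particular system.

import urllib.request
from datetime import datetime, timezone
from pathlib import Path

FEED_URL = "https://example.com/market-data/daily.csv"    # hypothetical external feed
STAGING_DIR = Path("staging/external")                    # hypothetical staging area

def stage_external_feed(url: str = FEED_URL) -> Path:
    """Download an external feed and stage it with a load timestamp in its name."""
    STAGING_DIR.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    target = STAGING_DIR / f"daily_{stamp}.csv"
    with urllib.request.urlopen(url, timeout=30) as response:
        target.write_bytes(response.read())
    return target   # the Load Manager would pick the file up from here

if __name__ == "__main__":
    print("Staged:", stage_external_feed())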

Reliability of Web Content 

Many question the reliability of Web content, as they should. However, few analyze the reliability issue to any depth. The Web is a global bulletin board on which both the wise and the foolish have equal space. The mere fact that content was acquired from the Web says nothing, positive or negative, about its quality.


Consider the following situation: If you hear, “Buy IBM stock because it will double over the next month,” your reaction should depend on who made that statement and in what context. Was it a random conversation overheard on the subway, a chat with a friend over dinner, or a phone call from a trusted financial advisor? The context should also be considered when judging the reliability of Web content. 

Think of Web resources in terms of quality and coverage, as shown in Figure 1 below.

Fig-1: Web-based Information Sources

Toward the top are information resources of high quality (accuracy, currency, and validity), and resources toward the right have wide coverage (scope, variety, and diversity). The interesting aspect of the Web is that information resources occupy all quadrants.

In the upper center, the commercial online database vendors have traditionally supplied businesses with high-quality information on numerous topics. However, the complexity of using these services and their infrequent update cycles have limited their usefulness.

More to the left, governmental databases have become tremendously useful in recent years. In the past, public information was often available only after many hours of manual labour at libraries or government offices. Recent developments, such as the Electronic Data Gathering, Analysis, and Retrieval (EDGAR) database maintained by the U.S. Securities and Exchange Commission, provide valuable and up-to-date data via the Web.

At the left, corporate Web sites often contain vast amounts of useful information in white papers, product demos, and press releases, eliminating the necessity to attend trade exhibits to learn the “latest and greatest” in a marketplace.

Finally, the “doubtful-quality free” content occupies the lower half of the figure. Its value is not in the quality of any specific item but in its constantly changing diversity. Combined with the other Web resources, the doubtful-quality free content acts as a wide-angle lens that guards against tunnel vision of the marketplace.

Web Farming

Like the operational systems, the Web farming system provides input to the data warehouse. The result is to disseminate refined information, sourced from the Web, about specific business subjects throughout the enterprise.

The primary source of content for the Web farming system is the Web, because of the external perspective it offers on the business of the enterprise. As a content source, the Web can be supplemented (but not replaced) by the intranet of the enterprise. This content typically takes the form of internal Web sites, word processing documents, spreadsheets, and e-mail messages. However, content from the intranet is usually limited to internal information about the enterprise, thus negating an important aspect of Web farming.

Most information acquired by the Web farming system will not be in a form suitable for the data warehouse. Also, as discussed above, the source and quality of the content need to be judged. In any case, the information must be refined before being loaded into the warehouse. However, even in its unrefined state, information obtained through Web farming can be highly valuable to the enterprise, so the capability to disseminate it directly, via textual message alerts or “What’s New” bulletins, may be required.

Refining Information

When a data warehouse is first implemented within an enterprise, a detailed analysis and reengineering of data from operational systems is required (see Section on Load Manager above). The same is true for Web farming. Before Web content can be loaded into a warehouse, the information must be refined. 

The process of refining information consists of four steps: 

Discovery, Acquisition, Structuring, and Dissemination. 

Discovery is the exploration of available Web resources to find those items that relate to specific topics. Discovery involves considerable detective work far beyond searching generic directory services such as Google, Yahoo!, Bing, or AltaVista. Further, the discovery activity must be a continuous process because data sources are continually appearing and disappearing from the Web.
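
To make the idea concrete, here is a minimal Python sketch of continuous discovery: a list of candidate sources is probed for a business topic, and the result of each probe is recorded so that sources that appear or disappear can be tracked over time. The seed URLs and the keyword are hypothetical; a real discovery process would also crawl directories and follow links.

import urllib.request
from datetime import datetime, timezone

SEED_SOURCES = [                       # hypothetical starting points
    "https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany",
    "https://example.com/industry-news",
]
TOPIC_KEYWORD = "semiconductor"        # hypothetical business subject

def discover(sources=SEED_SOURCES, keyword=TOPIC_KEYWORD):
    """Return (url, relevant, checked_at) for each candidate source."""
    results = []
    for url in sources:
        checked_at = datetime.now(timezone.utc).isoformat()
        try:
            with urllib.request.urlopen(url, timeout=15) as resp:
                page = resp.read().decode("utf-8", errors="replace")
            relevant = keyword.lower() in page.lower()
        except OSError:
            relevant = False           # unreachable -- the source may have disappeared
        results.append((url, relevant, checked_at))
    return results

if __name__ == "__main__":
    for url, relevant, when in discover():
        print(f"{when}  relevant={relevant}  {url}")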

Acquisition is the collection and maintenance of content identified by its source. The main goal of acquisition is to maintain historical context, so that content can be analyzed against its own past. A mechanism for efficiently applying human judgement to the validation of content is another key requirement.
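
A minimal sketch of acquisition with historical context might look like the following: each fetch of a source is stored as a new version, keyed by a content hash and timestamp, so earlier versions remain available for analysis, and a "validated" flag stands in for the human-judgement step. The database file name and source URL are placeholders.

import hashlib
import sqlite3
import urllib.request
from datetime import datetime, timezone

def acquire(url: str, db_path: str = "webfarm.db") -> None:
    """Fetch a source and store it as a new, timestamped, hashed version."""
    con = sqlite3.connect(db_path)
    con.execute("""CREATE TABLE IF NOT EXISTS acquisitions (
                       url TEXT, fetched_at TEXT, content_hash TEXT,
                       content BLOB, validated INTEGER DEFAULT 0)""")
    with urllib.request.urlopen(url, timeout=30) as resp:
        body = resp.read()
    con.execute(
        "INSERT INTO acquisitions (url, fetched_at, content_hash, content) VALUES (?, ?, ?, ?)",
        (url, datetime.now(timezone.utc).isoformat(),
         hashlib.sha256(body).hexdigest(), body),
    )
    con.commit()
    con.close()

if __name__ == "__main__":
    acquire("https://example.com/industry-news")   # hypothetical source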

Structuring is the analysis and transformation of content into a more useful format and into a more meaningful structure. The formats can be Web pages, spreadsheets, word processing documents, and database tables. As we move toward loading data into a warehouse, the structures must be compatible with the star-schema design and with key identifier values.
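
As an illustration, the fragment below structures a small piece of acquired content (here, a CSV snippet) into rows shaped for a hypothetical star-schema fact table, attaching a date key and a source dimension key. The schema, keys, and content are all assumptions made for the sake of the example.

import csv
import io

RAW_CONTENT = """company,headline_count
Acme Corp,12
Globex,7
"""                                    # stand-in for acquired content

SOURCE_KEY = 42                        # assumed key from a source dimension table
DATE_KEY = 20140719                    # assumed key from a date dimension table

def structure(raw: str = RAW_CONTENT):
    """Yield fact rows: (date_key, source_key, company, headline_count)."""
    for row in csv.DictReader(io.StringIO(raw)):
        yield (DATE_KEY, SOURCE_KEY, row["company"], int(row["headline_count"]))

if __name__ == "__main__":
    for fact in structure():
        print(fact)   # ready for the Load Manager to load into a fact table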

Dissemination is the packaging and delivery of information to the appropriate consumers, either directly or through a data warehouse. A range of dissemination mechanisms is required, from predetermined schedules to ad hoc queries. Newer technologies, such as information brokering and preference matching, may be desirable.
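
Finally, a very small sketch of dissemination with simple preference matching: new items are routed to consumers whose subscribed topics appear in an item’s title, with print standing in for e-mail alerts or a “What’s New” bulletin. The subscribers, topics, and items are invented for illustration.

SUBSCRIPTIONS = {                      # hypothetical consumers and their topics
    "analyst@example.com": {"competitor", "pricing"},
    "buyer@example.com": {"supplier"},
}

NEW_ITEMS = [                          # hypothetical newly refined items
    {"title": "Competitor announces pricing change", "url": "https://example.com/a"},
    {"title": "Supplier quarterly results", "url": "https://example.com/b"},
]

def disseminate(items=NEW_ITEMS, subscriptions=SUBSCRIPTIONS):
    """Deliver each item to every consumer whose topics match its title."""
    for item in items:
        words = set(item["title"].lower().split())
        for consumer, topics in subscriptions.items():
            if topics & words:         # at least one subscribed topic matches
                print(f"To {consumer}: {item['title']} ({item['url']})")

if __name__ == "__main__":
    disseminate()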
