This article originally appeared on the BeyeNETWORK.
Web-based systems are a valuable source of information for business intelligence (BI) analytical processing. The business content in these systems can be used for a variety of applications, from competitive pricing analysis and determining customer-buying trends to measuring call center performance and analyzing product feedback in blogs.
Two issues that arise when using web content for business intelligence are how to capture web content and integrate it into a data warehouse ready for analysis, and how data governance policies and procedures should be updated to deal with the dynamic and potentially inaccurate nature of web-based information. This article discusses the analysis of web content and reviews governance guidelines not only from a data quality perspective, but also from a legal and etiquette viewpoint.
Types of Web Content
In general, there are two types of web-based information of interest for analytical processing. The first is the actual content of the web pages themselves. This content covers a vast range of information. Examples include product pricing and special offers, government statistical data, call center feedback forms, customer complaint and feedback data in blogs and email, product review information, social tagging/feedback information and so forth.
The second type of web information involves tracking the network interactions between web users and websites. This information helps organizations analyze web performance, site navigation and search queries, abandoned shopping baskets, competitive websites users switch to and so forth.
Most of the examples described involve customer-facing web applications. Web analytics, however, applies equally to other types of web applications, from corporate intranets and extranets to financial trading web applications.
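One common source of the interaction data described above is the web server access log. The sketch below, a minimal illustration using made-up log lines in the standard Common Log Format, shows how page navigation counts can be pulled from such a log; the field layout and sample entries are assumptions, not output from any real site.

```python
import re
from collections import Counter

# Regular expression for Common Log Format lines (field names are
# descriptive labels, the format itself is the standard one).
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) \S+'
)

# Illustrative sample entries; a real analysis would read the log file.
sample_log = [
    '10.0.0.1 - - [12/Mar/2009:10:15:01 +0000] "GET /products HTTP/1.1" 200 512',
    '10.0.0.1 - - [12/Mar/2009:10:15:09 +0000] "GET /search?q=widgets HTTP/1.1" 200 743',
    '10.0.0.2 - - [12/Mar/2009:10:16:30 +0000] "GET /basket HTTP/1.1" 200 120',
]

page_hits = Counter()
for line in sample_log:
    match = LOG_PATTERN.match(line)
    if match:
        # Strip the query string so /search?q=widgets counts as /search
        page_hits[match.group("path").split("?")[0]] += 1

print(page_hits.most_common())
```

The same parsed records, with the session and query-string fields retained, would feed analyses of search queries and abandoned shopping baskets.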
Capturing and Analyzing Web Content
Before web content can be processed by analytical applications, it must be captured and transformed into a format suitable for analysis. In the days of fixed and static web pages this was fairly easy to do; the capture application simply had to extract the appropriate HTML data from the website. Today, however, websites are more dynamic and more complex. The increasing use of rich Internet applications (RIA) that employ technologies such as Ajax and Adobe Flash makes it more difficult to capture web page content and website interactions.
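For the simpler static-page case, extraction can be done with nothing more than an HTML parser. The sketch below pulls product names and prices out of a fragment of markup; the page layout and class names are illustrative assumptions, not any real site's structure.

```python
from html.parser import HTMLParser

# Made-up product listing markup standing in for a fetched page.
PAGE = """
<html><body>
  <div class="product"><span class="name">Widget</span>
    <span class="price">9.99</span></div>
  <div class="product"><span class="name">Gadget</span>
    <span class="price">24.50</span></div>
</body></html>
"""

class PriceScraper(HTMLParser):
    """Collects (name, price) pairs from spans marked with class attributes."""

    def __init__(self):
        super().__init__()
        self.current = None   # which field we are inside, if any
        self.rows = []        # [name, price] pairs

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class", "")
        if tag == "span" and css_class in ("name", "price"):
            self.current = css_class

    def handle_data(self, data):
        if self.current == "name":
            self.rows.append([data.strip(), None])
        elif self.current == "price":
            self.rows[-1][1] = float(data.strip())
        self.current = None

scraper = PriceScraper()
scraper.feed(PAGE)
print(scraper.rows)
```

Pages built dynamically with Ajax or Flash defeat this kind of parsing, which is why the API and product-based approaches described next become necessary.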
Three common approaches used to capture web content are:
- Use a Web API. Not all websites have a formal API, but many of those that do are documented on www.programmableweb.com.
- Employ a Commercial Product. A variety of products will scrape the contents of web pages and convert them into an XML file. One product that illustrates what is possible is the Kapow Mashup Server, which provides a workflow capability that can enter keywords into a web page, capture web content and route the results to an XML file, web service, email, etc.
- Capture Web Data Syndication (RSS/Atom) XML Feeds. Many websites already publish their content as RSS or Atom feeds, which are XML documents and can be consumed directly.
Many companies use a combination of all three approaches. The ideal result in all cases is an XML file of the required web content.
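The third approach is the most direct, since the feed is already XML. The sketch below turns an RSS 2.0 feed into records ready for downstream loading; the feed content is a made-up example, and a real feed would be fetched over HTTP before parsing.

```python
import xml.etree.ElementTree as ET

# Illustrative RSS 2.0 feed standing in for a fetched syndication feed.
RSS = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <title>Product Reviews</title>
  <item><title>Widget review</title>
    <link>http://example.com/reviews/1</link>
    <pubDate>Thu, 12 Mar 2009 10:00:00 GMT</pubDate></item>
  <item><title>Gadget review</title>
    <link>http://example.com/reviews/2</link>
    <pubDate>Fri, 13 Mar 2009 09:30:00 GMT</pubDate></item>
</channel></rss>"""

root = ET.fromstring(RSS)

# Flatten each <item> into a record suitable for analysis or loading.
records = [
    {
        "title": item.findtext("title"),
        "link": item.findtext("link"),
        "published": item.findtext("pubDate"),
    }
    for item in root.iter("item")
]
print(records)
```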
Analyzing the Content
The XML output from the capture process can be routed to a data integration tool for loading into a data warehouse, or can be analyzed by a separate content analytics application.
When web content is to be used to supplement existing structured data, it is useful to integrate the content into a data warehouse. However, when the majority of the information being analyzed is unstructured content, and possibly semi-structured data such as XML, then the use of a content analytics application may be more appropriate. Such applications vary from extensions to search products to dedicated web analytics applications. One of the most commonly used web analytics applications is Google Analytics. One website that has good information about products that support content and web analytics is http://www.cmswatch.com/.
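The data warehouse path amounts to flattening the captured XML into rows and loading them into a staging table for analysis alongside existing structured data. The sketch below illustrates this with sqlite3 standing in for the warehouse; the XML content and staging schema are illustrative assumptions.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Made-up captured web content (e.g., competitor pricing offers).
CAPTURED = """<offers>
  <offer><product>Widget</product><price>9.99</price></offer>
  <offer><product>Gadget</product><price>24.50</price></offer>
</offers>"""

# Flatten the XML into relational rows.
rows = [
    (offer.findtext("product"), float(offer.findtext("price")))
    for offer in ET.fromstring(CAPTURED).iter("offer")
]

# Load into a staging table; sqlite3 stands in for the warehouse here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stg_offers (product TEXT, price REAL)")
conn.executemany("INSERT INTO stg_offers VALUES (?, ?)", rows)

# Once staged, the web content can be queried like any other data.
avg_price = conn.execute("SELECT AVG(price) FROM stg_offers").fetchone()[0]
print(avg_price)
```

In practice the flattening step would be handled by a data integration (ETL) tool, which is what makes the XML output of the capture process so convenient.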
Governing the Content
One of the most common questions I get about web-related content is how to govern it. As Steve Krol of the Lyons Consulting Group said at an Enterprise 2.0 conference in Rome last December, “Web 2.0 has fundamentally different rules than traditional enterprise technology, and requires changes to the traditional IT governance model.”
Applying a traditional approach to governing web content defeats the instant and interactive nature of web and social computing technologies. Instead, Steve Krol suggests that the governance model should be based on the technology that is being governed. For example, web content should be self-governed, whereas content management systems require tight governance. Social tagging and social networking, on the other hand, fall somewhere between these two extremes.
New social computing technologies such as blogs, wikis, social tagging and social networks tend to be the biggest areas of concern in enterprises because their use is growing rapidly. While data quality is an obvious issue, many companies are beginning to accept that "good enough" web data is sufficient for many analytical applications. Privacy and confidentiality are often the bigger problems.
Certain types of data may have to be rigidly controlled for legal reasons. Financial and personnel data are examples. Other types of data need not be so rigidly controlled, and this is where more flexible governance policies and procedures are required.
The instant nature of social computing requires that governance policies include etiquette guidelines. In his new book, Snark, David Denby had this to say about information on the Internet, “There’s both an enormous explosion of information and expression, much of it useful or fun, and also an explosion of pent-up rage, social anguish, resentment, bilious and annihilating nastiness, prejudice and all the rest of the dark side.”
Robert Scoble, best known for his blog, Scobleizer, which came to prominence during his tenure as a technical evangelist at Microsoft, recommends that bloggers should, amongst other things, "have a thick skin" and should "avoid writing during times of emotional turmoil." This may be good advice, since blogging can get you fired, as Chez Pazienza, a CNN producer, discovered when he supposedly ignored vague corporate guidelines.
The CNN situation demonstrates that some clear policies are required. I think IBM’s guidelines to its staff on the use of social computing are clear, while at the same time pragmatic. Some of the more general ones include:
- Be mindful that what you publish will be public for a long time
- Respect copyright, fair use and financial disclosure laws
- Ask permission to publish or report on conversations that are meant to be private
- Don't cite or reference clients, partners or suppliers without their approval
- Respect your audience
- Don't pick fights
- Try to add value
The bottom line is that web-based content is a powerful source of information for business intelligence analytical processing, but organizations need to carefully select the types of content and technologies used, and need to have a pragmatic approach to governing this content.