This article originally appeared on the BeyeNETWORK
The year 2008 is already half over, so it is time to update data warehousing trends, predictions and recommendations for the second half of the year. Trends toward simplification, rationalization, usability and consolidation at the front end of the system (“business intelligence”) are still on track to be fulfilled.1 However, three developments in the market are driving additional trends around data warehousing: column-oriented analytic databases, the maturing appliance market and open source databases. All of this is occurring against a deteriorating economic backdrop, with fuel prices reaching new highs and consumer confidence reaching new lows as home equity values decline. While software is obviously not as sensitive to the price of fuel as business travel or the transport of physical goods, technology providers cannot long remain immune if corporate buyers and their customers are hurting. Hence this update on trends.
Three public benchmarks for column-oriented data warehouses are now available. Call it a bold statement of the obvious, but three constitutes a critical mass of data and a trend. Two of the three have appeared since April 2008, from Kickfire and Exasol. The third, ParAccel, started the competition in 2007 by publishing audited benchmark results (see, for example, tpc.org) with a posting whose price/performance metric came in about an order of magnitude below some of the legacy results already published. Those who follow benchmarks know that you do not necessarily rush out to buy from whoever is most cost effective in a given quarter, since the vendors tend to leapfrog one another. As the only audited benchmark, the TPC is one of a kind in the current market. It provides data warehousing hardware and software vendors with “table stakes” and a place to start conversations about technical configurations, price, and tuning tips and techniques.
However, if column-oriented databases choose to emphasize the column rather than the row as a data structure, then that is an interesting interpretation of the relational model, not its abandonment. The point of the relational model – and the genius of Codd’s declarative definition – was to abstract away from the internal, underlying (physical) representation of the data by means of a declarative specification of data representation, access and control. Of course, if there are internal mechanisms such as pointers involved in the physical implementation, then some relational purists might have issues; but even standard relational databases have to track physical location. Sybase IQ, SAND/DNA (formerly Nucleus) and Alterian are column-oriented, use an SQL interface (mostly), and demonstrated technical viability years ago, even if traction in the market has been limited. Still, the reaction to columnar databases is bound to be – what is the big deal?
The big deal is not the column orientation, granted that it intrinsically shrinks the data as a function of the method of data representation even prior to use of formal compression algorithms. The big deal is that each of the contenders is bringing additional innovations to the technology. When layered on top of the advantages of column orientation, the new age analytic databases provide end-user enterprises with additional opportunities for high performance analytics from “best of breed” vendors, storage savings in an area where “big data” is common and validation of the maxim that working smarter beats working harder. The big deal is that just when your OLAP engine was about to run out of gas, the analytic database appears on the horizon. It has a standard SQL interface, is column based, and implements parallelization, shared nothing architecture and in memory acceleration.
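To make concrete why column orientation “intrinsically shrinks the data” even before formal compression, here is a minimal Python sketch (my own illustration, not any vendor’s implementation): values stored column-wise are homogeneous and often repetitive, so even naive run-length encoding collapses them dramatically, whereas a row-wise layout interleaves unlike values.

```python
# Illustrative sketch: why a columnar layout compresses well even with
# the simplest possible scheme (run-length encoding), before any formal
# compression algorithm is applied.

def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    encoded = []
    for v in values:
        if encoded and encoded[-1][0] == v:
            encoded[-1] = (v, encoded[-1][1] + 1)
        else:
            encoded.append((v, 1))
    return encoded

# Row-oriented layout: heterogeneous values interleaved per record.
rows = [(i, "US", "open") for i in range(1000)]

# Column-oriented layout: each attribute stored contiguously.
region_column = [r[1] for r in rows]   # 1000 identical strings
status_column = [r[2] for r in rows]

# The whole region column collapses to a single run.
print(run_length_encode(region_column))  # [('US', 1000)]
```

The same repetition is invisible to a row-at-a-time encoder, which sees `(0, "US", "open"), (1, "US", "open"), …` and finds no runs to collapse.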
For example, Kickfire is using what is basically a “hardware assist” – like a graphics co-processor but for queries – to lower costs per query. It uses an SQL parallelization processor with on-board cache, dedicated memory and integration with its MySQL database. Exasol is leveraging algorithmic inventions in “in memory” data caching, a proven performance enhancer. Obviously, the innovations are not limited to those issuing benchmarks.
ParAccel provides column-oriented data representation, advanced compression, shared nothing architecture, parallelization and in-memory capability. ParAccel is providing “parallelism without porting” on commodity 64-bit servers, along with automatic failover and high availability features, since data warehouses are now often mission-critical. The ParAccel Amigo implementation targets Microsoft SQL Server, but support for other databases is in preparation. While ParAccel is not the only start-up to boast visionary management – consider Michael Stonebraker’s leadership at Vertica – ParAccel has its share: Barry Zane (CTO) earned campaign ribbons at Netezza and Applix; Bruce Scott (VP Engineering) was employee number four at Oracle; and David Ehrlich (CEO) was responsible for strategy and marketing at NetIQ.
Those start-ups and long-term players riding the wave from the splash made by the published benchmarks are also working smarter. Vertica commercializes the C-Store column-oriented database. It provides materialized views (“projections”) across the cloud (“grid”) in such a way as to support data redundancy and reliability. It addresses the issue of update with write-optimized storage (WOS), using a “trickle loader” prior to long-term persistence on read-optimized storage (ROS). Vertica claims that WOS solves the problem of querying an up-to-date version of the data: newly loaded rows are available for query while the update is still in progress, reducing latency. Vertica notes that WOS is basically a temporary, work-in-progress location to be used for a short period of time (minutes, not days) to address the issue of concurrent update and inquiry. Still, the Vertica training manual (see page 5) states, “However, there is no join with itself or another fact table. Vertica is optimized for working with one fact table at a time in queries.” So be sure that the latest documentation is up-to-date, and understand that one person’s innovation may be another’s trade-off. Vertica is offering pricing innovations, too, dispensing with CPU-based pricing in favor of a standard price per terabyte (prior to compression!) on as many CPUs as the user can harness to the workload.
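The WOS/ROS division of labor can be sketched in a few lines of Python. The names follow the article; the mechanics below are my own simplification, not Vertica’s actual code: trickle inserts land in a small unsorted write buffer that remains queryable, and a background “move out” step periodically folds it into the sorted, read-optimized store in bulk.

```python
# Hedged sketch of the write-optimized / read-optimized storage idea:
# cheap appends into WOS, sorted bulk storage in ROS, and queries that
# see both -- so fresh data is visible before the merge runs.

import bisect

class ColumnStore:
    def __init__(self):
        self.ros = []   # read-optimized storage: kept sorted for fast scans
        self.wos = []   # write-optimized storage: unsorted append buffer

    def insert(self, value):
        self.wos.append(value)          # trickle load: no re-sort per row

    def query_count(self, low, high):
        # Queries consult WOS and ROS together, which is the claimed
        # latency reduction: no waiting for long-term persistence.
        in_ros = (bisect.bisect_right(self.ros, high)
                  - bisect.bisect_left(self.ros, low))
        in_wos = sum(1 for v in self.wos if low <= v <= high)
        return in_ros + in_wos

    def move_out(self):
        # Background task: fold WOS into ROS and re-sort once, in bulk.
        self.ros = sorted(self.ros + self.wos)
        self.wos = []

store = ColumnStore()
for v in [5, 1, 9]:
    store.insert(v)
print(store.query_count(1, 5))  # 2 -- visible before any merge
store.move_out()
print(store.query_count(1, 5))  # 2 -- same answer, now served from ROS
```

The trade-off is visible even in the toy: per-row inserts stay cheap and queryable, at the cost of a slower scan over the unsorted buffer until the mover runs – hence “minutes, not days.”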
Infobright is applying “rough set theory” – related to fuzzy logic – to enable what I would describe as “smart data” (this is not Infobright’s term) without any requirement for indexing. Infobright calls it an “intelligent knowledge grid”: an automatically tuned, ultra-small layer of metadata, populated by data profiling that occurs at a low level “under the hood.” The data gets smarter – faster at answering the questions posed by queries – through metadata intrinsically associated with it, without the overhead of building or maintaining indexes. The Infobright technology applies what were previously regarded as data mining methods to very large databases, producing data mining results and reducing uncertainty in decision making. From the perspective of data warehousing trends, the point is that once a firm has a data warehouse of clean, consistent, high quality data, the obvious use for it is data mining and predictive analytics. The recommendation is to watch for, and plan on, more of the latter.
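The flavor of the “knowledge grid” idea can be shown with a small sketch: keep a tiny, automatically maintained layer of metadata (here, just the min and max per block of rows) so a query can dismiss whole blocks without any user-built index. The structure below is my own simplification for illustration, not Infobright’s internals, and the block size is deliberately tiny.

```python
# Illustrative sketch: per-block summary metadata lets a range query
# skip irrelevant blocks entirely, with no index to build or maintain.

PACK_SIZE = 4  # tiny for illustration; real systems use far larger blocks

def build_metadata(column):
    """Summarize each pack of rows by its min and max value."""
    packs = [column[i:i + PACK_SIZE]
             for i in range(0, len(column), PACK_SIZE)]
    return packs, [(min(p), max(p)) for p in packs]

def count_matches(packs, meta, low, high):
    """Count values in [low, high], scanning only packs that can match."""
    scanned = hits = 0
    for pack, (lo, hi) in zip(packs, meta):
        if hi < low or lo > high:
            continue                    # irrelevant pack: skipped entirely
        scanned += 1                    # relevant or suspect: scan the rows
        hits += sum(1 for v in pack if low <= v <= high)
    return hits, scanned

column = [1, 2, 3, 4, 50, 60, 70, 80, 5, 6, 7, 8]
packs, meta = build_metadata(column)
hits, scanned = count_matches(packs, meta, 1, 10)
print(hits, scanned)  # 8 matches found while scanning only 2 of 3 packs
```

The metadata is a by-product of loading the data – a form of low-level profiling – which is why the grid can be “automatically tuned” rather than designed by a DBA.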
This is the first time in several years that the word “appliance” does not top the list of data warehousing trends. As a trend, appliances show no sign of letting up in causing deals to migrate from the standard relational database vendors and, in some cases, back again. The issue is whether it is still a trend if (almost) everyone has one to offer. I stand by previously published statements that the appliance market is there for the taking by the “big guys” – large vendors such as HP, IBM, Microsoft and Oracle. But a word of qualification is in order. If the vendor has merely attached the label “appliance” to what is actually a complex assembly of hardware and an “integrated solution,” then buyer push back is in the offing. An appliance-like reference architecture can help bridge the gap between complex, one-off custom data warehousing configurations and a “one size fits all,” shrink-wrapped technology stack. SGI is reportedly readying one using an Oracle database, Altix servers and SGI InfiniteStorage systems scaling from 5TB to several hundred terabytes. But if the vendor (not just SGI, any vendor) is waiting for an accumulation of orders before manufacturing the box, using legacy-style methods that require long lead times, then the deal is off.
When the value proposition is “drop it off, load it up, run it” – simplification, reduced time to results and commodity (low) cost infrastructure (hardware) – then a long lead time is a show stopper. This is so whether the time is spent ordering, assembling and tuning the appliance on the client site, in the vendor’s sales organization or upstream in the manufacturing process itself. An appliance is not a fruit or vegetable; yet its shelf life is short, because the main reason to get one is to save time now, not next quarter. Having validated the concept, the appliance vendors now find the market suffering from a lack of supply, in spite of the fact that there are so many sellers. Curious. The market is not oversold; some vendors are. For the first time in a long time, “buy now and save” is actually true. But by the time the vendors ramp up manufacturing and related delivery processes (actually giving sales something to sell), IT budgets will have felt the negative effect of the economic challenges now percolating through the economy.
A question for any prospective buyer of a data warehousing appliance – is there a single part number in the vendor catalog and on the order or invoice? While it is just a detail, it can be a telling one. If the product is still a bill of materials that requires a special associate’s degree to interpret, each part with its own separate SKU, the configuration may actually be a technical architecture, not an appliance. This is not to say that a technical architecture is bad. Far from it. It may offer flexibility and adaptability for workloads that cannot be envisioned in advance. Just do not buy it imagining that it will behave like an appliance or deliver the value of one.
If you look at the column-oriented data warehousing market, you find Infobright and Kickfire leveraging MySQL. On the appliance side, Greenplum and Netezza use a version of PostgreSQL, and DATAllegro uses Ingres. Of course, by the time these vendors are done with these databases, they are no longer necessarily open source in any ordinary sense of the words. But that is not the point. The point is that there is a third trend here – open source databases. This technology is infiltrating the market in many ways: “under the hood” of other technology products; around the margins in “non mission critical” roles which, however, will be mission critical soon enough; and at the center among technologically advanced adopters.
I know, I know. Open source databases are not ready for prime time. They still lack the “heartbeat” functions that would make them capable of supporting mission-critical applications requiring mirroring, rollback, automatic failover, redo and related components of high availability. Duly noted. These capabilities are a work in progress – in many cases, at least three years out. But what is less noticed is that open source databases – in particular MySQL from Sun and Postgres Plus from EnterpriseDB – are working their way through the enterprise as components of appliances, column-oriented data marts, numerous applications that are not mission critical, and “under the hood” of tools and technologies whose makers do not want to bother with hard-to-manage relationships with large vendors. Make no mistake – open source is not for the technological beginner and should only be deployed with a support package from a vendor that will be available on weekends, holidays and when you least expect to need it. Still, it is the one innovation I can think of that will really disrupt the existing installed base of the standard “big guys.” It is truly an innovator’s dilemma, though not in the original sense intended by Clayton Christensen.2 It is the one thing that puts fear into the hearts of standard relational database software providers everywhere – not because disruption is “only” ten years out, but because every proprietary technology feature and function (“hook”) added by the “big guys” to lock in clients further will only make commoditization more attractive and drive clients more forcefully and compellingly toward the alternative.
As the latest oil price shock ratchets through the economies of the western world, the Fed could finally decide to “defend” the dollar and raise interest rates. Cost cutting and lowballing will move to the forefront. In the short term, these economic dynamics could actually accelerate data warehousing deals, since enterprises with budget will be incented to use it up before it evaporates due to CFO-mandated budget cuts.
Plan on managing analytic column-oriented databases, legacy OLAP data cubes and other special purpose data stores according to a process designed for functional data marts. If the analytic application is being fed from a centralized enterprise data warehouse (EDW), then its service level agreements will be derived from, and possibly less rigorous than, those of the EDW. However, if the functional data mart, regardless of platform, is being used to take load off the central EDW, then be prepared to implement a process to conform the data to the central design and to fold it back into the center as required.
Adopt parallel technology to break through the bottleneck at load time. For example, using the standard single loader that operates on a single data stream, ParAccel can load some 750GB per hour. With parallelization, that delivery rate can be substantially increased.
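The recommendation can be sketched in a few lines: split the input stream into chunks and hand each slice to its own loader running concurrently, so throughput is no longer capped by a single stream. The chunking scheme, parse step and worker count below are illustrative placeholders of my own, not ParAccel’s loader; a production load would use separate processes or nodes rather than threads.

```python
# Illustrative sketch: partition one input stream across several
# concurrent loaders to break the single-loader bottleneck.

from concurrent.futures import ThreadPoolExecutor

def load_chunk(lines):
    """One loader: parse and 'load' its slice of the stream."""
    return [tuple(line.split(",")) for line in lines]  # stand-in for a real load

def parallel_load(lines, workers=4):
    # Partition the stream so each loader gets a contiguous slice.
    step = max(1, len(lines) // workers)
    chunks = [lines[i:i + step] for i in range(0, len(lines), step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        loaded = list(pool.map(load_chunk, chunks))  # loaders run concurrently
    # Reassemble in order: same result as a single-stream load.
    return [row for chunk in loaded for row in chunk]

data = [f"{i},value{i}" for i in range(1000)]
rows = parallel_load(data, workers=4)
print(len(rows))  # 1000 rows, spread across four concurrent loaders
```

The payoff is the same as in the ParAccel figure above: identical output to the single-stream load, with the wall-clock time divided across however many loaders the input can be partitioned over.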
Avoid data warehousing religious wars. The star schema is a powerful and proven way of representing basic data warehousing inquiries. However, data represented in snowflaked form also satisfies many business applications. Some column-oriented databases use data profiling technology to circumvent the requirement for indexes altogether. Others throw massively parallel processing at scanning, or store condensed data in memory. Look for database optimizer technology that supports multiple formats and query tactics and does not force a choice where no choice is required.
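To see why the star-versus-snowflake argument is often a religious war rather than a substantive one, here is a small sqlite3 sketch. The table and column names are invented for illustration: the star keeps a flat product dimension, the snowflake normalizes the category out into its own table, and both answer the same business question (at the cost of one extra join in the snowflake).

```python
# Illustrative sketch: the same aggregate query answered against a star
# schema (flat dimension) and a snowflake schema (normalized dimension).

import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    -- Star: one flat dimension joined straight to the fact table.
    CREATE TABLE dim_product_star (product_id INTEGER, category TEXT);
    -- Snowflake: the category attribute normalized into its own table.
    CREATE TABLE dim_product (product_id INTEGER, category_id INTEGER);
    CREATE TABLE dim_category (category_id INTEGER, category TEXT);
    CREATE TABLE fact_sales (product_id INTEGER, amount REAL);

    INSERT INTO dim_product_star VALUES (1, 'books'), (2, 'games');
    INSERT INTO dim_product VALUES (1, 10), (2, 20);
    INSERT INTO dim_category VALUES (10, 'books'), (20, 'games');
    INSERT INTO fact_sales VALUES (1, 5.0), (1, 7.0), (2, 3.0);
""")

star = db.execute("""
    SELECT d.category, SUM(f.amount)
    FROM fact_sales f JOIN dim_product_star d USING (product_id)
    GROUP BY d.category ORDER BY d.category
""").fetchall()

snowflake = db.execute("""
    SELECT c.category, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p USING (product_id)
    JOIN dim_category c USING (category_id)
    GROUP BY c.category ORDER BY c.category
""").fetchall()

print(star == snowflake)  # True: same answer, one extra join in the snowflake
```

A capable optimizer should make this choice largely invisible to the business user, which is the point of the recommendation above.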
The advantage goes to the innovator. But end-user enterprises are still cautioned to make time to check references and to visit customers, met at user group conferences or vendor trade shows, who are using similar technology. An uncensored conversation with an end user at an enterprise like your own, one that is using the technology in production and has a few “war stories,” is worth as much as a trip to the vendor lab, though you should undertake the latter as well. Meanwhile, from the perspective of the vendor, work with prospective clients to create customers so satisfied that they will want to tell your story to others. Otherwise, the success of the software will be a tree falling in the forest with no one there to hear it.
Lou Agosta is an independent industry analyst, specializing in data warehousing, data mining and data quality. A former industry analyst at Giga Information Group, Agosta has published extensively on industry trends in data warehousing, business and information technology. He is currently focusing on the challenge of transforming America’s healthcare system using information technology (HIT). He can be reached at LAgosta@acm.org.