This article originally appeared on the BeyeNETWORK.
This data warehousing roundup covers several trends, innovations and vendors that could not be included in my recent article for lack of space. It especially includes trends in “high end” data warehousing, where the competition continues to be fierce, as well as trends at the entry level of data warehousing.
Aster Data Systems, Dataupia, and SGI (the former Silicon Graphics) are moving from the periphery toward the center of competition in a data warehousing market that offers buyers more options than ever before. It is always delicate to name three vendors in one breath, but all three participate in the data warehousing market. None of these vendors is wedded to ordinary, entry-level business intelligence problems (i.e., reporting), though all support the usual reporting connections and, in the case of SGI, bring a long legacy of advanced visualization capabilities. Each has distinctive capabilities, however, and it will be useful to look at the trade-offs as examples of different approaches to data warehousing. These can be generalized and abstracted into a lesson about the market, innovation and data warehousing itself.
SGI
Silicon Graphics (now simply SGI) became well known for supplying high-end graphics workstations to Hollywood for special effects in movies. The fact is, however, that SGI has always earned most of its revenue less conspicuously by providing high-performance computational infrastructure for large governmental bodies at home and abroad – meteorological services, bioinformatics programs and NASA, as well as groups that require secrecy. All of these combine substantial storage requirements with computationally intense data analysis.
SGI’s innovation is to make extreme main memory capacity possible, eliminating external data access (I/O) to disk. Here “extreme” means that the SGI Altix 4700 supports up to 128 terabytes (TB) of main memory. With SGI’s NUMAflex architecture, the Altix system combines the large physical address space of the Intel Itanium processor with a system interconnect capable of distributing that address space across hundreds or even thousands of nodes. The SGI interconnect (NUMAlink 4) provides a raw single-link aggregate transfer rate of 6.4GB/s (3.2GB/s in each direction). Commodity clusters reportedly use links with aggregate performance ranging from perhaps 100MB/s to 2.5GB/s. The blade-to-NUMAlink architecture (part of the innovation) allows compute blades to be combined with memory-only blades; memory-only blades with a similar DIMM1 configuration allow system memory to be scaled independently of processing capability for large-memory problems.
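As a rough illustration of why those link speeds matter, here is a back-of-envelope calculation of how long it would take to move a hypothetical 10TB working set over each class of link. The transfer rates come from the figures above; the working set size is an invented example.

```python
# Back-of-envelope: time to move a hypothetical 10TB working set over
# each class of link. Rates are taken from the text; the working set
# size is an invented example.
TB = 1024**4          # bytes per terabyte (binary)
GB = 1024**3          # bytes per gigabyte (binary)

data_bytes = 10 * TB

numalink_rate = 6.4 * GB    # NUMAlink 4 aggregate, bytes/second
commodity_low = 0.1 * GB    # ~100MB/s commodity link
commodity_high = 2.5 * GB   # ~2.5GB/s commodity link

def transfer_hours(nbytes, rate):
    """Hours to move nbytes at a sustained rate (bytes/second)."""
    return nbytes / rate / 3600

for label, rate in [("NUMAlink 4", numalink_rate),
                    ("commodity (low)", commodity_low),
                    ("commodity (high)", commodity_high)]:
    print(f"{label:16s} {transfer_hours(data_bytes, rate):7.2f} hours")
```

At the low end of the commodity range, the same data movement takes more than a day rather than well under an hour, which is the gap the NUMAlink numbers are meant to close.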
The approach of SGI’s Adaptive Data Warehouse contrasts with many of the data warehousing solutions at the high end that have chosen a so-called shared nothing architecture, in which main memory and external storage are subordinate to the individual processing node. The main benefit of shared nothing is nearly unlimited scalability: as more nodes are added to the configuration, the coordination costs are completely predictable and generally linear. The possible disadvantage occurs when a problem requires significant data movement between nodes. In addition, large-scale data access to disk is an issue for all commercial business systems, which tend to be I/O bound. With the exception of SGI – and some versions of IBM’s Sysplex mainframe – extreme memory spaces have not been the approach of choice for data warehousing outside the computationally intense problems indicated. Technically, pushing processing down closer and closer to where it is needed continues to be a method for improving performance, as exemplified by SGI’s implementation of the field programmable gate array (FPGA). However, unlike Netezza, where the FPGA is part of the I/O card, SGI puts the co-processor on a blade of its own (RC100), where it has visibility to all of main memory. With the emergence of new business problems driven by frontline data warehouses in e-commerce, social networking and Web 2.0, the advantages of large in-memory solutions for intense business analytics will become more prevalent.
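To make the shared nothing trade-off concrete, here is a minimal, vendor-neutral sketch: each row is owned by exactly one node, chosen by hashing a distribution key, so a query that filters on that key is routed to a single node, while operations on any other column imply the cross-node data movement described above. All names here are illustrative, not any vendor’s API.

```python
# Minimal sketch of shared nothing placement: each row is owned by
# exactly one node, chosen by hashing the distribution key.
from hashlib import md5

NODES = 4

def owner(key, nodes=NODES):
    """Deterministically map a distribution key to one owning node."""
    return int(md5(str(key).encode()).hexdigest(), 16) % nodes

# Place twelve toy rows; each lives on exactly one node.
rows = ["customer-%d" % i for i in range(12)]
placement = {}
for key in rows:
    placement.setdefault(owner(key), []).append(key)

# A filter on the distribution key is routed to a single node...
print("customer-7 lives on node", owner("customer-7"))
# ...whereas a join on any other column forces rows to be shipped
# between nodes -- the data-movement cost noted in the text.
print({node: len(keys) for node, keys in sorted(placement.items())})
```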
Aster Data Systems
A frontline data warehouse is an idea whose time has come. The notion of a “frontline” data warehouse (FLDW) changes the meaning of “data warehouse” but preserves the analytic dimension. It has the size of an enterprise data warehouse (extremely large), the latency requirements of operational business intelligence (extremely low) and the service level agreement of a transactional system (7 x 24). The poster child for a FLDW is MySpace: lots of storage, lots of interaction and lots of analysis to drive revenue from eyeballs – web logs of 2TB a day to be aggregated, analyzed and boiled down.
Aster Data Systems is setting its sights on being known for handling frontline data warehousing. It sets a high bar for itself with an “always parallel” and “always on” marketing message. Based on my conversations with the company, this is how it proposes to back up the claims: scalability is addressed by innovations in autonomic computing that allow the queen node to manage updates to all the other nodes in the system, based on automated monitoring of the update distribution process.
These capabilities have been available – at least in theory – from vendors such as IBM, HP and Teradata. In the case of IBM, online redistribution was a part of its scalable parallel (SP) architecture that dates back to 1995. However, while the process always functioned as designed, real world issues about degraded performance and complexity prevented the functionality from reaching its full potential in the operational data center context that requires high performance and short downtime. According to Aster, “bare metal” servers can be plugged into the network and the Aster nCluster database automatically installs the entire software stack (OS, database, application), incorporating the new server into the cluster, including rebalancing data distribution in an online fashion (no disruption to queries during data re-balancing).
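The rebalancing idea can be illustrated generically with consistent hashing, a textbook technique under which adding a node relocates only a minority of rows rather than reshuffling everything. This is a sketch of the principle only, not Aster’s disclosed mechanism; all names are invented.

```python
# Generic sketch of online rebalancing via consistent hashing: adding a
# node moves only the keys adjacent to its position on the hash ring.
# This illustrates the principle, not Aster's actual implementation.
import bisect
import hashlib

def h(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes):
        self.points = sorted((h(n), n) for n in nodes)

    def owner(self, key):
        """The owning node is the first ring point clockwise of the key."""
        hashes = [p for p, _ in self.points]
        i = bisect.bisect(hashes, h(key)) % len(self.points)
        return self.points[i][1]

keys = ["row%d" % i for i in range(1000)]
before = Ring(["n1", "n2", "n3"])
after = Ring(["n1", "n2", "n3", "n4"])   # "bare metal" node plugged in
moved = sum(before.owner(k) != after.owner(k) for k in keys)
print(f"{moved} of {len(keys)} rows relocated")
```

Every relocated row lands on the new node; rows elsewhere on the ring are untouched, which is what makes an online, non-disruptive rebalance plausible.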
Let’s put it in a way that any skeptical industry analyst can understand. With more than 100TB of data under management at MySpace, either Aster is doing something that the client probably could not afford using a conventional data warehousing appliance reference architecture, or there is one heck of a science experiment going on over there. Not to take anything away from the authentic innovations at Aster, but it may be that frontline data warehousing is more forgiving of certain kinds of failure that can be recovered before anyone notices. However, no environment is less forgiving of latency than the online one.
The “aha” moment that the innovators at Stanford University (who eventually formed Aster) had regarding failover is this: Because failures are an inevitable fact of life, if a system can recover from a failure before anyone notices, then that is (almost) as good as if the fault never occurred. (This paradigm is based on “recovery-oriented computing” principles.) Indeed, if it minimizes the need for redundant systems that (by definition) cost twice as much, then it is even better. In brief, that is the basis of the innovations that Aster is bringing to frontline data warehousing.
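The recovery-oriented principle can be reduced to a toy sketch: if a transient failure can be retried and completed within the latency budget, the caller never notices, and no hot spare is needed. The store and its failure pattern below are invented for illustration.

```python
# Toy sketch of recovery-oriented computing: retry a transient failure
# fast enough and the caller never sees it. All names are invented.
class FlakyStore:
    """Pretend node that fails its first two reads, then recovers."""
    def __init__(self):
        self.failures_left = 2

    def fetch(self, key):
        if self.failures_left > 0:
            self.failures_left -= 1
            raise IOError("transient node failure")
        return f"value-for-{key}"

def fetch_with_recovery(store, key, attempts=5):
    """Retry within the latency budget instead of running a hot spare."""
    last = None
    for _ in range(attempts):
        try:
            return store.fetch(key)
        except IOError as err:
            last = err        # recover and try again before anyone notices
    raise last

print(fetch_with_recovery(FlakyStore(), "row-17"))  # prints value-for-row-17
```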
For those organizations whose analytic requirements necessitate intense data analysis, Aster offers an in-database implementation of MapReduce, popularized by Google for high-speed data filtering performed close to the disk, in parallel across all applicable processors. The approach also avoids the dilemma of proprietary stored procedures or user-defined functions, which force all processing through a single-threaded choke point and stop parallel processing dead. Combined with built-in parallel loading, this enables Aster’s client aCerno to perform rich consumer behavior analysis of 75 million online shopping events a day. Targeted online advertising that identifies when buyers are “in market” and shopping is unforgiving of delay.
“With shopping behavior of more than 140 million consumers to analyze daily, increasing data volumes were pushing our analysis latency to unacceptable levels,” said Peter Kools, chief technology officer for aCerno.
Result? What had been a 20-hour lag time was reduced to 6 hours or less to give advertisers and media partners quick access to vital information.
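The MapReduce pattern described above can be sketched in miniature: a map phase produces per-partition partial counts (one partition per node in a real system), and a reduce phase merges them into a global result. The events and partitioning here are invented for illustration.

```python
# Minimal in-process sketch of the MapReduce pattern: map runs
# independently over each partition, reduce merges per-key partials.
from collections import Counter
from functools import reduce

events = [  # one day's toy clickstream (invented)
    {"user": "u1", "action": "view"},
    {"user": "u2", "action": "buy"},
    {"user": "u1", "action": "view"},
    {"user": "u3", "action": "view"},
]

def map_phase(partition):
    """Emit per-key counts for one partition (per node in a real system)."""
    return Counter(e["action"] for e in partition)

def reduce_phase(partials):
    """Merge the per-partition counts into a single global result."""
    return reduce(lambda a, b: a + b, partials, Counter())

partitions = [events[:2], events[2:]]   # pretend each lives on its own node
totals = reduce_phase(map_phase(p) for p in partitions)
print(dict(totals))  # {'view': 3, 'buy': 1}
```

Because each map call touches only its own partition, the work parallelizes across nodes with no shared choke point, which is the property the in-database implementation exploits.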
What this analyst is trying to get a bead on as this article goes to press is how much management and administration, including installation overhead, is required of a client who goes with Aster, given its approach of configuring commodity (e.g., Dell or entry-level HP or IBM) servers. To that extent, Aster is not an appliance – nor does it claim to be one – but it joins the esteemed ranks of appliance reference architectures along with IBM and Sun.
Dataupia
In contrast, Dataupia offers server, storage and optimization software packaged as a single appliance. The key differentiator, however, is that it aims to leverage existing data stores in DB2, Oracle or Microsoft SQL Server by optimizing each vendor’s existing API for remote databases. In Dataupia’s case, the database is driven by an MPP engine that sits behind the host database server. End users and applications operate on database objects of the same name, transparently available and managed through the higher performing, more efficient Dataupia Satori Server. Integration with existing applications is straightforward because Dataupia looks like additional tablespace to Oracle (or DB2 or SQL Server). Dataupia calls this “omniversal transparency,” and functionally it does resemble what Oracle and IBM describe as “federation,” with this difference: the data is persisted on the Dataupia appliance instead of having to be revirtualized with every pass. In addition, Dataupia provides multiple data access methods that enable a wide range of data warehouse workloads to operate on the same system, offering both the scalability and the ease of use of an appliance. Another aspect of this “data utopia” is reportedly the pricing – a 2TB Dataupia Satori Server costs some $19,500, including both hardware and software.2 The obvious additional benefit is that the client is able to keep using its existing standard database.
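The “transparent tablespace” idea can be caricatured in a few lines: the application asks for a table by name and neither knows nor cares whether the rows are served locally or by the attached appliance. The classes and table names below are invented for illustration, not Dataupia’s API.

```python
# Caricature of transparent federation with persistence: the caller uses
# one name and one call; the catalog decides which backend serves it.
# All class and table names are invented.
class LocalStore:
    def scan(self, table):
        return [("local", table)]

class ApplianceStore:   # stands in for the MPP engine behind the host
    def scan(self, table):
        return [("appliance", table)]

class TransparentCatalog:
    def __init__(self):
        self.local, self.remote = LocalStore(), ApplianceStore()
        self.offloaded = set()

    def offload(self, table):
        self.offloaded.add(table)        # persist the table on the appliance

    def scan(self, table):
        store = self.remote if table in self.offloaded else self.local
        return store.scan(table)         # same name, same call, either backend

cat = TransparentCatalog()
cat.offload("sales_history")
print(cat.scan("sales_history"))   # served by the appliance
print(cat.scan("customers"))       # still served locally
```

Because the offloaded table is persisted on the appliance rather than re-fetched per query, repeated scans avoid the revirtualization cost the article attributes to classic federation.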
No Database Expertise?
But what if you are a new business with a modest amount of data – essential to running the business – and zero database expertise? You are not a database administrator or developer, though your brother-in-law goes bowling with one now and then. What are you going to do? You Google “SimpleDB,” hoping to learn about Amazon’s cloud database as a way of getting up and running quickly if you can find a developer, and instead you hit “TrackVia.” It is not oriented toward developers, which is good for you since you aren’t one. This, by the way, is a wonderful example of how Amazon, by the very existence of its SimpleDB, is unintentionally creating a whole new ecosystem for online databases. Strictly speaking, the TrackVia offering is software as a service for business users, not a cloud-based developer environment. It provides the ability to do precisely the sorts of things small and medium-sized businesses need to do: organize inventory, assets, locations or partners; sort and filter information; manage mailing lists, clients and contacts; create custom reports; share data in real time; and set up different levels of permission-based access.
TrackVia lives by the mantra that search is the killer application. TrackVia has done some original work with searching through relational structures – “googling” in the generic sense of the word by means of a metadata abstraction layer that TrackVia will not talk about because it is their “secret sauce.” However, it does enable a powerful searching capability that is simple, fast and smart without the fuss and bother of building a custom SQL statement. TrackVia’s searching supports:
- Phrases – Quotation marks, as in “New York” instead of New York, indicate that the terms must appear together.
- Exact match – An equal sign, as in =Rob instead of Rob, will only return records with an exact match (i.e., Rob, but not Robert).
- Negative match – A minus sign, as in white sox -red, means a term (i.e., red) must not appear in a record.
- Dates – You can now search date and time fields with expressions like Oct 23, 2007 or simply February.
- Empty fields – Searching on (none) will return records with blank fields.
- Specific fields – Putting a field name before a term, as in first name: Wilson, limits the search for that term to that one field.
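A toy tokenizer shows how such operators might be classified. This is a generic sketch, not TrackVia’s implementation, which the company does not disclose.

```python
# Toy tokenizer for the search operators listed above -- a generic
# sketch, not TrackVia's (undisclosed) implementation.
import re

TOKEN = re.compile(r'"([^"]+)"|(\S+)')

def parse(query):
    """Classify each term as phrase, exact, negative, field or plain."""
    terms = []
    for phrase, word in TOKEN.findall(query):
        if phrase:                       # "New York" -> terms must co-occur
            terms.append(("phrase", phrase))
        elif word.startswith("="):       # =Rob -> exact match only
            terms.append(("exact", word[1:]))
        elif word.startswith("-"):       # -red -> term must be absent
            terms.append(("negative", word[1:]))
        elif ":" in word:                # city:Chicago -> restrict to a field
            field, value = word.split(":", 1)
            terms.append(("field", (field, value)))
        else:
            terms.append(("plain", word))
    return terms

print(parse('"New York" =Rob -red city:Chicago sox'))
```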
For those organizations without database expertise and with an interest in not getting involved in data management, TrackVia is a choice that belongs on the short list along with Intuit QuickBase and (for those with a more developer mentality) Amazon’s SimpleDB.
1. DIMM (dual in-line memory module – dynamic random access integrated memory circuits). It should also be noted that innovation is no guarantee of success: SGI emerged from bankruptcy in October 2006, and its 2007 revenues were $386 million.
2. As reported by David Raab at www.customerexperiencematrix.blogspot.com/, whose comments on Dataupia (and technology in general) are always worth exploring.
Lou Agosta is an independent industry analyst, specializing in data warehousing, data mining and data quality. A former industry analyst at Giga Information Group, Agosta has published extensively on industry trends in data warehousing, business and information technology. He is currently focusing on the challenge of transforming America’s healthcare system using information technology (HIT). He can be reached at LAgosta@acm.org.
Editor's Note: More articles, resources, news and events are available in Lou's BeyeNETWORK Expert Channel. Be sure to visit today!