Amazon.com upended brick-and-mortar book stores and more on its way to dominance in the worlds of e-commerce and cloud computing. Now its sights are set on the data warehouse in the cloud.
As in its past forays, low cost will mark its effort. Cost has often been one of the most painful aspects of data warehousing.
Seattle-based Amazon last week showed a limited preview of the new Amazon Web Services (AWS) offering, called Redshift. The company discussed the new data warehousing service at the AWS re:Invent 2012 conference in Las Vegas. Amazon Redshift comes billed as a massively parallel data warehouse comprising one or more Redshift "cluster nodes" that are accessible through AWS application programming interfaces (APIs) and some standard data interfaces.
"It allows you to easily and rapidly analyze petabytes of data," claimed Andy Jassy, senior vice president of Amazon. "It's about a tenth of the cost of traditional data warehouse solutions." In fact, Amazon's estimates put the cost at less than $1,000 per terabyte per year, with configurations topping out -- for now -- at 1.6 petabytes. For data warehousing, this is cheap.
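To put those figures in perspective, here is a rough back-of-the-envelope sketch in Python. It treats the quoted "less than $1,000 per terabyte per year" as a ceiling, assumes 1 petabyte equals 1,000 terabytes, and derives the "traditional" figure only from Jassy's one-tenth claim; the names are illustrative and none of this reflects actual Amazon pricing tiers.

```python
# Rough cost ceiling from the figures quoted above (not actual AWS pricing).
COST_PER_TB_YEAR_USD = 1_000   # "less than $1,000 per terabyte per year" (upper bound)
MAX_CLUSTER_TB = 1_600         # 1.6 petabytes, assuming 1 PB = 1,000 TB
TRADITIONAL_MULTIPLE = 10      # implied by "about a tenth of the cost" (assumption)


def annual_cost_ceiling(terabytes, rate_per_tb=COST_PER_TB_YEAR_USD):
    """Return the upper-bound yearly cost in dollars for a given data volume."""
    return terabytes * rate_per_tb


redshift_ceiling = annual_cost_ceiling(MAX_CLUSTER_TB)
traditional_estimate = redshift_ceiling * TRADITIONAL_MULTIPLE

print(f"Redshift ceiling at 1.6 PB: ${redshift_ceiling:,} per year")
print(f"Implied traditional cost:   ${traditional_estimate:,} per year")
```

Under those assumptions, the largest configuration would run less than $1.6 million a year, against an implied $16 million or so for a comparable traditional system.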
Redshift follows on the heels of other cloud data services the company built up over recent years. To its early elastic caching services, it has added relational databases, NoSQL databases, content distribution and data analytics. These services can be purchased on a pay-as-you-go basis.
To date, the Redshift AWS data warehouse service has been available only in a controlled beta to a small group of customers that includes Flipboard, NASA/JPL, Netflix and others. Now, the beta program is being expanded, with general availability expected in 2013.
Amazon said Redshift includes technology components licensed from ParAccel Inc., maker of analytic appliances. The ParAccel offering is advanced, suggesting that Amazon is pursuing high performance, and not just low price points, as it launches its cloud-based data warehousing venture.
Does Redshift herald data shift?
"The Amazon announcement heralds the beginning of a mass shift of business intelligence and data warehousing to the public cloud," said Wayne Eckerson, director of business intelligence (BI) leadership research at TechTarget Inc., the parent of SearchDataManagement.com.
The progress of data to the cloud has been slow so far, he said, as people grappled with security and data uploading bottlenecks, as well as issues of custom development and reliability. "Nonetheless, the benefits of public cloud implementations are too great to hold back for long," Eckerson said.
Amazon's Jassy stressed potential cost benefits, especially savings related to managing the data warehouse once it's built. "Anybody who has used a traditional, old guard data warehouse knows that it is really expensive and really complicated to manage," he said.
Jassy pointed to data from Stamford, Conn.-based IT research firm Gartner Inc., which indicated that companies need to hire three to four database administrators for every data warehouse they have. "It's a pain in the butt to manage," he said.
This cost-of-ownership issue has long stalked data warehousing. A promise of cloud computing is that it offers economies of scale, although customization needs can cut into this advantage. Over time, the cloud versions of data warehouses are likely to gain in favor, Eckerson said.
"If you can eliminate your DBAs and data center, yet receive better performance at [a] lower price, then BI in the cloud is inevitable," he said. But it will still take time, he continued, "for the obvious cloud advantages to overcome embedded tradition and take root in corporate computing environments."
Data warehouse customization conundrum
The pace at which data warehousing moves to Amazon or other clouds will vary, depending on the type and size of a company and the specifics of its data customizations, said Sandy Williamson, who heads Reston, Va.-based CapTech Consulting, which includes data warehousing among its key practices. Williamson is skeptical about Amazon Redshift's prospects among larger organizations.
"Large corporations build their own private clouds and they are not letting their data out to the public cloud," he said. "Their understanding of their consumers is there."
For CapTech's part, he said, the consultancy already makes use of some cloud tools to build databases, but not necessarily on a large scale yet. "We use some of Amazon's cloud development platforms for mobile development and for some prototyping," he said. Like others, Williamson suggested that new applications are likely better candidates than existing ones for duty on Amazon or other cloud computing platforms.
"The real problem with doing data warehousing and business intelligence in the cloud is that it's usually a custom development project," Eckerson said. "Most [Software as a Service] cloud offerings are more packaged in nature and easier to purchase and install."
With BI, you need to create a custom data model based on your organization's unique structure, requirements and data sources, he said, adding that custom reports are required on top of those infrastructure elements. The agility and speed that the cloud offers are tempered a bit in the BI world, Eckerson said. Data movement, too, can be an issue.
The Redshift concept has merit, according to Sandy Williamson's CapTech Consulting colleague Ben Harden. "Amazon has the ability to add CPUs exponentially to work on data sets. Once the data is there, you can do all the slicing and dicing," said Harden, director and business intelligence expert at CapTech. "But getting the data there is far easier for those that are already there."
Data movement issues are still to be resolved for cloud implementations, including those of Amazon. So the Amazon offering, several observers stated, is for now mostly aimed at its own existing customers -- ones whose data is already in place on the Amazon cloud.
"How do you get a petabyte of data into the cloud? You don't just 'FTP' that in an hour," Harden said. "The market for this largely has to be for people whose operations or e-commerce sites are already on the Amazon cloud."
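Harden's point can be checked with simple arithmetic. The Python sketch below estimates how long a petabyte would take to upload over a sustained network link; the link speeds are hypothetical examples rather than figures from Amazon or CapTech, 1 petabyte is taken as 10^15 bytes, and protocol overhead and retries are ignored.

```python
# Idealized upload time for a large data set over a sustained network link.
SECONDS_PER_DAY = 86_400
PETABYTE_BYTES = 10**15  # assumption: decimal petabyte


def transfer_days(size_bytes, link_bits_per_second):
    """Return the days needed to move size_bytes at a constant link rate."""
    total_bits = size_bytes * 8
    return total_bits / link_bits_per_second / SECONDS_PER_DAY


# Hypothetical sustained link speeds, in bits per second.
for label, bps in [("100 Mbps", 10**8), ("1 Gbps", 10**9), ("10 Gbps", 10**10)]:
    print(f"{label}: {transfer_days(PETABYTE_BYTES, bps):,.1f} days")
```

Even at a sustained 1 gigabit per second, a petabyte would take roughly 93 days to move, which is why data that already lives on the Amazon cloud has a head start.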
While reduced management work and low cost are major thrusts of Amazon's effort, the company also points to speed increases with Redshift. Those assessments are based on its own experience.
Amazon's Jassy said the company included parts of its in-house Amazon Enterprise Data Warehouse -- which, it estimates, has cost "multi-million" dollars to create -- in the private beta program for Amazon Redshift.
Company data managers said they have seen multi-hour queries finish in under an hour, and some queries that took five to 10 minutes on the current data warehouse returned in seconds with Redshift. This could be due in part to the high-performance ParAccel components, which Amazon is clearly aiming at a larger audience.