Big data in the cloud is something like science-fiction writer William Gibson's famous description of the future: It's here -- it's just not very evenly distributed.
High-profile vendors such as Amazon Web Services, Google, Microsoft, IBM and Rackspace offer cloud-based Hadoop and NoSQL database platforms to support big data applications. A variety of startups have introduced managed services that run on top of the cloud platforms, freeing users from the need to deploy their own systems there. And mixing big data and cloud computing is often the first choice for Internet companies, especially software and data services vendors that are just getting started themselves.
But many mainstream organizations don't view managing data in the cloud the way the Web wunderkinds do. Some get white knuckles about data security and privacy protections in the cloud. Others still run most of their operations on mainframes and other well-entrenched systems far removed from cloud architectures. And the sheer mass of data stored in such systems makes moving it to the cloud physically challenging. Moreover, available processing capacity in existing data centers makes the promised financial benefits of using public clouds like AWS and the Google Cloud Platform less compelling, even to companies that are interested in taking advantage of the reduced costs and increased flexibility that cloud-based systems can provide.
Citigroup Inc. is a case in point. The financial services company is faced with an incoming flood of unstructured data as the Web becomes a ubiquitous application interface. It also has to deal with a mix of different data structures in online financial applications. Those challenges led Citi to adopt MongoDB's namesake NoSQL database. MongoDB is supported on AWS and other cloud platforms, and Citi is taking a cloud approach with the software, said Michael Simone, global head of CitiData platform engineering. But it's a private cloud, built within the confines of the New York company's corporate firewall and fully managed by its IT department.
"For now, we're not committing to extended or public cloud integration," Simone told attendees at the MongoDB World conference in New York in June. "Citigroup's data centers are vast and deep themselves, and we feel we can build an economical [on-premises] cloud."
Big data cloud not crowded yet
Overall, running big data systems in the cloud is still a minority affair. Of 222 IT and business professionals who completed an online assessment between October 2013 and last May against a big data maturity model developed by The Data Warehousing Institute, only 19% said their organizations were using public, private or hybrid clouds to support big data applications. Another 40% said cloud deployments were under consideration, but more than one-third said they had no plans to use the cloud (see Figure 1). An online survey conducted in the summer of 2013 by consultancies Enterprise Management Associates and 9sight Consulting found a somewhat higher level of usage: Thirty-nine percent of the 259 respondents said their big data installations included cloud systems.
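For a rough sense of scale, the percentages above can be turned into approximate head counts. The sketch below assumes the published percentages were rounded to whole numbers, so the results are estimates rather than figures reported by either survey; the helper name `approx_count` is purely illustrative.

```python
def approx_count(total_respondents: int, percent: float) -> int:
    """Estimate how many respondents a rounded percentage represents."""
    return round(total_respondents * percent / 100)

# TDWI assessment: 222 respondents
tdwi_using_cloud = approx_count(222, 19)   # roughly 42 respondents
tdwi_considering = approx_count(222, 40)   # roughly 89 respondents

# EMA / 9sight survey: 259 respondents
ema_using_cloud = approx_count(259, 39)    # roughly 101 respondents

print(tdwi_using_cloud, tdwi_considering, ema_using_cloud)
```

The "more than one-third" group is left out because the article gives no exact percentage for it.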
The Weather Channel LLC is one company that has jumped into the public cloud, running replicated instances of Riak, a NoSQL database from Basho Technologies Inc., across multiple partitioned AWS availability zones to process and store a mix of data from satellites, radar systems, weather stations and other sources. The database helps feed forecast engines that update views of 36,000 geographical weather grids every five minutes; it's also used to archive historical data.
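To put that refresh rate in perspective, the back-of-the-envelope calculation below converts 36,000 grids every five minutes into average updates per second and per day. It assumes the refreshes are spread evenly over each interval, which the article does not state, and the function names are illustrative only.

```python
# Figures from the article: 36,000 weather grids refreshed every five minutes.
GRIDS = 36_000
REFRESH_INTERVAL_SECONDS = 5 * 60

def grid_updates_per_second(grids: int, interval_seconds: int) -> float:
    """Average grid updates per second, assuming an even spread."""
    return grids / interval_seconds

def grid_updates_per_day(grids: int, interval_seconds: int) -> int:
    """Total grid updates over 24 hours at a fixed refresh interval."""
    refreshes_per_day = (24 * 60 * 60) // interval_seconds
    return grids * refreshes_per_day

print(grid_updates_per_second(GRIDS, REFRESH_INTERVAL_SECONDS))  # 120.0
print(grid_updates_per_day(GRIDS, REFRESH_INTERVAL_SECONDS))     # 10368000
```

That works out to an average of about 120 grid updates per second, or a little more than 10 million per day, before counting the historical archive.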
Bryson Koehler, executive vice president and CIO at Atlanta-based TWC, credited the Riak technology for its fault tolerance and its support of both in-memory and disk-based processing. By comparison, he said, mainstream relational databases aren't geared to high-volume cloud environments, at least not at a low cost, because of processing inefficiencies. But implementing the NoSQL software in the cloud is part of a broader IT strategy aimed at giving TWC the flexibility to change course as needed. The company runs applications on Google's cloud as well as AWS, a move aimed at preventing it from becoming "too locked in to any provider or technology," Koehler said.
More cloud flavors to choose from
Public-cloud providers have expanded their data management capabilities well beyond plain-vanilla relational databases, partly in a quest to meet big data needs. For example, Amazon.com over the years has broadened its AWS cloud options to include technologies such as DynamoDB, a NoSQL database, and Elastic MapReduce, a Hadoop implementation, as well as the ElastiCache in-memory caching service, Redshift data warehouse and Kinesis streaming data system.
By this point, AWS and other cloud vendors have also created "pretty sophisticated services," according to David Linthicum, a senior vice president at consultancy Cloud Technology Partners in Boston. Some of the available cloud data management platforms are "in their fifth and sixth generations," he said. "These products have been beaten down and built back up."
For large companies with ample internal processing power, though, adding external cloud-based systems to manage pools of big data doesn't necessarily compute. "Why pay a subscription for something you already have? Customers that have invested hundreds of thousands of dollars in storage architectures aren't going to just walk away from that," said Aaron Ebertowski, lead infrastructure architect at Nimbo, a cloud services consultancy in Houston.
Performance requirements can also be a factor in steering big data users away from the public cloud. Ocean Networks Canada, a not-for-profit organization that operates a pair of ocean observatories in British Columbia, plans to set up an on-premises private cloud for an application that will use data from marine sensors to run simulations of earthquakes and tsunamis. The goal is to better predict the results of possible natural disasters so government authorities can take precautions to moderate their impact on people, said Benoit Pirenne, associate director of digital infrastructure at ONC.
Abundant big data power needed
The organization, which is based at the University of Victoria, got approval and funding for the three-year project last spring. The planned analysis work involves collecting a diverse set of measurements from the sensors and running predictive models to produce a large library of possible scenarios. But accomplishing that will require a lot of data and a formidable amount of computing power to crunch through it all, according to Pirenne.
"To calculate [the simulations] in real time is pretty much impossible even on a very fancy parallel cloud system," he said. As a result, ONC is working with IBM to build an internal cloud architecture to handle the processing and analysis workload.
The emerging managed services providers -- companies such as Altiscale, BitYota, Qubole, Treasure Data and Rackspace's ObjectRocket subsidiary -- claim that they can make big data cloud installations more convenient and cost-effective for willing user organizations by taking over deployment and administration tasks, and by doing so at a lower price than the cloud platform vendors charge.
Sellpoints Inc., an online marketing and analytics services provider in Emeryville, Calif., uses Hadoop together with the Spark processing engine to quickly build tables for querying "double digits of terabytes of compressed data" on the Web activity of consumers, said Benny Blum, the company's vice president of performance marketing and analytics.
Helping hands on Hadoop
Blum's group initially deployed its own Hadoop system on the Amazon Elastic Compute Cloud, or EC2, platform. But it has since moved the system to Altiscale's Hadoop-as-a-service offering. The service also runs on the Amazon cloud, but Blum said that offloading Hadoop configuration and management tasks is paying dividends for Sellpoints. "Altiscale manages the bare metal for us. So we don't have to pay the operational costs of maintaining the cluster," he said.
But thus far, at least, the users of such services tend to be emerging companies themselves. And most of the services providers still have customer bases in the single or double digits.
Rick Sherman, founder of consultancy Athena IT Solutions in Maynard, Mass., does see reasons to think that cloud-based Hadoop services in particular can take hold on a wider basis. "People are running up against a wall right now with roll-your-own Hadoop," he said. "It's a huge investment in time and skills. I think ultimately Hadoop as a service will be much more appealing than Hadoop on-premises."
Some organizations aren't ready for Hadoop in the public cloud at all, though -- and probably won't be for some time. That's the situation confronting Ayad Shammout, director of data platforms and business intelligence at Beth Israel Deaconess Medical Center in Boston. Last year, while working as an independent consultant, he collaborated with another consultant on a big data and cloud computing proof-of-concept project for BIDMC. Looking to reduce storage and processing demands on a SQL Server database, they used Azure HDInsight, Microsoft's cloud-based Hadoop distribution, to offload archives of application audit logs used for regulatory compliance reporting purposes to the Microsoft Azure cloud.
Shammout said the demonstration project showed the potential of running Hadoop applications in the cloud. And he thinks healthcare providers like BIDMC could eventually get there. But concerns about complying with the data privacy and security provisions of the federal Health Insurance Portability and Accountability Act stand in the way of a production deployment now, and perhaps for several more years.
"If I were talking with you about the cloud three or four years ago, I would have said, 'No, it is not going to happen,' " Shammout said. "Now I can see it for some departments. My expectation is that in another three or four years, [cloud data privacy] will not be the issue it is today."