Business Information

Technology insights for the data-driven enterprise


Big data and cloud computing look for bigger foothold in enterprises

Deploying big data systems in the cloud has become a popular path for Web-oriented companies. But many traditional enterprises aren't going there yet with their vaults of big data.

Big data in the cloud is something like science-fiction writer William Gibson's famous description of the future: It's here -- it's just not very evenly distributed.

High-profile vendors such as Amazon Web Services, Google, Microsoft, IBM and Rackspace offer cloud-based Hadoop and NoSQL database platforms to support big data applications. A variety of startups have introduced managed services that run on top of the cloud platforms, freeing users from the need to deploy their own systems there. And mixing big data and cloud computing is often the first choice for Internet companies, especially software and data services vendors that are just getting started themselves.

But many mainstream organizations don't view managing data in the cloud the way the Web wunderkinds do. Some get white knuckles about data security and privacy protections in the cloud. Others still run most of their operations on mainframes and other well-entrenched systems far removed from cloud architectures, and the sheer mass of data stored in those systems makes moving it to the cloud physically challenging. Moreover, spare processing capacity in existing data centers makes the promised financial benefits of public clouds like AWS and the Google Cloud Platform less compelling, even to companies attracted by the reduced costs and increased flexibility that cloud-based systems can provide.

Citigroup Inc. is a case in point. The financial services company is faced with an incoming flood of unstructured data as the Web becomes a ubiquitous application interface. It also has to deal with a mix of different data structures in online financial applications. Those challenges led Citi to adopt MongoDB's namesake NoSQL database. MongoDB is supported on AWS and other cloud platforms, and Citi is taking a cloud approach with the software, said Michael Simone, global head of CitiData platform engineering. But it's a private cloud, built within the confines of the New York company's corporate firewall and fully managed by its IT department.
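The appeal of a document database for Citi's mix of data structures is that records with different shapes can live side by side in one collection. The following is a minimal pure-Python stand-in for that idea (no MongoDB server required); the record fields and customer IDs are hypothetical, purely for illustration.

```python
# A minimal stand-in for a document collection: unlike a relational table,
# each record can carry a different set of fields.
def find(collection, **criteria):
    """Return documents whose fields match all of the given criteria."""
    return [doc for doc in collection
            if all(doc.get(k) == v for k, v in criteria.items())]

# Hypothetical records: a card transaction and a web session share only
# a customer_id -- no fixed schema forces them into one shape.
events = [
    {"customer_id": "c-100", "type": "card_txn", "amount_usd": 42.50},
    {"customer_id": "c-100", "type": "web_session", "pages_viewed": 7,
     "user_agent": "Mozilla/5.0"},
    {"customer_id": "c-200", "type": "card_txn", "amount_usd": 9.99},
]

matches = find(events, customer_id="c-100")  # both record shapes come back
```

In MongoDB itself, the equivalent query would be a `find()` on the collection with the same field-match semantics; the point is that adding a new event shape requires no schema migration.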

"For now, we're not committing to extended or public cloud integration," Simone told attendees at the MongoDB World conference in New York in June. "Citigroup's data centers are vast and deep themselves, and we feel we can build an economical [on-premises] cloud."

Big data cloud not crowded yet

Overall, running big data systems in the cloud is still a minority affair. Of 222 IT and business professionals who completed an online assessment between October 2013 and last May against a big data maturity model developed by The Data Warehousing Institute, only 19% said their organizations were using public, private or hybrid clouds to support big data applications. Another 40% said cloud deployments were under consideration, but more than one-third said they had no plans to use the cloud (see Figure 1). An online survey conducted in the summer of 2013 by consultancies Enterprise Management Associates and 9sight Consulting found a somewhat higher level of usage: Thirty-nine percent of the 259 respondents said their big data installations included cloud systems.

Figure 1: Partly cloudy conditions

The Weather Channel LLC is one company that has jumped into the public cloud, running replicated instances of Riak, a NoSQL database from Basho Technologies Inc., across multiple partitioned AWS availability zones to process and store a mix of data from satellites, radar systems, weather stations and other sources. The database helps feed forecast engines that update views of 36,000 geographical weather grids every five minutes; it's also used to archive historical data.
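A key-value store like Riak fits this workload because each weather grid cell maps naturally to a key that gets overwritten or versioned on every refresh cycle. The sketch below shows one way such keys could be derived; the grid layout, cell size and key format are assumptions for illustration, not TWC's actual scheme.

```python
from datetime import datetime, timezone

def grid_key(lat, lon, when, cell_deg=1.0, refresh_minutes=5):
    """Build a stable key for a key-value store: one key per grid cell
    per refresh interval, so each five-minute update lands in the same
    logical slot. Grid scheme here is illustrative only."""
    row = int((lat + 90) // cell_deg)    # 0..179 latitude bands
    col = int((lon + 180) // cell_deg)   # 0..359 longitude bands
    # Truncate the timestamp to the start of its refresh window
    bucket = when.replace(second=0, microsecond=0,
                          minute=when.minute - when.minute % refresh_minutes)
    return f"grid:{row:03d}:{col:03d}:{bucket.isoformat()}"

# Atlanta-area cell at 12:03 UTC falls in the 12:00 refresh window
key = grid_key(33.75, -84.39, datetime(2014, 6, 1, 12, 3, tzinfo=timezone.utc))
```

Because the key is deterministic, replicated database nodes in different availability zones agree on where any cell's current forecast lives, and historical buckets accumulate naturally for archiving.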


Bryson Koehler, executive vice president and CIO at Atlanta-based TWC, credited the Riak technology for its fault tolerance and its support of both in-memory and disk-based processing. By comparison, he said, mainstream relational databases aren't geared to high-volume cloud environments -- at least not at low cost, because of processing inefficiencies. But implementing the NoSQL software in the cloud is part of a broader IT strategy aimed at giving TWC the flexibility to change course as needed. The company runs applications on Google's cloud as well as AWS, a move aimed at preventing it from becoming "too locked in to any provider or technology," Koehler said.

More cloud flavors to choose from

Public-cloud providers have expanded their data management capabilities well beyond plain-vanilla relational databases, partly in a quest to meet big data needs. For example, Amazon has over the years broadened its AWS cloud options to include technologies such as DynamoDB, a NoSQL database, and Elastic MapReduce, a Hadoop implementation, as well as the ElastiCache in-memory caching service, Redshift data warehouse and Kinesis streaming data system.

By this point, AWS and other cloud vendors have also created "pretty sophisticated services," according to David Linthicum, a senior vice president at consultancy Cloud Technology Partners in Boston. Some of the available cloud data management platforms are "in their fifth and sixth generations," he said. "These products have been beaten down and built back up."

For large companies with ample internal processing power, though, adding external cloud-based systems to manage pools of big data doesn't necessarily compute. "Why pay a subscription for something you already have? Customers that have invested hundreds of thousands of dollars in storage architectures aren't going to just walk away from that," said Aaron Ebertowski, lead infrastructure architect at Nimbo, a cloud services consultancy in Houston.
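Ebertowski's point is ultimately arithmetic: a public-cloud subscription competes against capital already spent. A back-of-the-envelope break-even calculation makes the trade-off concrete; all dollar figures below are made up for illustration.

```python
def breakeven_months(onprem_capex, onprem_monthly_opex, cloud_monthly_fee):
    """Months until cumulative cloud fees exceed the cost of an
    on-premises buildout. Returns None if the cloud stays cheaper
    at the given rates."""
    extra_per_month = cloud_monthly_fee - onprem_monthly_opex
    if extra_per_month <= 0:
        return None  # cloud never catches up at these rates
    # Smallest whole number of months m with m * extra >= capex
    return -(-onprem_capex // extra_per_month)  # ceiling division

# Hypothetical numbers: $300K storage buildout vs. a $15K/month service
months = breakeven_months(onprem_capex=300_000,
                          onprem_monthly_opex=5_000,
                          cloud_monthly_fee=15_000)  # 30 months
```

A company that has already sunk the capex effectively sets it to zero in this model, which is why, as Ebertowski notes, the subscription rarely looks attractive to them.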

Performance requirements can also be a factor in steering big data users away from the public cloud. Ocean Networks Canada, a not-for-profit organization that operates a pair of ocean observatories in British Columbia, plans to set up an on-premises private cloud for an application that will use data from marine sensors to run simulations of earthquakes and tsunamis. The goal is to better predict the results of possible natural disasters so government authorities can take precautions to moderate their impact on people, said Benoit Pirenne, associate director of digital infrastructure at ONC.

Abundant big data power needed

The organization, which is based at the University of Victoria, got approval and funding for the three-year project last spring. The planned analysis work involves collecting a diverse set of measurements from the sensors and running predictive models to produce a large library of possible scenarios. But accomplishing that will require a lot of data and a formidable amount of computing power to crunch through it all, according to Pirenne.

"To calculate [the simulations] in real time is pretty much impossible even on a very fancy parallel cloud system," he said. As a result, ONC is working with IBM to build an internal cloud architecture to handle the processing and analysis workload.


The emerging managed services providers -- companies such as Altiscale, BitYota, Qubole, Treasure Data and Rackspace's ObjectRocket subsidiary -- claim that they can make big data cloud installations more convenient and cost-effective for willing user organizations by taking over deployment and administration tasks, often at a lower price than the cloud platform vendors charge.

Sellpoints Inc., an online marketing and analytics services provider in Emeryville, Calif., uses Hadoop together with the Spark processing engine to quickly build tables for querying "double digits of terabytes of compressed data" on the Web activity of consumers, said Benny Blum, the company's vice president of performance marketing and analytics.

Helping hands on Hadoop

Blum's group initially deployed its own Hadoop system on the Amazon Elastic Compute Cloud, or EC2, platform. But now it has relocated the system to Altiscale's Hadoop as a service offering. The service also runs on the Amazon cloud, but Blum said that offloading Hadoop configuration and management tasks is paying dividends for Sellpoints. "Altiscale manages the bare metal for us. So we don't have to pay the operational costs of maintaining the cluster," he said.
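The kind of job Sellpoints runs -- boiling raw clickstream events down into query-ready tables -- follows the classic map/reduce pattern that Hadoop and Spark distribute across a cluster. The sketch below runs the same aggregation on one machine in plain Python; the event records are invented, and in Spark the per-key summation would be a `reduceByKey` over a distributed dataset rather than a local loop.

```python
from collections import defaultdict

# Hypothetical clickstream records of the kind a Hadoop/Spark job would
# aggregate into a query-ready table: (consumer_id, url) pairs.
events = [
    ("u1", "/product/42"), ("u1", "/cart"), ("u2", "/product/42"),
    ("u1", "/checkout"), ("u2", "/product/7"),
]

# Map phase: each event contributes a count of 1 keyed by consumer.
# Reduce phase: sum the counts per key.
counts = defaultdict(int)
for consumer_id, _url in events:
    counts[consumer_id] += 1

# The "table" Blum's group would then query
table = sorted(counts.items())
```

At double-digit terabytes, the value of a managed service is that someone else keeps the cluster this pattern runs on healthy.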

But thus far, at least, the users of such services tend to be emerging companies themselves. And most of the services providers still have customer bases in the single or double digits.

Rick Sherman, founder of consultancy Athena IT Solutions in Maynard, Mass., does see reasons to think that cloud-based Hadoop services in particular can take hold on a wider basis. "People are running up against a wall right now with roll-your-own Hadoop," he said. "It's a huge investment in time and skills. I think ultimately Hadoop as a service will be much more appealing than Hadoop on-premises."

Some organizations aren't ready for Hadoop in the public cloud at all, though -- and probably won't be for some time. That's the situation confronting Ayad Shammout, director of data platforms and business intelligence at Beth Israel Deaconess Medical Center in Boston. Last year, while working as an independent consultant, he collaborated with another consultant on a big data and cloud computing proof-of-concept project for BIDMC. Looking to reduce storage and processing demands on a SQL Server database, they used Azure HDInsight, Microsoft's cloud-based Hadoop distribution, to offload archives of application audit logs used for regulatory compliance reporting purposes to the Microsoft Azure cloud.
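Offloading audit-log archives in this way usually means laying the files out in cloud storage under date-partitioned paths that Hadoop tools can scan by prefix. The sketch below shows one such layout; the container name and path scheme are illustrative assumptions, not the structure BIDMC's proof of concept actually used.

```python
from datetime import date

def archive_path(log_date, container="auditlogs"):
    """Date-partitioned blob path for offloaded audit-log archives --
    a year/month/day layout lets Hadoop jobs prune to the date range a
    compliance report needs. Naming scheme is hypothetical."""
    return (f"{container}/year={log_date.year}/"
            f"month={log_date.month:02d}/day={log_date.day:02d}/audit.log.gz")

path = archive_path(date(2014, 3, 9))
```

A compliance query for one month then reads only the blobs under that month's prefix instead of rescanning the whole archive, which is what relieves the pressure on the source SQL Server database.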

Shammout said the demonstration project showed the potential of running Hadoop applications in the cloud. And he thinks healthcare providers like BIDMC could eventually get there. But concerns about complying with the data privacy and security provisions of the federal Health Insurance Portability and Accountability Act stand in the way of a production deployment now, and perhaps for several more years.

"If I were talking with you about the cloud three or four years ago, I would have said, 'No, it is not going to happen,' " Shammout said. "Now I can see it for some departments. My expectation is that in another three or four years, [cloud data privacy] will not be the issue it is today."

Jack Vaughan is SearchDataManagement's news and site editor. Follow the site on Twitter: @sDataManagement.


Next Steps

Read about the cloud-based big data system implemented by a marketing analytics company

Get more advice from consultant David Linthicum on designing a big data cloud architecture

See what big data user InsideTrack did to integrate its cloud and on-premises data

Discover three ways to build a big data system



Join the conversation




Public or private cloud: Which is the better location for big data cloud computing applications?
I view this as "risk adjusted computing" and a balance between trust and elasticity. You need to decide if your data item/field should be exposed in public, community, private (corporate or outsourced) or on premises. This is a granular balance of risk vs. operational benefits.
It really depends on your application use-cases and internal skills. If your application uses mostly data generated outside the business, your big data skills are limited, or your consumption has a decent amount of variability - then using public services might make more sense. If you have strong internal big data skills, security & compliance needs to remain internal, or you don't need to take advantage of variable resources, then over longer periods of time, internal resources might be less expensive. It's very important to consider where data is generated and which additional applications it must interact with. Moving large amounts of data can be slow and costly. 
If you bring in the need to create differentiation through innovation, then the question is answered: private clouds are better, because you can leverage new computing and networking architectures to speed up your infrastructure. Hardware acceleration and high-performance networks can make a custom cluster one or two orders of magnitude faster than AWS.

If you are not pushing hard and don't view IT as strategic, then public clouds are the best choice, as they provide just-in-time availability.
I would say this entirely depends on:

1. The type of data being stored in the cloud
2. The level of privacy and security controls needed to protect it
3. The company's level of tolerance for risk in possible data breaches.

No security will prevent an insider from being fooled into giving up a credential, but there are ways to protect data and obfuscate where it's located in such a manner that getting access to one system may not compromise the whole. Honestly, I think it boils down to the sensitivity of the data being stored. The more highly prized it may be to potential attackers, the more strongly it should be protected.

Let's also remember that there is no reason not to use a hybrid cloud strategy. That way you can keep your more sensitive data internal and still leverage the strength and elasticity of the public cloud.
Good point, Veretax. It’s a good idea to consider the hybrid cloud if you’re a large organization with ample amounts of internal data center space and investments in your own IT architecture.
