While still seeking definition, "big data" applications are splitting into at least two major, definable camps -- big data at rest and big data in motion.
High volume, high variety and high velocity of data together have come to represent big data applications for many observers, but this last trait -- velocity -- brings special challenges that distinguish big-data-in-motion applications from their big-data-at-rest cousins.
"Data in motion is streaming data. With big data at rest, [data] comes in and you store it and you ask a question," said Roy Schulte, vice president and distinguished analyst at Stamford, Conn.-based IT research firm Gartner Inc. "That is, as opposed to continuous analytics."
Big data at rest resembles conventional data warehousing -- albeit with more data (and more unstructured data) than ever before. It most likely uses non-relational database technologies -- as opposed to traditional relational data warehouses -- to tame the data. Like data warehousing, big data at rest is focused on the needs of analytics or business intelligence applications.
Big data in motion, on the other hand, tends to resemble event processing architectures, and focuses on real-time or operational intelligence applications. It is being used in Wall Street trading and fraud detection, and for monitoring social media, sensor networks and Web applications.
To feed Hadoop data stores and MapReduce jobs in these operational applications, data architects could be forced to contend with a variety of advanced middleware infrastructure components. These might include in-memory data grids, complex event processors, streaming databases and more.
The Hadoop-style frameworks offer improved distributed processing compared to traditional data warehouses, yet they work on data in batches, which can result in unacceptable latency. "Typically, when you bring in the data, map the data and reduce it, it takes time. If you bring in a new bit of data, it's going to take 5 to 10 minutes before you can ask your first question," Schulte said. If the type of response time needed is "milliseconds, seconds or even a minute," he said, "Hadoop would be inadequate."
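The difference between asking a question of stored data and running continuous analytics can be sketched in a few lines of Python. This is a minimal illustration only, not how Hadoop or any streaming product works internally: the `RunningStats` class, the `batch_mean` function and the transaction amounts are all hypothetical names and values invented for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RunningStats:
    """Continuous analytics: the answer is updated as each event arrives."""
    count: int = 0
    total: float = 0.0

    def update(self, amount: float) -> None:
        self.count += 1
        self.total += amount

    @property
    def mean(self) -> float:
        return self.total / self.count if self.count else 0.0


def batch_mean(stored_amounts: List[float]) -> float:
    """Data at rest: store everything first, then scan it to answer a question."""
    if not stored_amounts:
        return 0.0
    return sum(stored_amounts) / len(stored_amounts)


# Hypothetical transaction amounts, used only for illustration.
events = [20.0, 35.0, 5.0, 40.0]

stats = RunningStats()
stored: List[float] = []
for amount in events:
    stored.append(amount)  # at rest: accumulate now, query in a later job
    stats.update(amount)   # in motion: the current mean is available right away

# Both approaches agree on the answer; they differ in when it becomes available.
assert stats.mean == batch_mean(stored)
```

The batch function has to wait until the data set is loaded and then rescan it, while the running version does a constant amount of work per event, which is the property that low-latency applications depend on.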
Curing latency with big data in motion
There are ways to discern whether big data implementation problems are actually problems with big data in motion. "They probably appear where latency is the primary concern -- where people are not just reporting day-to-day," said Jay Kreps, principal staff engineer at LinkedIn Corp. and an Apache Software Foundation project committer for Kafka, a distributed messaging system. For some operational applications, this latency issue arises when quick results are required but the analytical engine needs to populate an entire data set before launching a query.
To succeed in real time, organizations may need to work with a variety of software types, including event processing engines, fast messaging systems and Hadoop analytical tools, Kreps said. At the same time, users will likely want to support reporting systems that run their jobs on a day-to-day basis -- these do not need anywhere near the same low latency.
Kafka is a relatively low-level piece of infrastructure, especially compared with established data warehousing tools, Kreps said. It's built as a distributed system, and can serve real-time jobs that need data within milliseconds as well as offline processing jobs that, for example, might run daily.
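One way to picture how a single messaging layer can serve both kinds of jobs is an append-only log in which each consumer keeps its own read position. The sketch below is a toy, in-memory stand-in written for illustration; it does not use or reproduce Kafka's client API, and the `EventLog` class, the consumer names and the sample records are all assumptions made for the example.

```python
from typing import Dict, List


class EventLog:
    """Toy in-memory append-only log; a stand-in for a messaging system, not Kafka's API."""

    def __init__(self) -> None:
        self._records: List[str] = []
        self._offsets: Dict[str, int] = {}  # consumer name -> next position to read

    def append(self, record: str) -> None:
        self._records.append(record)

    def poll(self, consumer: str) -> List[str]:
        """Return the records this consumer has not yet read and advance its offset."""
        start = self._offsets.get(consumer, 0)
        end = len(self._records)
        self._offsets[consumer] = end
        return self._records[start:end]


log = EventLog()
realtime_seen: List[str] = []

for record in ["page_view:1", "page_view:2", "login:3"]:
    log.append(record)
    # The low-latency consumer reads each record as soon as it is written.
    realtime_seen.extend(log.poll("realtime-monitor"))

# The offline consumer reads the whole backlog in one pass, for example once a day.
daily_batch = log.poll("daily-report")
```

Because each consumer tracks its own offset, the daily reporting job can fall far behind the real-time monitor without either one affecting what the other sees.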
Big data goes real time
The infrastructure software that enables faster big data feeds is varied. Packages such as the Kafka messaging system have open source companions in the Flume integration framework for populating Hadoop, as well as the AMQP messaging protocol and RabbitMQ, a message broker that implements it. On the commercial front, IBM, Informatica Corp., Real-Time Innovations Inc., Red Hat Inc., Solace Systems Inc., Streambase, Terracotta, Tervela Inc., Tibco Software Inc., Vitria and others offer various low-latency or streaming software tools.
Some of the software tools for big data in motion come from what were originally Wall Street trading software efforts. "We've had big data for a long time, we just didn't realize it," said Barry Thompson, founder and CTO at Tervela, which claims financial market customers including Goldman Sachs. His firm has created a data fabric, or intelligent grid, for moving data around. The company's Turbo data fabric collects and transports data from various sources. In December, the software was certified for use with Cloudera Inc. Hadoop technologies.
"Loading Hadoop is a challenge. It is a bottleneck," Thompson said. "It is fine if you don't mind losing a couple of [log files], but on Wall Street, losing a cash flow in an interest rate analysis is a bad thing." Like others, he sees issues for Hadoop in operational applications. "A lot of big data today is an hour old or a day old. If that was the case on Wall Street," he remarked, "you could lose your shirt in an instant." The world beyond finance is looking for such speed, he asserted.
Infrastructure needs will change as more companies seek to solve real-time big data problems, said Vivek Ranadive, CEO at Tibco, an infrastructure software company that recently passed the $1-billion-per-year revenue milestone. He includes event processors and ultrafast messaging systems among the important infrastructure elements in what's been called "fast big data."
Whether it is a bank or, for that matter, a sports team, "every company is going to redefine its business as a social network," Ranadive claimed. For operational business intelligence, the goal is to deliver data almost before the fact. "There is no point in knowing that a fraud was committed after the money is gone," he said.