Software vendors have gotten the message that Hadoop is hot -- and many are responding by releasing Hadoop connectors that are designed to make it easier for users to transfer information between traditional relational databases and the open source distributed processing system.
Oracle, Microsoft and IBM are among the vendors that have begun offering Hadoop connector software as part of their overall “big data” management strategies. But it isn’t just the relational database management system (RDBMS) market leaders that are getting in on the act. Data warehouse and analytical database vendors such as Teradata and Hewlett-Packard’s Vertica unit have also built connectors for linking Hadoop to SQL databases, as have data integration vendors like Informatica and Talend. Vendors of Hadoop distributions, including Cloudera and MapR Technologies, are in the connector camp as well.
What is Hadoop?
Hadoop is a Java-based programming framework that supports the processing of large data sets in a distributed computing environment.
Developed by the Apache Software Foundation as an open source project, Hadoop was originally based on Google’s MapReduce programming model, which lets developers break down applications into numerous small tasks that can be run in parallel on different computing nodes in clustered systems.
Hadoop makes it possible to run applications on clusters with thousands of nodes and terabytes of data; the Hadoop Distributed File System manages storage, facilitates data transfers among the nodes and enables a cluster to continue operating uninterrupted if individual nodes fail.
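To make the "numerous small tasks" idea concrete, here is a minimal single-machine sketch of the MapReduce pattern in plain Python. It does not use Hadoop's own APIs; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample input are illustrative assumptions. On a real cluster, the map and reduce calls would run in parallel on different nodes.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework would between the two phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over an iterable of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


if __name__ == "__main__":
    # Hypothetical sample input; each line could be handled by a separate map task.
    print(word_count(["Hadoop is hot", "hadoop is open source"]))
```

Because each call to `map_phase` depends only on its own line, and each call to `reduce_phase` only on its own key, the work can be split across machines without the tasks needing to coordinate.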
Organizations mulling the possibility of using connectors to link conventional database systems to Hadoop clusters should think about “where the best place is to analyze or search or sort or whatever it is you’re trying to do with your data,” said Rod Cope, an experienced Hadoop user who is chief technology officer at OpenLogic Inc. in Broomfield, Colo.
OpenLogic uses Hadoop in combination with HBase, a column-oriented NoSQL database that runs on top of the Hadoop Distributed File System, to keep track of open source software projects around the world. It’s all part of the company’s flagship service, which helps corporate customers audit software applications to verify that the use of embedded open source code complies with relevant licenses. OpenLogic has yet to deploy any connectors, but Cope has looked closely at the technology -- for example, as a possible means of moving infrequently accessed data from a relational database to HBase for archiving.
The connectors don’t magically solve all of the issues involved in such pairings, according to Cope. He cautioned that prospective users should be aware of just how long it can take to load data from a database into Hadoop. “It’s easy for people to forget when you have truly big data that anything you do with it takes a very long time,” Cope said. Typically, he added, “it’s not the Hadoop side that’s slow; it’s wherever you’re trying to load it from.”
David Menninger, an analyst at Ventana Research in San Ramon, Calif., said the Hadoop Distributed File System and specialized databases built on top of it are good at providing users with a place to manage and analyze information that doesn’t fit neatly into a traditional RDBMS or data warehouse. That might include machine-generated forms of big data, such as application, search and website event logs, plus social media information, mobile phone call detail records and other “stuff that just simply wouldn’t normally be thought of as structured relational information,” Menninger said.
One of the most common use cases for a Hadoop connector, he said, involves an organization using a Hadoop system to extract a small amount of structured analytical information from a much larger amount of unstructured data, then transferring that information to an RDBMS for further analysis and reporting with business intelligence tools.
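The pattern Menninger describes can be sketched in a few lines of Python. This is an illustration under stated assumptions, not any vendor's connector: the log format, the `event_counts` table and the function names are hypothetical, an in-memory SQLite database stands in for the RDBMS, and a simple loop stands in for the Hadoop job that would do the aggregation at scale.

```python
import sqlite3
from collections import Counter

# Hypothetical raw log lines; in practice these would be files stored in Hadoop.
RAW_LOG_LINES = [
    "2012-03-01 10:00:01 user=42 event=login",
    "2012-03-01 10:00:05 user=42 event=search",
    "malformed line",
    "2012-03-01 10:01:10 user=7 event=login",
]


def summarize(lines):
    """Reduce a large set of loosely structured lines to a small table of counts."""
    counts = Counter()
    for line in lines:
        for field in line.split():
            if field.startswith("event="):
                counts[field[len("event="):]] += 1
    return counts


def load_summary(conn, counts):
    """Write the summary into a relational table (the 'connector' step)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS event_counts (event TEXT PRIMARY KEY, total INTEGER)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO event_counts VALUES (?, ?)", sorted(counts.items())
    )
    conn.commit()


def report(conn):
    """Query the summary with ordinary SQL, as a BI tool would."""
    return conn.execute(
        "SELECT event, total FROM event_counts ORDER BY total DESC, event"
    ).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # stand-in for the RDBMS
    load_summary(conn, summarize(RAW_LOG_LINES))
    print(report(conn))
```

Only the small `event_counts` table crosses into the relational database; the raw lines, including the ones that do not parse, stay where they were.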
Hadoop connector motto: Everything in its right place
“The reason you put it into a relational database is because you can’t easily report on Hadoop data sources today,” Menninger said. “We have a whole industry of tools that has evolved for reporting on and analyzing relational data.”
Such data transfers don’t have to be a one-time deal. “Maybe you were counting occurrences of a certain event and later decide that you want to count the number of times that two events occurred together,” he said. “You go back to the source files and process the information again. That’s why people don’t throw the [unstructured] data away. They leave it in Hadoop.”
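Menninger's example of going back to the source files can be illustrated with a short Python sketch. The session data and function names are hypothetical; the point is only that a new question (which events occur together) can be answered from the retained raw records, whereas the earlier per-event totals alone would not be enough.

```python
from collections import Counter
from itertools import combinations

# Hypothetical raw records kept in Hadoop: the events seen in each user session.
SESSIONS = [
    ["login", "search", "purchase"],
    ["login", "search"],
    ["search", "purchase"],
]


def count_events(sessions):
    """First analysis: the number of sessions in which each event occurred."""
    return Counter(event for session in sessions for event in set(session))


def count_pairs(sessions):
    """Later analysis: the number of sessions in which two events occurred together."""
    pairs = Counter()
    for session in sessions:
        for pair in combinations(sorted(set(session)), 2):
            pairs[pair] += 1
    return pairs


if __name__ == "__main__":
    print(count_events(SESSIONS))
    print(count_pairs(SESSIONS))
```

The pair counts cannot be derived from the output of `count_events`; they require a second pass over the original sessions.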
In addition, Hadoop provides a much better environment for some advanced analytics and data mining applications than a SQL-based relational database does, Menninger said. One example he cited involves analyzing customer service call logs in combination with postings on Twitter, Facebook and other social media sites to try to identify customers who are likely to stop using a particular product or service.
“Those are hard things to express in SQL,” Menninger said. But, he added, the analytical results can then be sent via a Hadoop connector to a relational database or data warehouse for further analysis and reporting and to drive follow-up actions aimed at keeping customers from defecting.
Cameron Befus, vice president of engineering at Tynt Multimedia Inc., a Web analytics company in Sausalito, Calif., that was acquired by 33Across Inc. in January, said his organization uses Hadoop to provide analytics services for more than 500,000 publishing websites. In addition, Tynt runs Oracle’s open source MySQL database to power its back-office operations.
Thus far, Befus hasn’t seen the need to install connector software to integrate the two environments. “We do move data around a little bit, but it’s usually pretty straightforward,” he said, adding that the company directly loads files from Hadoop into MySQL. “A connector might make it slightly easier, but that just hasn’t been a problem for us.”
Nonetheless, IT analysts such as Menninger and Judith Hurwitz, president and CEO of Hurwitz and Associates in Needham, Mass., expect demand for connectors to gradually increase as more organizations become Hadoop users.
Like Menninger, Hurwitz thinks interest in the technology will be driven by companies looking to put the results of Hadoop-based analyses into a greater business context.
“When you’re looking at [big data], what you’re looking for is, ‘What is this data telling me about some critical issue?’ ” Hurwitz said. “[Users will] want to build bridges between this unstructured, streaming, ‘get a sense of things’ data and the very structured data that may include the details about how your company may be addressing those issues.”