Managing Hadoop projects: What you need to know to succeed
The data arrives in ever greater amounts at Solutionary Inc., an Omaha, Neb.-based managed security services provider. Handling that data with familiar relational database technology proved difficult, so the company's software development leader turned to Java-based Hadoop data technologies, including MapR HBase, a data store often deployed alongside Hadoop.
At the outset of the project, Scott Russmann, Solutionary's director of software engineering, saw an ever-growing need for expanded use of Oracle Database with Real Application Clusters (RAC), but that expansion meant greater cost on a per-CPU basis.
"We look at advanced persistent threats. It's like a needle in a haystack of haystacks," Russmann said. "As a result, our storage needs were growing rapidly. As your [storage and processing] needs grow, your CPU costs grow."
As he described it, the company deploys its own ActiveGuard security software in customer data centers, and the software in turn gathers forensic log data for analysis by Solutionary's in-house security experts. The experts then correlate and compare log data to other contextual data to disclose patterns that point to security threats.
Facing what he called "a bit of a data availability conundrum," Russmann and his team looked to the emerging Hadoop distributed data clustering framework and its associated MapR HBase NoSQL key-value store.
Russmann gives the software good grades for scalability and performance, the latter an area where some HBase users have struggled. He said he finds MapR's performance enhancements useful for both the Hadoop and HBase portions of his application.
"MapR's Hadoop file system was the savior for our storage needs. Now we have a horizontally scalable storage mechanism and a processor-storage ratio that is appropriate as you scale," he said.
Oracle RAC is now used for contextual data and metadata, he said, while MapR HBase is used to weave together contextual and raw log data. As further scaling is required, Solutionary can stay ahead of the "Oracle RAC growth curve" while protecting customers against new security threats, according to Russmann.
Who's on MapR HBase?
One early HBase performance issue Russmann uncovered -- and he is not alone -- is a matter of plumbing, one that few data professionals will wake up in the morning and look forward to conquering. The plumbing matter is garbage collection. Put in Russmann's words: "Large memory management with Java is problematic, largely due to garbage collection."
HBase, like HDFS and other technologies built on Java, can encounter issues with garbage collection -- the reclamation of unused software objects. Java largely did away with coders' need to manage memory heap cleanups, at least compared with C++, so its occasional garbage collection issues often come as a surprise. But there are cases, especially in large-scale, highly distributed implementations, where standard Java collection is inadequate.
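Garbage collection activity is not hidden from Java programs entirely: the JVM exposes cumulative collection counts and pause time through standard management beans. The sketch below (illustrative only; the class name and allocation pattern are invented for the example, and actual pause times depend on heap size and collector) churns the heap with short-lived objects and reports how much time the JVM spent collecting:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// Sketch: observe garbage-collection activity from inside a JVM.
// Allocating and discarding many objects forces collections; the
// GarbageCollectorMXBeans report the accumulated collection time.
public class GcPauseDemo {
    // Sum of reported collection time across all collectors, in milliseconds.
    static long totalGcTimeMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            total += Math.max(0, gc.getCollectionTime()); // returns -1 if unsupported
        }
        return total;
    }

    public static void main(String[] args) {
        long before = totalGcTimeMillis();
        // Churn the heap: allocate short-lived garbage to trigger collections.
        List<byte[]> window = new ArrayList<>();
        for (int i = 0; i < 200_000; i++) {
            window.add(new byte[1024]);
            if (window.size() > 1_000) {
                window.clear(); // drop references so the arrays become garbage
            }
        }
        long after = totalGcTimeMillis();
        System.out.println("Accumulated GC time: " + (after - before) + " ms");
    }
}
```

At HBase scale the same mechanism is at work, but with multi-gigabyte heaps a single collection can pause the region server long enough to disrupt clients.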
Russmann, like others, has seen that as HBase has taken on more and more data work, heap sizes have grown. Processing sometimes halts during a pause for garbage collection that proves to be too long. This is something that programmers can fix. But the effort can be considerable. Russmann found favor with MapR's approach to garbage collection, which amounts to a full-scale assault on the problem.
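The "considerable effort" of fixing this in standard Apache HBase typically means hand-tuning the region servers' JVM. A minimal sketch of the kind of settings operators of that era applied in conf/hbase-env.sh (illustrative values only; appropriate flags depend on heap size, JVM version and workload):

```shell
# Hypothetical GC tuning for an HBase region server (conf/hbase-env.sh).
# CMS trades throughput for shorter pauses; starting collection early
# (at 70% old-gen occupancy) helps avoid long stop-the-world fallbacks.
export HBASE_OPTS="$HBASE_OPTS \
  -Xms8g -Xmx8g \
  -XX:+UseConcMarkSweepGC \
  -XX:+UseParNewGC \
  -XX:CMSInitiatingOccupancyFraction=70"
```

MapR's alternative, described below, was to sidestep this tuning exercise rather than refine it.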
Whither the true Java flame?
MapR HBase performance enhancements were central to MapR's recent M7 data management platform update, according to Jack Norris, who is the company's chief marketing officer. Speaking at last month's Strata East 2013 in New York, he said MapR rewrote its underlying storage layer to handle files and tables together, while redesigning its architecture so that HBase applications run directly on the MapR platform without data compactions.
Moreover, MapR took Java out of the M7 platform equation. MapR wrote M7 in C/C++, thus forgoing two layers of Java Virtual Machines (JVMs) and standard Java garbage collection.
Back in the day, that kind of move would have brought down the wrath of the keeper of the true Java flame -- Sun Microsystems. To be 100% compatible, you had to be "all Java," and not mix and match languages as programmers had often done before.
Sun would take away your Java coffee cup logo for much less than what MapR has done. These are different times, but making these kinds of changes to big cogs in the Hadoop ecosystem is still not without controversy.
Hadoop's open source underpinnings are a big part of its quick ascent. Committers to the Apache HBase project, who have been working to improve HBase's garbage collection, point out that what MapR has done can be called "HBase," but not "Apache HBase."
At Strata East, Norris told a crowd that, while Hadoop is a market created by open source technology, "it's appropriate, and, in many cases required, to combine open source code with innovations to meet customer requirements." That has been part of the MapR Hadoop -- and now HBase -- strategy from the get-go, and a distinguishing trait versus Hadoop upstarts Cloudera and Hortonworks. For an application like Solutionary's, immediate practical issues win out, but the MapR strategy will be tested as the Hadoop ecosystem continues to mature.