Search functionality is being added to the Hadoop ecosystem at a fast and furious pace. And since open source is a big part of the Hadoop ethos, the technology of choice is often the open source Lucene search engine. In fact, it's like a reunion of old friends: Apache Hadoop started as an offshoot of the Apache Lucene project.
It makes sense for Hadoop and search tools to come together. Recent attention around the use of Hadoop clusters and NoSQL databases for managing unstructured and semi-structured data has obscured the substantial progress made over a longer period in text search technology -- improvements from which Hadoop users stand to benefit as they look to distill business value from the streams of big data cascading into their systems.
Text is literally everywhere, and business users accustomed to quickly finding all manner of information online through Google searches increasingly expect it to be easily searchable.
Google's search engine is so familiar to people that it could be called a known known, to use former U.S. Secretary of Defense Donald Rumsfeld's off-beat parlance for something that we know we know. On the other hand, searches run as part of Hadoop applications may help uncover "unknown unknowns," another zen-like Rumsfeldian coinage that means "things that we don't know we don't know." Although the phrase seems a bit nonsensical, it does have application in business analytics, where ways of doing things are changing.
Unknowns make good
There's another stage between known knowns and unknown unknowns. That is "known unknowns" -- things that we know we don't know. From a business intelligence and analytics standpoint, this is where mainstream BI tools, relational databases and data warehouses have served stalwartly over many years. They were built to help end users answer specific, often predefined questions on business operations.
But a lot of organizations are now looking to supplement that approach with more freewheeling analytical methods. That, in turn, is driving vendors to connect Hadoop to the Lucene search software.
One of the big potential benefits of Hadoop systems is enabling organizations to gather large amounts of data and then figure out later what to do with it. Looking at Hadoop data through the lens of search technology provides a way to examine it from different angles, creating opportunities to gain insights that might not be uncovered with canned query methods.
"The big thing about deployment of big data tools today is that people don't pretend to know the right question to ask ahead of time," said Dan Kusnetzky, founder of market research and consulting company Kusnetzky Group LLC. "While with traditional transactional and business systems, they know what questions they have."
Searching for clues in Hadoop data
As a result, he added, companies need to change how they manage and sort through the "huge mounds of data that they're collecting" in Hadoop clusters in order to make effective analytical use of the information. "Now, you store the data in a way that makes it easy to query based on the questions on your mind, instead of in the manner of traditional business intelligence that was based on certain key queries," Kusnetzky said. "That is no longer as useful."
Recent product moves show search being added to mainstay Hadoop offerings. For example, Cloudera Inc. this month made a Lucene-based Cloudera Search tool generally available for use with its Hadoop distribution.
Earlier in the summer, MapR Technologies Inc. began distributing Lucene-derived search functionality with its namesake Hadoop platform. And LucidWorks, which offers enterprise-class search development platforms built on the core Lucene search engine library and its companion Apache Solr search server, has forged alliances with all three of the Hadoop distro "pure plays" -- Cloudera, MapR and Hortonworks Inc.
With the upcoming release of a documentary movie about Donald Rumsfeld, slyly titled The Unknown Known, you may hear more discussion of the concept of knowns and unknowns that he introduced. Meanwhile, business analysts, data scientists and other analytics pros will take new hacks at new piles of data, looking to turn informational unknowns into exploitable known quantities -- in many cases aided by the combination of Hadoop and search tools.
This was first published in September 2013