Big data adoption presents new challenges to data managers already grappling with multiple data silos. A rash of recent open source Apache Hadoop and NoSQL technology innovations could augur more of the same. Few have a closer view of the phenomenon than Ron Bodkin, co-founder and CEO at Think Big Analytics. The Mountain View, Calif., consulting and development services provider focuses on the Hadoop ecosystem, machine learning in Hadoop, real-time querying and other aspects of big data. He sees a lot of people offloading warehouse workloads, but there is much more going on as well.
What is a current situation you see when people begin to take the road to Hadoop?
Ron Bodkin: There's a range of uses. Some are cost optimizations, and those are often among the first projects people want to do. And there are companies that are trying to work with data that they could never work with in existing warehouses -- things that were formerly impossible or cost prohibitive.
For example, people hadn't been able to keep a year of detailed data or raw, event-level data from logs. In traditional environments, analyzing such data was sometimes infeasible. They may now want to work on click-stream analysis of a website, and they really want to get deep inside and see patterns of interaction that indicate who is likely to buy. When they try to do these kinds of analyses in a data warehouse, they find the analyses are resource intensive and can be expensive, and the analysts can't all really play along.
We see a range of databases people are running -- Oracle, Teradata, Netezza, Vertica and so on. They may look to shift the data-preparation workload out of those databases. We even see companies that have done a certain amount of processing on the mainframe, and are interested in moving that logic off to Hadoop, because it is cost efficient and also because it is increasingly hard to find the COBOL programmers to maintain that logic.
Is it the case that, once companies have departmental success, there comes a drive to consolidate some of the Hadoop efforts centrally?
Bodkin: Yes. Once companies get to that level of success, we do see a desire to centralize -- because so much of the value of big data is in breaking down silos. It is about being able to put together different data sets in new ways. So, instead of having lots of different applications in different databases, you have a more common form of data storage and a way of getting deeper insight that can serve a variety of functions in the business.
One approach we have seen succeed is having cross-functional teams start by brainstorming opportunities to use data and analytics to drive better outcomes. They should put together a roadmap that looks at ways of changing the organization to be more data driven, and then identify specific projects to deliver that value. It could be creating better offers to respond to customers. It could be new ways to deal with fraud, risk and compliance that go beyond transactional data streams by adding data sets like text data and social data.
It's not just Hadoop. It's also NoSQL. Are there times when people start out and try to apply one when they should apply the other?
Bodkin: Yes, definitely. There's a range of understanding. People may use a NoSQL database when they need massively scalable batch analytics, even though NoSQL databases are generally not designed to support that kind of throughput. People will pick one tool, often a NoSQL database, and use it for a variety of purposes, not realizing they may have grown into a need to have more than one technology. Conversely, people may start with Hadoop and not realize it is time to add a NoSQL database for low-latency access to data as well.
Do you have to have a highly skilled data scientist to figure it all out? Is big data adoption really open to business users?
Bodkin: I think there is a range of skills and capabilities that benefit from Hadoop and big data analytics. Often customers start by asking questions that aren't deep data science questions but are instead basic questions they could never answer before. So, analysts and business users get a lot of value out of putting together a big data system where the data is available and they start to ask questions. Once you do that, it opens up the opportunity to do data science and to do deeper analysis.
For many companies, the first stage of their journey is about getting agile insights by just looking at the data, and only then moving on to more data science. A lot of times there is this misconception that you have to leap to predictive models. A lot of the work starts with gaining basic insights, and then shifts to automation and prediction as the organization matures.