
Dip in Hadoop data lake can be bracing for big data users

Encouraged by the promise of cost savings and better efficiency, early adopters are wading into Hadoop as a central reservoir for their analytics data.


Customer relations is the cornerstone of service-oriented companies. A slogan like Allstate Insurance Co.'s "You're in good hands" says it all. But behind that friendly tagline, there's a business to be run. Creating a great customer experience in every aspect of the insurance process is one of Allstate's goals -- but so is making money.

To help it meet those twin goals, Allstate has deployed a Hadoop-based data lake to support advanced analytics applications aimed at improving its business operations. Data analysts such as Mark Slusar, a quantitative research and analytics fellow at the Northbrook, Ill., insurer, are using the Hadoop system to fish through decades' worth of data that until now was floating around in different databases. The open source framework's distributed processing capabilities let Slusar and his colleagues explore large sets of data on policies, claims and property losses in an effort to identify patterns, trends and insights that point to new business opportunities and beneficial changes in Allstate's strategies and processes.

"Previously, a lot of the data we looked at was only at the state level because data at the country-wide level was so large that we didn't have an effective way to work with it," Slusar said. But the data lake, based on Cloudera's Hadoop distribution, puts nationwide information in the grasp of the analytics team. Now the data "is more organized and centrally located, and the computing power is leaps and bounds faster than before," he said. "What used to take months now takes hours."

Hadoop -- once used primarily by large Web companies such as Yahoo, Facebook and Twitter -- works across clusters of low-cost, commodity servers and storage systems. The cost savings that architecture offers are one reason why companies in other data-intensive industries, such as telecommunications, healthcare, manufacturing and financial services, are jumping on the Hadoop bandwagon. The data lake concept takes Hadoop deployments to their extreme, creating a potentially limitless reservoir for disparate collections of structured, semi-structured and unstructured data generated by transaction systems, social networks, server logs, sensors and other sources. And in the most extreme cases, Hadoop becomes the centerpiece of analytics architectures.
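To make the reservoir idea concrete, here's a minimal sketch of what landing raw data from disparate sources in HDFS can look like, assuming the standard hdfs dfs command-line tool is installed; the directory layout and file names are hypothetical, not drawn from any company mentioned here.

    import subprocess

    # Hypothetical landing-zone layout: files are kept in their native
    # formats, so nothing is force-fit into a single schema up front.
    sources = {
        "/datalake/raw/transactions": "exports/policy_transactions.csv",  # structured
        "/datalake/raw/server_logs": "exports/app_server.log",            # semi-structured
        "/datalake/raw/social": "exports/mentions.json",                  # unstructured-ish
    }

    for hdfs_dir, local_file in sources.items():
        # Create the target directory, then copy the raw file into HDFS as-is.
        subprocess.run(["hdfs", "dfs", "-mkdir", "-p", hdfs_dir], check=True)
        subprocess.run(["hdfs", "dfs", "-put", "-f", local_file, hdfs_dir], check=True)

Schema decisions are deferred until the data is read, which is what distinguishes a lake from a warehouse-first pipeline.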


Once all that info is in the pool together, the theory goes, companies can apply analytics across it to help increase operational efficiency, boost sales and create a more connected and personalized experience for customers. In addition, filtered subsets of the data can be made available for analysis inside the Hadoop system or sent off to data warehouses and NoSQL databases for end users to access.

A prize analytics catch

One area ripe for improvement that emerged from the depths of Allstate's databases through its data lake was the process of underwriting homeowner policies, which typically can't be done until a property inspection takes place. That usually costs Allstate a few hundred dollars for a home inspector, plus it disrupts the prospective customer's day. And sometimes, it's just not necessary, Slusar said. So Allstate's analytics team used the Hadoop system to identify when it was OK to skip an inspection.

"We were able to go through historical data for different neighborhoods and apply predictive algorithms, which identified areas where we could cut out inspections," Slusar said, adding that the number of inspections being done was reduced by 20%. That saved the company more than $3 million in 2014, according to Allstate officials.

The data lake blends the historical data with new info -- for example, sensor data transmitted via cellular networks from cars, which Allstate uses to monitor mileage, driving and braking speeds, hours spent on the road and other metrics for auto insurance customers who can qualify for premium discounts if they're deemed to be safe drivers.
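The rollup from raw trip records to per-driver metrics is straightforward to sketch. Assuming a hypothetical per-trip telematics feed (the field names and the qualification rule below are invented, not Allstate's actual criteria), the aggregation might look like this with pandas:

    import pandas as pd

    # Hypothetical feed: one JSON record per trip, transmitted from the car.
    trips = pd.read_json("telematics_trips.json", lines=True)

    # Roll the raw records up to per-driver metrics like those described.
    per_driver = trips.groupby("driver_id").agg(
        total_miles=("miles", "sum"),
        hours_on_road=("duration_hours", "sum"),
        hard_brakes=("hard_brake_events", "sum"),
        avg_speed=("avg_speed_mph", "mean"),
    )

    # Illustrative qualification rule; real criteria would be actuarial.
    per_driver["discount_candidate"] = (
        per_driver["hard_brakes"] / per_driver["total_miles"] < 0.01)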


Hadoop clusters often start out as less grandiose data stores that function more as feeder systems than alternatives to traditional data warehouses.

"Some treat it as an initial landing zone and use it to figure out what data will be processed and sent downstream," said Jack Norris, chief marketing officer at Hadoop vendor MapR Technologies. Turning such systems into full-fledged data lakes that support a variety of analytics uses and applications is a big step up. "To make that leap," Norris added, "you have to have enterprise-grade features that include the same SLA and data protection capabilities that are in the data center today."

That's where Cloudera, MapR, Hortonworks, Pivotal Software, IBM and other Hadoop providers come in -- they say. MapR, for example, takes open source Apache Hadoop and makes the data layer more easily accessible, Norris said. "A lot of what we built in smooths out the rough edges to support broader workloads."

Broader horizons for Hadoop

In addition, the release of Hadoop 2 in late 2013 broke the technology's dependence on the MapReduce batch-processing engine and programming framework. Now Hadoop can also run other types of applications -- stream processing and interactive querying, for example. And SQL-on-Hadoop tools designed to make using the technology easier for end users are starting to emerge -- things like Cloudera's Impala and Apache Drill, which runs SQL queries natively against multi-structured data sets.
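As a taste of what SQL-on-Hadoop looks like in practice, here's a minimal sketch using impyla, the Python client for Impala; the host name, table and columns are hypothetical:

    from impala.dbapi import connect  # impyla package

    # Impala runs SQL directly against data stored in the Hadoop cluster,
    # so there's no extract step. Connection details here are placeholders.
    conn = connect(host="impala-coordinator.example.com", port=21050)
    cursor = conn.cursor()
    cursor.execute("""
        SELECT policy_state, AVG(claim_amount) AS avg_claim
        FROM   claims_history
        WHERE  claim_year >= 2010
        GROUP  BY policy_state
        ORDER  BY avg_claim DESC
        LIMIT  10
    """)
    for state, avg_claim in cursor.fetchall():
        print(state, avg_claim)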

Solutionary Inc., a managed security services and threat intelligence provider in Omaha, Neb., uses MapR's Enterprise Database Edition distribution (formerly called M7) to speed up analysis processes on its cloud-based ActiveGuard security platform. Solutionary collects massive amounts of structured and unstructured information from its clients' networks, databases and applications; processes and stores the data in a MapR-based data lake; and then analyzes it in an effort to detect security threats.

In the Hadoop system, Solutionary's information security and threat research teams can use tools such as Drill to do what-if analysis with subsecond response times, said Scott Russmann, the company's director of software development. The data lake setup also enables Solutionary to maintain the information in the varying data structures and formats used by different clients, instead of having to force-fit it all into a single fixed schema.

"It really is an upside-down concept considering where we've been historically," Russmann said. "Traditionally, you have a database administrator who defines the data model -- and thou shall not operate outside of the data model. This changes that significantly."

But Russmann cautioned that data lakes aren't for everyone. For one thing, too much data flexibility can be a dangerous thing. "It's a huge culture change," he said. "A lot of people buy into the hype and don't think about how to structure the data. They just dump data into this thing and structure it on the fly. That can be a huge cost and burden, and you can dig yourself into a hole."

Swimming with the fishes?

One thing is clear about data lakes: If a project isn't done right, it could end up dead in the water. Not paying proper attention to issues such as security and data governance "could result in piles of information that could be breached, or from which bad decisions could be made," Gartner analyst Nick Heudecker said. In addition, he said, it takes analytics skills that are in short supply to wring tangible business value out of the information in a data lake.

Even Hadoop vendors acknowledge that it's a complicated process. "The challenge is, it's not that easy," said Matt Brandwein, director of product marketing at Cloudera. Sai Devulapalli, head of product marketing and data analytics at Pivotal, noted that data lake technology is still in the nascent stage of development and that the required technologies aren't simple to use. There also aren't a lot of examples to point to as deployment guides -- despite all the hoopla about Hadoop, surveys still show its overall adoption rate in the low double digits.


But the potential benefits are as vast as the amount of data that can be supported. At Allstate, the data lake is enabling the company's data scientists, IT team and business users to work together to use data proactively, not just reactively, to find ways to reduce costs and improve customer service.

"Efficiency is important here," Slusar said. "Having the ability to use big data to do something in a matter of hours frees us up to do so much more. Not just with the data, but with the business."


