This article can also be found in the Premium Editorial Download "Business Information: Big data technology: Beyond the trendy tools."
Zions Bancorporation gathers huge amounts of data each day -- customer details and information about online deposits and withdrawals, for example -- then feeds it all into a 1.2-petabyte-and-growing Hadoop-based repository. The records are then analyzed to uncover anomalous patterns that may indicate fraud, theft or other criminal activity.
But it takes a lot more than headline-grabbing technology like Hadoop -- the Apache Software Foundation's popular distributed processing framework -- and related software to turn vast amounts of structured and unstructured data into insight, and that insight into action.
The problem begins with big data itself. In many cases, it is in fact big -- vastly, hugely, mind-bogglingly big, as sci-fi writer Douglas Adams might put it. And it often goes beyond conventional transaction data to include system and network logs, sensor data from industrial equipment, social network posts and other text data. Then comes the challenge of spotting glimmers of useful info amid that enormous space and sprawl. It's one thing to collect big data; finding business value in it is a much bigger undertaking -- one that could make your head hurt and burn a hole in your organization's budget.
Contrary to what many technology ads would have buyers believe, achieving a great return on investment is also about creating the right team, putting a solid business strategy in place, being agile and testing -- lots and lots of testing, according to Zions and others who use Hadoop, NoSQL databases and similar tools that have come to define the burgeoning era of big data computing.
At Zions, which first launched its fraud analytics program nine years ago, achieving big data success is a moving target that requires both advanced technology and keen intellect. Finding needles of useful information in haystacks of data has become more formidable as data volumes have exploded over the past decade. But Zions' bank fraud and security analytics team has worked continually to build and refine statistical models that have repeatedly helped bank executives predict, identify, evaluate and -- when necessary -- react to suspicious activity.
"People see all the advertisements and think big data can even clean your house for you," said Michael Fowkes, senior vice president for fraud prevention and security analytics at Zions. "But I believe that we've had success because we've approached this as a team."
Building a winning big data team
Salt Lake City-based Zions owns banks and financial services firms and uses an open source Hadoop package provided by MapR Technologies -- though according to Fowkes, the company's experience with data warehouse appliances and other, more traditional tools for crunching large and complex data sets goes back much further. Zions uses Hadoop primarily as a data store for server, database, antivirus and firewall logs and transaction data related to online banking systems, wire systems and customer databases.
Fowkes believes that building the right team is the key to turning a morass of information into clear insights that can be acted upon. At Zions, a small squad of data scientists was assembled and tasked with building algorithms and statistical analyses that help Fowkes' security crew discover unusual trends or outliers in the data that point to criminality.
The data scientists also work to cancel out the noise -- or the useless data -- that usually exists in large and complex data sets of varying types. "Big data equals big noise," he said. "The data science folks filter out all of that stuff to figure out what is really interesting."
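To make the idea of separating outliers from noise concrete, here is a minimal sketch in Python. It is not Zions' method, which the bank has not described in detail. It assumes a simple z-score rule, and the function name `flag_outliers`, the threshold of 3.0 and the sample amounts are all hypothetical, chosen only for illustration.

```python
from statistics import mean, stdev

def flag_outliers(amounts, threshold=3.0):
    """Return the positions of values that sit more than `threshold`
    sample standard deviations away from the mean.

    Hypothetical illustration only: real fraud models are far richer.
    """
    if len(amounts) < 2:
        return []  # not enough data to estimate spread
    mu = mean(amounts)
    sigma = stdev(amounts)
    if sigma == 0:
        return []  # every value is identical, so nothing stands out
    return [i for i, x in enumerate(amounts)
            if abs(x - mu) / sigma > threshold]

# Made-up wire transfer amounts: twenty routine ones and one very large one.
transfers = [100.0] * 20 + [5000.0]
suspicious = flag_outliers(transfers)
print(suspicious)
```

A rule this simple would flag only the single large transfer in the made-up data. In practice a team would likely combine many such signals and tune them against known cases, which is where the statistical expertise Fowkes describes comes in.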
But putting together a data science team was no easy task. The company wanted to begin small and expand the program gradually, building on successes over time. That meant resources had to be allocated wisely, so many open slots were filled with members of the Zions security crew. The bank also made additional investments to increase its data analytics skills (see "Big data systems shine light on neglected 'dark data' "). Then Zions recruited data scientists with backgrounds in statistics and advanced mathematical modeling.
A big, big market
As pools of sensor, social media and other Web data continue to expand, so does the big data product market, so businesses need to remain flexible and open to new possibilities. Zions has no immediate plans to move off of MapR but continues to keep pace with new developments in the market just the same.
"With Hadoop specifically we constantly take a look at what is out there and what is available," Fowkes said. "We're not against swapping out the technology stack we're using if there is a compelling reason to do so."
Organizations are showing growing interest in big data technologies, according to Gartner. The industries with the highest adoption are media and communications, banking, service and education.
MapR's main competitors in the big data technology market are Cloudera and Hortonworks, which also offer commercial distributions of the open source Hadoop framework. But customers can expect more vendors to throw their hats into the Hadoop ring as the market evolves. One that already has is microprocessor giant Intel, which surprised customers recently with its own Hadoop distribution. Storage vendor EMC has also released a commercial Hadoop distribution.
IT industry analysts and Hadoop users say they expect the technology to grow even more popular as related software tools like Hive -- Apache's data warehousing application that's used to query Hadoop data stores -- start to look more and more like traditional SQL-based data management tools. At Zions, the advent of Hive made a major difference to the security team's operations.
"I wouldn't say the learning curve was real steep, especially because we decided to use Hive," Fowkes explained. "The previous systems we've had used a SQL-like front end and that is what Hive gives you -- relational database type of access to Hadoop and big data."
Put to the test
Boxed Ice is another company that's achieving big data success with specialized technology -- in this case a NoSQL database called MongoDB. Headquartered in London, Boxed Ice offers a hosted software product called Server Density that monitors the health of cloud computing deployments, servers and websites for about 1,000 clients around the globe -- a task that requires copious amounts of data processing.
"We monitor quite a few websites and servers for customers like EA, Intel and The New York Times," said David Mytton, Boxed Ice's CEO and founder. "We are processing about 12 terabytes of data every month with MongoDB, and that equates to about 1 billion documents each month."
The decision to use big data technology has given Boxed Ice's small staff the freedom to focus on what it does best -- troubleshoot servers and websites -- without having to worry so much about storage capacity and data processing issues that take up valuable time.
"We don't really have to think much about how much data we're now storing. We can just put it all into MongoDB," Mytton said. "For the most part, MongoDB handles it, as long as we understand the data and create indexes to make sure that queries are fast and that we've got the necessary capacity. It means that we can concentrate on building our product without having to deal with operational issues."
Mytton originally built Server Density on top of the open source MySQL relational database but switched to MongoDB in 2009 when data volumes handled by the monitoring service grew unwieldy. MySQL simply couldn't keep up. The problem worsened when Boxed Ice began deploying its software on multiple servers.
"[The data] was all being pressed into MySQL and we were having problems, particularly with replication," Mytton said. "We were at an early stage and we wanted to make sure that if one of our servers failed we could continue to provide service, and it was really difficult to get that set up with MySQL."
As with many big data technologies, MongoDB is still a new product, and though it has some high-profile customers -- MTV Networks, Craigslist and Foursquare among them -- its user base is relatively small. Getting it to work properly was largely a matter of testing, tuning, keeping up with documentation and paying attention to what the open source community has to say.
"We had quite a few problems in the first year and a half or so using MongoDB, whether that was just bugs or general problems, [and] we had to work with the [10gen] engineers to get these things fixed," said Mytton, who added that 10gen has been putting a great deal of effort into the task of creating detailed documentation for the database. Since Version 1.8 of MongoDB was released in March 2011, "everything has been incredibly stable and we've had very few problems at all."
Mytton said organizations evaluating big data technologies should remember to test them in conjunction with their own data sets or applications, as opposed to using some arbitrary data set or app a salesperson has on hand.
"There are quite a few different benchmarks available that [use data] that is interesting from an academic point of view if you want to compare versions. But it doesn't really give you any kind of real-world benchmark," Mytton said. "It's important to run the queries that you're going to expect [in the real world]."
A big selling point
With the right team and strategy in place, big data technologies like Hadoop, Hive, Pig, Cassandra, Mahout and others can open up a world of predictive possibilities for companies, according to Jeffrey Kelly, an analyst at Wikibon, a Boston-based IT research and advisory group. But one common reason to get started with Hadoop has as much to do with inexpensive data storage as improved analytics capabilities.
IT professionals seeking a solid business case for Hadoop, for example, could start by demonstrating to company decision makers that the technologies that make up the open source big data ecosystem can lead to huge savings on data storage and analysis.
Hadoop and other open source big data implementations offer a far less expensive alternative to traditional, proprietary data warehouses provided by large software companies like Oracle, IBM, Teradata, Microsoft, SAP and EMC. As a result, Kelly said, a growing number of organizations today are "offloading," keeping only recent transactional data in a data warehouse and moving the rest to a Hadoop cluster.
The benefits of offloading are plentiful, said Kelly, who focuses on big data and business analytics market research. For starters, offloading helps organizations stabilize the amount of money spent each year on data warehouse capacity, licensing and support. That's because the amount of data stored in the traditional warehouse remains the same over time.
"With data volumes growing, you're going to have to invest more and more into your Oracle data warehouse or your Teradata data warehouse and that is quite expensive," Kelly said. "So, what some early adopters are doing is leaving, for example, the most recent six months of data in a Teradata or Oracle data warehouse, and everything older than that is offloaded into Hadoop."
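The offloading rule Kelly describes can be sketched in a few lines of Python. This is an illustration under stated assumptions, not any vendor's tooling: the function name `split_for_offload`, the `date` field on each record and the 183-day stand-in for "six months" are all hypothetical.

```python
from datetime import date, timedelta

def split_for_offload(records, today, keep_days=183):
    """Split records into those that stay in the warehouse and those
    that move to the cheaper store, based on a rolling date cutoff.

    `keep_days=183` approximates six months (an assumption).
    """
    cutoff = today - timedelta(days=keep_days)
    keep = [r for r in records if r["date"] >= cutoff]
    offload = [r for r in records if r["date"] < cutoff]
    return keep, offload

# Made-up records for illustration.
records = [
    {"id": 1, "date": date(2013, 5, 1)},   # recent: stays in the warehouse
    {"id": 2, "date": date(2012, 1, 15)},  # old: moves to Hadoop
]
warehouse, hadoop = split_for_offload(records, today=date(2013, 6, 30))
```

Because the cutoff rolls forward with the calendar, the warehouse holds roughly the same window of data from month to month, which is why its capacity costs stay flat while the low-cost store absorbs the growth.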
At Zions, the decision to use Hadoop has given the company much more than a centralized location for conducting data forensics, predictive analytics and risk management activities -- it has also led to significantly lower storage and capacity planning costs, and a big boost in analytics speeds.
"Hadoop gives us a place where we can store data at a reasonable cost, and some of these data sets that we have are quite large," Fowkes said. "With other technologies you could still report against them but it might take you hours up to a day to get a result set back. With a reasonably sized Hadoop cluster you can get an answer back maybe in 20 minutes."