Today, masses of information from the Web are meeting up with data processing platforms that let users store and manage data inexpensively -- Hadoop, for example. It's a potent combination that has led companies such as TrueCar Inc. to pursue new data-driven business models specially tuned to exploit the full breadth of the information assets available to them.
John Williams, senior vice president of technology at TrueCar, recounted the question posed internally: "We asked, 'When data becomes very cheap to store, what does that mean for the business?'" The company, based in Santa Monica, Calif., has built an e-commerce site to match car buyers and sellers. That kind of application has been seen since the first days of Web-based commerce. What's new is an emphasis on transforming information that arises as a by-product of regular business activity (sometimes called "exhaust data") into profitable products or services through the use of a Hadoop system and related tools from Hortonworks. The company continues to field enterprise data warehouses, sure, but Hadoop is where much of the innovation comes from.
"The idea is that you can instrument your business and capture side-effect data that may yield new products," Williams said. "That can apply to every business everywhere. The new product may even become the biggest part of a business."
For example, TrueCar places a special focus on analyzing data about the activity of car buyers and prospects on its website. With its big data architecture, the company is able to mine that data and then advise the car dealers who work with it on best practices in pricing and Web sales techniques.
Taming messy data brings business benefits
Because it stores data cheaply and supports a variety of tools for downstream analytics, the Hadoop distributed processing platform has been a big part of TrueCar's big data push. "My job," Williams said, "is to use technology to give TrueCar a business advantage. In that capacity, I'm looking at things like Hadoop and realizing it's more than just some cool infrastructure thing. Really diving in is going to be a big advantage down the road."
Open source Hadoop came into play at TrueCar in great part because Williams and others there saw shortcomings in established methods for analyzing large amounts of quickly accumulating data. A principal issue was that the car data the company collects from the Web is highly varied in its structure.
"The data is very messy," Williams said. "That is where existing infrastructure started to run into limits. We had to get good in the way we ingested messy data that was sometimes structured, sometimes not."
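To make the ingestion problem Williams describes concrete, here is a minimal Python sketch of sorting incoming lines into structured, semi-structured and unstructured records. It is an illustration only, not TrueCar's pipeline: the function name `parse_record`, the three record kinds and the `key=value; key=value` convention are all assumptions made for the example.

```python
import json


def parse_record(line):
    """Classify one raw input line and return (kind, payload).

    Hypothetical helper: the kinds and formats are assumptions for
    illustration, not a description of any real TrueCar feed.
    """
    text = line.strip()
    if not text:
        return ("empty", None)

    # Structured case: the line is a JSON object.
    if text.startswith("{"):
        try:
            obj = json.loads(text)
            if isinstance(obj, dict):
                return ("json", obj)
        except json.JSONDecodeError:
            pass  # Looked like JSON but was not; try the next format.

    # Semi-structured case: key=value pairs separated by semicolons.
    parts = [p for p in text.split(";") if p.strip()]
    if parts and all("=" in p for p in parts):
        fields = {}
        for part in parts:
            key, value = part.split("=", 1)
            fields[key.strip()] = value.strip()
        return ("keyvalue", fields)

    # Unstructured case: keep the raw text for later analysis.
    return ("raw", text)
```

Nothing is discarded in this sketch: lines that match no known format are kept as raw text, which fits the article's point that storage is cheap enough to hold data before anyone knows how it will be used.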
An early example of TrueCar's use of Hadoop is somewhat mundane. The company, needing to resize millions of car photos every day, found a simple Hadoop job could accomplish the task quickly due to the platform's powerful parallel computing capabilities.
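The article does not say how that resize job was written. As a rough sketch of why the task parallelizes so well, the Python below processes each photo record independently, in the style of a Hadoop Streaming mapper. The tab-separated input layout (`path`, `width`, `height`), the 640-pixel limit and the function names are assumptions for illustration; the pixel-level resizing itself is left out.

```python
def target_size(width, height, max_side=640):
    """Return (new_width, new_height) that fit within max_side,
    preserving the aspect ratio. Small images are left unchanged."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    new_width = max(1, round(width * max_side / longest))
    new_height = max(1, round(height * max_side / longest))
    return new_width, new_height


def map_line(line, max_side=640):
    """Map one 'path<TAB>width<TAB>height' record to
    'path<TAB>new_width<TAB>new_height'. The layout is hypothetical."""
    path, width, height = line.rstrip("\n").split("\t")
    new_width, new_height = target_size(int(width), int(height), max_side)
    return f"{path}\t{new_width}\t{new_height}"


def run_mapper(lines, max_side=640):
    """Apply map_line to every non-blank input line."""
    for line in lines:
        if line.strip():
            yield map_line(line, max_side)
```

Because no record depends on any other, the input can be split across as many machines as are available. Under Hadoop Streaming, `run_mapper` would typically be fed lines from standard input on each node.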
But further uses of the technology were on the horizon for TrueCar, which went on to create a Hadoop data lake that currently holds information on vehicles, transactions, registrations, buying behavior and more. Hadoop "is equivalent to an operating system," Williams said. "When you think of it that way, it opens your mind."
Some bumps in the Hadoop road
The road has been a bit bumpy, though. Williams said Hadoop is still an early-stage technology that needs additional features for many enterprise application uses. For example, TrueCar offers "white label" versions of its car-selling platform to banks and other business partners that, in turn, present it to customers with their own logos. "The data in those products requires special handling of a kind not presently supported in Hadoop," he said. "Today, we don't put any of that data into Hadoop."
For that reason and others, TrueCar's software teams are keeping a close eye on the development of new Hadoop features that will help track metadata and enhance data security. Hadoop data will work for banks and financial institutions only if it meets regulatory compliance requirements and privacy laws.
Williams is less interested in some widely touted Hadoop additions. While acknowledging the importance of SQL support, he said progress in adding it to Hadoop shouldn't obscure the Hadoop development work that can already be done in the Java and Python languages. He asserted that common SQL efforts, requiring upfront schema development and, often, laborious schema revisions later on, can slow business innovation. Programming languages are "an area where we are in a pivot," he said.
TrueCar's data analysts are employing SQL to dig into data in the Hadoop Distributed File System (HDFS) in order to pursue traditional business intelligence questions, Williams noted. They may use Tableau's BI tools together with Hadoop-related technologies, such as Hive or Stinger, to bring SQL queries to Hadoop data. SQL-on-Hadoop approaches of that sort can lead to valuable insights, he said, but they aren't the only way to take advantage of the big data framework.
"What we are starting to see as well," Williams said, "is something that is truly groundbreaking: Because [HDFS] is one gigantic file system and all the data is in one place, interesting things become possible." He described "open-ended correlation hunting" as one data analytics stratagem that can be applied against raw Hadoop data using advanced algorithms. By doing so, TrueCar discovered surprising correlations between the size of the Web caches that support user sessions and sales performance.
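Williams did not detail the algorithms involved. One simple reading of "open-ended correlation hunting" is to compute a correlation for every pair of available metrics and rank the pairs by strength, without deciding in advance which ones should matter. The Python sketch below does that with the Pearson coefficient; the function names and the choice of Pearson are assumptions for illustration, not TrueCar's method.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences.
    Returns None when either sequence has zero variance."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / sqrt(var_x * var_y)


def hunt_correlations(columns):
    """Score every pair of named columns and return a list of
    (name_a, name_b, r) sorted by |r|, strongest first."""
    results = []
    for name_a, name_b in combinations(columns, 2):
        r = pearson(columns[name_a], columns[name_b])
        if r is not None:
            results.append((name_a, name_b, r))
    results.sort(key=lambda item: abs(item[2]), reverse=True)
    return results
```

A strong score from a scan like this is only a lead to investigate. With many columns, some pairs will correlate by chance, so any surprising result would need to be checked against fresh data before it shapes a business decision.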
That's another example of how the data being analyzed by TrueCar includes material that organizations used to throw away or ignore, also known as "dark data." In many companies, log files holding information such as website and server data were once largely flotsam. Now they're jewels for people like Williams who are using them and other previously untapped data sources to help shape and support data-driven business operations.