Strata + Hadoop World 2016: Hadoop and Spark in spotlight
Picture two @WalmartLabs engineers. They are on break at a beer garden just across the street from their offices in Mountain View, Calif. They start to kick around an idea to marry some data sets, run some analytics and see if they can gain a "digital lift" -- an improved response rate from customers who visit Walmart stores and connect to the company's websites.
The engineers plan to mix online and offline transaction data with ad-impression data and see what happens. How difficult would such a big data application be to build?
Not that difficult, actually, if you've already made a concerted effort to deal with key issues in data handling in Hadoop. That's according to Jeremy King, CTO and senior vice president of global e-commerce at Walmart.com and a leader at @WalmartLabs, who told attendees at Strata + Hadoop World 2015 in New York about his team members' beer garden brainstorming. Parent company Wal-Mart Stores Inc., based in Bentonville, Ark., operates Walmart locations, Walmart.com and @WalmartLabs.
The two engineers were able to create a prototype of WMX in a few days because they and others had already built a Hadoop-based central repository with firewalls, encryption and data anonymization that, in effect, tokenizes sensitive data. That repository secured data but still allowed access. WMX is shorthand for the Walmart Exchange, which culls sales, social media and other data to inform advertising decisions for the company and its suppliers.
"We built specialized firewall clusters for [Payment Card Industry] data. It's 'anonymized.' Names, phones and email data are segregated. Other details are encrypted," said King.
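King did not describe the implementation, but the general idea of tokenizing sensitive fields at ingestion can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the field names, the `SECRET_KEY` value, the `tokenize` and `split_record` helpers, and the choice of a keyed HMAC-SHA256 hash as the token. It is a minimal sketch of the technique, not a description of Walmart's system.

```python
import hashlib
import hmac
from typing import Tuple

# Hypothetical secret; a real deployment would fetch this from a key store.
SECRET_KEY = b"example-key-not-for-production"

# Hypothetical names for the fields that must be segregated.
SENSITIVE_FIELDS = ("name", "phone", "email")


def tokenize(value: str, key: bytes = SECRET_KEY) -> str:
    """Return a deterministic keyed hash that stands in for a sensitive value."""
    normalized = value.strip().lower().encode("utf-8")
    return hmac.new(key, normalized, hashlib.sha256).hexdigest()


def split_record(record: dict) -> Tuple[dict, dict]:
    """Split a raw record into an analytics row and a segregated identity row.

    Both rows carry the same token, so they can be rejoined only by
    someone with access to the segregated identity store.
    """
    token = tokenize(record["email"])
    analytics = {k: v for k, v in record.items() if k not in SENSITIVE_FIELDS}
    analytics["customer_token"] = token
    identity = {k: record[k] for k in SENSITIVE_FIELDS if k in record}
    identity["customer_token"] = token
    return analytics, identity


raw = {"name": "Ann", "phone": "555-0100", "email": "Ann@Example.com",
       "store_id": 12, "amount": 9.99}
analytics_row, identity_row = split_record(raw)
```

In this sketch, `analytics_row` keeps the store and amount plus a token, while the name, phone and email live only in `identity_row`. Analysts can join data sets on the token without ever seeing who the customer is.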
A vote for big data applications
King and his colleagues had to build a system that satisfied the stringent requirements of a corporate legal team.
Addressing those requirements led them to intelligently manage data at the source -- as it was ingested into Hadoop. The approach King described to Strata + Hadoop World attendees was far removed from the Hadoop data lake as portrayed by vendors. Data is carefully refined before it enters the system. It includes detailed transactional data, which measures in the tens of petabytes, according to King.
"Internal customers need access to data in real time. They need to be able to do experiments, and to do those on entire data sets. If we can clean and tokenize data, we can allow people to access it," he said.
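The kind of experiment the two engineers had in mind, mixing transaction data with ad-impression data to look for a lift in response rate, can be illustrated with a small Python sketch. The function names, the use of tokenized customer IDs and the definition of lift as the relative difference in response rate between customers who saw an ad and those who did not are all assumptions for this example; the talk did not specify how Walmart measures digital lift.

```python
from typing import Iterable, Optional


def response_rate(customers: Iterable[str], purchasers: Iterable[str]) -> float:
    """Share of the given customers who also appear among purchasers."""
    group = set(customers)
    if not group:
        return 0.0
    return len(group & set(purchasers)) / len(group)


def digital_lift(all_customers: Iterable[str],
                 exposed: Iterable[str],
                 purchasers: Iterable[str]) -> Optional[float]:
    """Relative lift in response rate for ad-exposed customers.

    Returns None when the unexposed group has no purchasers, because
    the ratio is undefined in that case.
    """
    everyone = set(all_customers)
    exposed_group = set(exposed) & everyone
    unexposed_group = everyone - exposed_group
    baseline = response_rate(unexposed_group, purchasers)
    if baseline == 0:
        return None
    return response_rate(exposed_group, purchasers) / baseline - 1.0


# Hypothetical tokenized customer IDs.
customers = ["a", "b", "c", "d", "e", "f", "g", "h"]
saw_ad = ["a", "b", "c", "d"]        # from ad-impression data
bought = ["a", "b", "c", "e"]        # from online and offline transactions
lift = digital_lift(customers, saw_ad, bought)
```

With these made-up inputs, three of four exposed customers bought against one of four unexposed customers, so the sketch reports a lift of 2.0, or 200%. A real analysis would also need to control for differences between the two groups.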
The Walmart team was crystal clear on one thing: Hadoop should not be an end in itself.
"Sometimes people set up Hadoop, but they don't give anybody access to the thing," said King, pointing to one issue among a host cited as reasons behind Hadoop's demonstrably slow uptake in the enterprise. Underlying the approach Walmart took on the Hadoop infrastructure was a notion of what King called "data democracy" -- in other words, the effort to open up new data-driven applications to internal users.
King said the Hadoop effort began in earnest in 2011 -- about the time @WalmartLabs experienced something of "a reboot" -- and he admitted that Walmart had its share of missteps in the Hadoop sandbox. At one point, the growth of Hadoop clusters caused the team to stop and take a breath, he said. As a result, King and his crew defined an architecture that made cluster growth manageable.
The resulting data infrastructure has laid the groundwork for a series of big data applications that the company hopes will advance Walmart's efforts to combine in-store information with website data.
The first applications to ride the Hadoop data, said King, include:
- eReceipts, which provides customers with electronic copies of receipts;
- Savings Catcher (built atop eReceipts), which alerts customers when a Walmart store or its neighboring competitors have reduced the cost of a purchased item, and automatically sends a gift certificate for the price difference; and
- an application that maintains up-to-date maps of thousands of Walmart's store plans. The maps can tell you where a small box of toothpicks resides in a big store, King kidded.
Is this how Hadoop will happen?
Hadoop is short on one thing that often spells IT success: the killer app. Among historic killer apps that inspired new technologies were the spreadsheet ledgers driving the PC, the MCI Friends & Family app that spurred the automated, in-network price break, and the FedEx Web package tracking that pushed enterprise Web use forward.
Such killer apps represent one or two firms' incredible innovation that then quickly becomes just a cost of doing business for a whole industry. Hadoop has yet to produce anything like that.
It isn't easy to pinpoint any one Walmart killer app. But, over time, the company's deployment of point-of-sale terminal data, bar code readers, supplier network communications, radio frequency identification, data warehouses and logistics have seemed to coincide with phenomenal surges in its revenues. Walmart's handling of each of these new technologies became the blueprint for others looking to grow or just to keep up.
So, it isn't too surprising to hear King say that a steady stream of suppliers and partners ask for a tour of the @WalmartLabs operations, often to see how the retailer handles Hadoop. When they do ask, he tells them that large-scale Hadoop can be implemented by a team of as few as eight or 10 engineers. But, he also emphasizes to them that Hadoop deployments should strive for data democracy from the get-go so that the data can be accessed by the wider organization and not just a handful of data cognoscenti.
Walmart faces challenges. It gets harder to grow an operation when its revenues mount to almost $475 billion a year. Some folks would say Amazon.com, with Web services and large-scale cloud clusters, has created an e-commerce killer app aimed straight at the company atop the Fortune 500. It will be interesting to see if Walmart's coupling of Hadoop and data democracy will help it deflect such challenges.