For nearly two decades, Manny Puentes has held a ringside seat as new open source software took on established data management methods. These days, he's applying the lessons he has learned to the field of online video advertising through a variety of open source big data technologies.
"Open source has really helped push innovation," said Puentes, who is CTO at Denver-based Altitude Digital. There, he leads teams using Hadoop, Spark, Kafka and other big data frameworks to power a software platform for placing video ads on media websites.
The open source tools have proven to be especially useful as big data management building blocks, according to Puentes. "They break the problem down -- divide and conquer," he said. That can also vastly reduce the amount of code developers have to write to "make big data actionable," he added, noting that Altitude's systems need to quickly process and parse large volumes of data about interactions with ads in order to serve up the most relevant ones to individual website users.
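The "divide and conquer" approach Puentes describes is the core of the MapReduce model behind early Hadoop: split the data across workers, compute partial results in parallel, then merge them. A toy sketch of that idea in plain Python (the event records and field names here are purely illustrative, not Altitude's actual data):

```python
from collections import Counter
from functools import reduce

# Hypothetical ad-interaction events; the fields are illustrative only.
events = [
    {"user": "u1", "ad": "sports"}, {"user": "u2", "ad": "news"},
    {"user": "u1", "ad": "sports"}, {"user": "u3", "ad": "sports"},
]

# Map step: each "worker" turns its slice of the events into partial counts.
def map_partition(partition):
    return Counter(e["ad"] for e in partition)

# Simulate splitting the data across two workers.
partials = [map_partition(events[:2]), map_partition(events[2:])]

# Reduce step: merge the partial counts into one result.
totals = reduce(lambda a, b: a + b, partials)
print(totals["sports"])  # 3
```

A framework like Hadoop handles the splitting, scheduling and merging automatically across a cluster, which is why developers end up writing far less of this plumbing themselves.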
From summary info to full big data view
In earlier days, processing limitations and high system costs often meant working with summary data, which could compromise analytics accuracy. Now, according to Puentes, the strategy at Altitude is to work with more complete data sets in an effort "to get very good about making predictions on serving the right ad at the right time to the right person." To help make that happen, the company in April expanded an existing Hadoop cluster from 30 compute nodes to 50 and deployed the Apache Spark processing engine alongside it to add data streaming capabilities and accelerate analytical queries.
Puentes came to Altitude last year from LinkSmart Inc., a maker of a real-time ad bidding engine that was acquired by another startup in November 2014. Before that, at Federated Media Publishing LLC and Lijit Networks Inc. -- which Federated acquired and later spun off -- Puentes led an engineering team that built and ran a system infrastructure supporting real-time ad bidding and delivery.
At each stop along the way, Puentes has made increasing use of Hadoop and related big data software technologies.
"I started as a software engineer, writing the code myself. In those days it was a combination of Java and SQL. Then I started working with Hive, MapReduce and Cascading," he said, referring respectively to open-source data warehouse software -- the programming and batch processing framework tied to the initial versions of Hadoop -- and an application development platform and software abstraction layer for Hadoop.
Working with open source software tools
Open source big data software continues to evolve rapidly, with new functionality being added to fill holes and emerging technologies such as Spark vying for user mindshare as more established ones like Hadoop continue to mature. Over time, Puentes has developed some methods to cope with the fast pace of development and accompanying technical gaps. His first step is to set up sandbox environments where failures can be tolerated without affecting production systems and business operations.
"From an engineering perspective, it's true that there are a lot of cool tools," he said. "But we try to put those things through an incubation period. I'm always trying different technologies, if they make sense. But first you have to play with them."
Another set of requirements comes into play later on, when a stricter operational view is needed. At that point, Puentes said, IT leaders have to look at tools in a different light, evaluating their potential role in what he calls a working "data pipeline." That means organizations deploying big data platforms should consider things like redundancy, added Puentes, who said it's particularly important to ask how fast systems can recover if there's a problem with the software.
"When you work with big data, you sometimes have processes that can take hours, if not days, to run," he said. If those processes fail, IT teams should know whether they have to go back to square one on jobs, and how such outages affect other processes.
As a result, relatively stable big data tools that provide some level of fault tolerance find favor with Puentes, who has used the Hadoop distribution and related software from MapR Technologies Inc. at Altitude and in his previous jobs. "It's nice to be able to get data out of a store quickly, but it's a whole other level of mindfulness by vendors to make sure the products are operationally efficient," he said.
Put big data applications in context
Puentes thinks business context is also important for technology implementations in big data environments. To help provide it, he makes a conscious effort to ensure that development team members understand Altitude's overarching business objectives. You might liken it to making sure that everyone is on the bus -- or, in this case, the Winnebago -- when it comes to how the specific analytics services they're working on fit into the company's wider strategy.
"It's an analogy," Puentes explained. "I say, 'It's time for the Winnebago meeting,' and we make a virtual road trip. We go through every product and explain why it's important to the company, to the market." From that, he continued, developers can get a perspective on where they stand within a bigger picture.
"I'm big into nurturing this, because I find everyone around the table ends up covering everyone else's shortcomings," Puentes said. That creates a better project, he added, and also promotes the notion of big data as being core to the business.