Web intelligence provider comScore Inc. has made some big changes to its "big data" analytics operation.
The Reston, Va.-based firm moved its big data environment from Cloudera Inc.'s distribution of Apache Hadoop to a competing version offered by MapR Technologies Inc. Hadoop is an open source framework for quickly processing large data sets across clusters of computers.
ComScore, which boasts over 1,000 in-house servers and continues to use Cloudera for training purposes, said the decision to make the switch was based largely on cost considerations and the fact that MapR offers support for the Network File System (NFS) protocol.
"We could capitalize the purchase [of MapR] with an annual maintenance charge versus a yearly cost per node," said Mike Brown, ComScore's chief technology officer. "NFS allowed our enterprise systems to easily access the data in the cluster."
Growing data stores prompt comScore to take action
ComScore monitors and measures the behavior of online shoppers. The company keeps track of more than 2 million consumers who have given comScore permission to monitor and analyze how they shop and surf online. By analyzing consumer behavior, comScore is able to provide advertisers with valuable intelligence on how to target marketing campaigns and reach their desired demographic.
"The vast majority of Internet advertising is planned, bought and sold using our tools," said Brown. "We suggest the best sites for [advertisers] to use for each campaign."
But the job of keeping track of more than 2 million consumers and gaining insights about their behavior means that comScore must process huge volumes of data on a daily basis. According to Brown, the company currently manages over a petabyte of information and counting.
The unrelenting growth in the amount of data under management prompted the company to begin working with Hadoop back in 2009. ComScore made the switch from Cloudera to MapR last July.
With the global economy limping along, many IT professionals have reported that getting executive buy-in, approval and funding for data management projects is a big challenge. But this wasn’t a problem in the case of comScore’s MapR implementation, according to Will Duckworth, the company’s vice president of core processing.
"I think everybody recognized that we are dealing with huge volumes," Duckworth said. "We were able to [easily] make the case."
ComScore went live with MapR last July after a relatively painless implementation process that took about two days, according to Duckworth. At the time, the company had a Cloudera production cluster up and running and wanted to complete the migration to MapR with little or no downtime. ComScore's IT team accomplished its goal by simply copying and reloading the Cloudera-based data into the new MapR cluster in one fell swoop.
"If we had to do that again at this point, we probably wouldn't reload the data because [the data stores are] much larger now," Duckworth said. "We would probably instead take a rolling approach where we take maybe 25% of the machines out and convert them over to MapR, copy some of the data over, then take another 25% and move them over that way."
Duckworth and Brown particularly like MapR's Direct Access NFS feature, which exposes Hadoop Distributed File System (HDFS) data as NFS files that can then be easily mounted, modified or overwritten.
"HDFS is great internally, but to get data in and out of Hadoop, you have to do some kind of HDFS export," Brown said. "With MapR, you can just mount [HDFS] as NFS and then use native tools whether they’re in Windows, Unix, Linux or whatever."
Sorting software increases the speed of data preparation
The MapR Hadoop distribution has helped comScore significantly speed up big data management operations, but it isn’t the only piece of the company’s high performance computing puzzle.
ComScore also uses Sybase IQ, the high-speed analytic database from business applications giant SAP AG, to power its Customer Knowledge Platform (CKP), a data warehouse that provides users with insights into consumer behavior on the Internet.
According to comScore, the CKP service monitors the activity of more than 1 million consumers, and the Sybase IQ-powered data warehouse currently holds more than 40 terabytes of compressed information.
Additionally, comScore is running data integration and sorting software from Woodcliff Lake, N.J.-based Syncsort Inc. to accelerate Hadoop processing. The company went live with Syncsort in 2009 and recently upgraded to Syncsort DMExpress 6.5, the latest version of the software, which adds support for Hadoop.
DMExpress helps comScore compress the flow of incoming data by aggregating repetitive strings of information before loading them into MapR for further processing and analysis. According to Brown, comScore has directly embedded Syncsort with 25-30 business applications in an effort to increase the efficiency of the data preparation process.
“We brought Syncsort in to help solve the problem of sorting because our volume of data was rapidly increasing,” Brown said. “[Syncsort’s] compression algorithms look for repeating strings and in sorting the data it pulls those repeating things together and therefore increases your compression ratio.”
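The effect Brown describes, where sorting pulls repeated strings together and improves the compression ratio, can be observed with any general-purpose compressor. The sketch below uses Python's `zlib` as a stand-in; it is not Syncsort's algorithm, and the synthetic records are invented purely for illustration.

```python
import random
import zlib

# Synthetic log-like records: 200 distinct lines, each repeated 50 times.
# The record format is made up for this example.
distinct = [f"site={i:03d}.example.com;action=view\n" for i in range(200)]
records = distinct * 50

# Shuffle with a fixed seed so the repeats are scattered reproducibly.
shuffled = records[:]
random.Random(42).shuffle(shuffled)

# Sorting places identical records next to each other.
ordered = sorted(shuffled)

raw_size = len("".join(shuffled).encode())
shuffled_size = len(zlib.compress("".join(shuffled).encode()))
sorted_size = len(zlib.compress("".join(ordered).encode()))

print(f"raw bytes:           {raw_size}")
print(f"compressed shuffled: {shuffled_size}")
print(f"compressed sorted:   {sorted_size}")
```

Both inputs contain exactly the same records, so any difference in compressed size comes from ordering alone.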
Big data analytics best practices revealed
Organizations mulling the possibility of a big data initiative should remember to plan for exponential growth – because the big data explosion shows no signs of slowing down, according to Brown.
Companies interested in data sorting software should look for products that are easily implemented and that fully leverage existing hardware, he added.
“That kind of technology can help accelerate a lot of your [systems],” Brown explained. “But a big thing that people don’t think about is how easily you can plug the software into existing applications.”