Google's big data infrastructure: Don't try this at home?

Google's Jeromy Carriere spoke about the search engine giant's big data infrastructure at a recent TDWI meeting. Should it influence other efforts?

While Google Inc.'s homebrewed data infrastructure software is not exactly "Hadoop," it influenced the creation of the Hadoop platform. The company has long been known for its army of programmer wizards and its talent for creating distributed programs.

Some of Google's data workings were on display at a recent Boston Chapter meeting of TDWI, held at Boston Children's Hospital. The discussion was led by Jeromy Carriere, a technical lead at Google.

Carriere started out with a few fun facts. Big data processing at Google is, well, big. For example, the company has reported sorting more than a petabyte of records in a bit over six hours. That job ran on 8,000 computers, and a few disk drives were probably killed during the experiment.

Google also had to build infrastructure management tools to support large-scale data pipelines, according to Carriere. Writing MapReduce programs was the easy part. "What is hard is versioning, deployment and configuration," he said. To handle those tasks and many others, the company employs a large cadre of system engineers. If infrastructure is plumbing, the engineers might be called "plumbers of the future."
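Carriere's point is easier to see with a toy example. The sketch below is a hypothetical, single-machine word count written in the MapReduce style -- a map step emits key-value pairs and a reduce step aggregates them by key. It is purely illustrative and stands in for what a real job would do across thousands of machines; none of the names come from Google's codebase.

```python
from collections import defaultdict


def map_phase(documents):
    """Map step: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.lower().split():
            yield (word, 1)


def reduce_phase(pairs):
    """Reduce step: group pairs by word and sum the counts."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)


docs = ["big data at google", "big pipelines need big plumbing"]
print(reduce_phase(map_phase(docs)))
# {'big': 3, 'data': 1, 'at': 1, 'google': 1, 'pipelines': 1, 'need': 1, 'plumbing': 1}
```

The logic really is the easy part. Everything Carriere calls hard -- versioning this code across releases, deploying it to thousands of machines, configuring and restarting failed workers -- is exactly what this toy omits.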

Should we leave the plumbing to the plumbers?

Google's system engineers are typically well versed in system administration issues, according to Carriere. "We don't have a wall," he said, referring to the often cited adage about developers "throwing software over the wall" for sys admins to fix.

Google's relationship to open source software is unique. It's a build-don't-buy house, but it is better known for publishing influential technical papers than it is for formally supporting open source efforts.

Google's technical papers about Google File System (GFS) and Google MapReduce formed the basis for open source Apache Hadoop Distributed File System (HDFS) and MapReduce -- two major building blocks of Hadoop. Additionally, Google's BigTable led the way to Apache Hadoop HBase.

Behind much of Google's work is a drive to improve on existing relational data warehouse approaches and apply those processes to distributed environments. It is a pretty amazing effort.

Working in the data analytics vineyard

Google has been working in the data analytics vineyard for a long time -- and there is a great deal of interest in how the well-known company manages the information it collects. Google has an exceptional -- and well-funded -- data developer culture, and its ability to create distributed data architectures outpaces that of most enterprises.

"It made a lot of sense for the Googles of the world to invent their own software to handle their volume of unstructured data," said Rick Sherman, founder of consultancy Athena IT Solutions in Maynard, Mass., and a TDWI Boston Chapter leader who also was at the event. But Sherman warned that the skills needed to run a Hadoop-style infrastructure are difficult to find. "If it is difficult for 'the Googles,' how does that bode for the mere mortals of the world?" he asked.

In the future, Sherman said, those skills may be found in the cloud. That will give companies an opportunity to leave the plumbing to the plumbers.

Jack Vaughan is SearchDataManagement's news and site editor. Email him at [email protected], and follow us on Twitter: @sDataManagement.
