Managing Hadoop projects: What you need to know to succeed
Big data was already familiar to Ancestry.com Chief Technology Officer Scott Sorensen long before he and his colleagues at the Provo, Utah, company began working with Apache Hadoop and other open-source data analytics tools. But applying the Hadoop framework to DNA data processing posed a new challenge, one that required a special approach to team building on Sorensen's part.
With a 10-petabyte-plus trove of hereditary data, Sorensen and Ancestry.com -- which enables people to explore their family trees online -- had been doing information retrieval on massive data sets for a long time.
Sorensen and his colleagues had built their own search engine with carefully tuned algorithms and record-linking software that could traverse both structured and unstructured site data. For Ancestry.com, which holds birth, death, census and other (sometimes historical) records, "unstructured" is almost an understatement.
The company looked to improve its information retrieval algorithms as its network of users and collection of family histories grew. For Sorensen, it was one of many curious challenges in a 12-year-plus Ancestry.com career. He and his team felt algorithms could be better honed using site visitors' navigation activity, he said.
"We hired some data scientists. We thought they'd come in and use our proprietary technology and they'd do machine learning to improve the algorithms," Sorensen told an audience at the recent Strata East/Hadoop World conference in New York. The newly hired data scientists had something else in mind, but Sorensen was not surprised at that.
They were interested in using the latest "tools of the trade" more than proprietary tools, Sorensen said.
"So, that's how we introduced Hadoop, MapReduce and R into our tool set," Sorensen said, referring respectively to the clustered data framework, the programming paradigm and the statistical programming language that have lately come to represent open-source big data technology.
Different kind of code
Ancestry.com's teams used the Hadoop framework to tune searches and to model customer churn rates predictively. In the last year and a half, the company has begun to use Hadoop and its associated HBase NoSQL columnar data store to help scale its AncestryDNA offering. This product uses autosomal DNA testing technology to expand family history research efforts.
The goal is to provide customers with information identifying ethnic groups or parts of the world that share DNA code characteristics similar to their own, and even to try to identify distant cousins -- and in turn to expand Ancestry.com's popularity.
A lot of processing is involved. About 700,000 points per DNA sample are compared with the same number of points on samples already in Ancestry.com's database. Here, the efforts of the company's Hadoop-wielding software engineering and data science teams come into play. Sorensen's teams rewrote academic algorithms to run in parallel on Hadoop and Apache HBase in order to speed up this very large data processing job. (For a deep description of the process, read "Hadoop framework breathes new life into Ancestry.com legacy tools" on sister site SearchCIO.com.)
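To make the shape of that job concrete, here is a minimal sketch in plain Python of a point-by-point comparison split into a map step and a reduce step. It is an illustration only, not Ancestry.com's algorithm: the genotype encoding (one small integer per marker), the function names and the simple "fraction of identical markers" score are all assumptions made for this example, and a real pipeline would run the steps across a Hadoop cluster rather than in one process.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical encoding: one small integer per DNA marker.
Genotype = List[int]


def map_chunk(sample_id: str, new: Genotype, stored: Genotype,
              start: int, end: int) -> Tuple[str, int]:
    """Map step: count identical markers in one slice of the two samples."""
    matches = sum(1 for a, b in zip(new[start:end], stored[start:end]) if a == b)
    return sample_id, matches


def reduce_counts(pairs: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Reduce step: sum the per-chunk match counts for each stored sample."""
    totals: Dict[str, int] = defaultdict(int)
    for sample_id, matches in pairs:
        totals[sample_id] += matches
    return dict(totals)


def compare(new: Genotype, database: Dict[str, Genotype],
            chunk_size: int) -> Dict[str, float]:
    """Return, per stored sample, the fraction of markers equal to the new one.

    Assumes every stored sample has the same number of markers as `new`.
    """
    n = len(new)
    # Each (stored sample, chunk) pair is an independent unit of work,
    # which is what makes the job easy to spread over many machines.
    mapped = (
        map_chunk(sample_id, new, stored, start, min(start + chunk_size, n))
        for sample_id, stored in database.items()
        for start in range(0, n, chunk_size)
    )
    return {sid: total / n for sid, total in reduce_counts(mapped).items()}


if __name__ == "__main__":
    new_sample = [0, 1, 2, 1, 0, 2]
    database = {"A": [0, 1, 2, 1, 0, 2], "B": [0, 0, 2, 2, 0, 1]}
    print(compare(new_sample, database, chunk_size=4))
```

Because each chunk is scored independently and the results are only summed at the end, the same structure scales from six toy markers to hundreds of thousands of points per sample.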
Scientists meet software engineers
Pulling together a team to apply the Hadoop framework and HBase for DNA data matching took some dexterity. It meant that Sorensen and other managers had to create an environment where people with science skills and people with computing skills could work well together.
The task was complicated, Sorensen said, because the scientists and the software engineers each felt somewhat capable of doing the other's job. He tells a tale of two talents with some irony: The data science people "were Ph.D.s familiar with bioinformatics, and they thought they could write code. But they'd never written production code -- scalable, conformant, maintainable," he said. "On the other hand, we had software engineers. They had an ability to do statistics and could read science papers and they took biology in high school -- so they thought they knew all they needed to know [about genetics]."
At first, the combination struggled. The data scientists' work could be too academic and hard to use, or the software engineers had not built their systems in a way that took advantage of it. So Sorensen and fellow managers decided to seat the two teams next to one another.
"When we had them sit side by side, we have a lot more success. It improved the relay of knowledge across both groups," Sorensen said. "The software engineers would build their systems in such a way that you could plug in the data science."
Up from OS/2
The move to clustered Hadoop software and R for predictive analytics and machine learning is an important leap for Ancestry.com, which had reached a plateau with earlier rules-based systems, Sorensen said. He's seen many such leaps in a career that took him from a job as an OS/2 developer for IBM to engineering manager roles at WordPerfect, Novell and CoreSoft Technologies.
It was after leaving CoreSoft -- a telephony company where he rose to be president -- that Sorensen came to Ancestry.com. It was on the heels of the 2000 technology bubble burst, a time when he was, admittedly, looking for a bit of a breather.
What he found instead was a fast-paced job at a very interesting company that had a series of technical challenges to address. In over 10 years at Ancestry.com, he moved from such positions as vice president of search and vice president of commerce to senior vice president of engineering, before being named CTO last April.
There hasn't been a specific theme to his career, Sorensen says, in response to a reporter's question, just a series of interesting challenges. "I don't get bored. Like other people, I am looking for a fascinating problem to sink my teeth into," he said.