A newly revamped approach to dealing with huge volumes of data is leading one clinical trial support firm to find previously hidden revenue streams. But the columnar database software and hardware project that got them there didn’t exactly go smoothly, according to one company official.
Provisio Inc. is the creator of iTrials, a data and design service created to help doctors and pharmaceutical firms find the right patients to take part in clinical trials. Not long ago, Provisio had a growing data management problem, according to Sean Harrison, the company’s chief security officer and senior information architect.
“The database was becoming insane,” Harrison said. “If I count all of my data sources across the board, we were approaching all of the health history for almost 70 million Americans. That’s just a whole lot of data.”
At the time, Provisio’s database was distributed across several Microsoft SQL Server clusters. At last count, Harrison recalled, the whole setup consisted of about 230,000 database tables.
“It is the largest, to my knowledge, privately owned collection of healthcare data anywhere,” he said.
Provisio wanted new database software that would help the company do its job faster – and its job is running data-intensive queries for clients. If a pharmaceutical company wants to find out how many people in a specific area have diabetes, the iTrials system can search through its vast supply of anonymous medical data and deliver the information. Those results can then be further refined based on additional cross-referencing and search criteria.
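The count-then-refine workflow described above can be sketched in miniature. The table and column names below are invented for illustration (Provisio’s actual schema is not public), and SQLite stands in for the production engine:

```python
# Hypothetical sketch of a count-then-refine query flow, using sqlite3
# as a stand-in engine. The table and column names (patients, state,
# diagnosis, age) are invented; Provisio's real schema is not public.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE patients (id INTEGER, state TEXT, diagnosis TEXT, age INTEGER)"
)
conn.executemany(
    "INSERT INTO patients VALUES (?, ?, ?, ?)",
    [(1, "OH", "diabetes", 54), (2, "OH", "diabetes", 67),
     (3, "OH", "asthma", 41), (4, "PA", "diabetes", 59)],
)

# Step 1: the broad question -- how many people in a given area have diabetes?
broad = conn.execute(
    "SELECT COUNT(*) FROM patients "
    "WHERE state = 'OH' AND diagnosis = 'diabetes'"
).fetchone()[0]

# Step 2: refine the result set with additional criteria, e.g. an age range.
refined = conn.execute(
    "SELECT COUNT(*) FROM patients "
    "WHERE state = 'OH' AND diagnosis = 'diabetes' "
    "AND age BETWEEN 60 AND 75"
).fetchone()[0]

print(broad, refined)  # 2 1
```

At iTrials scale the same shape of query runs against tens of millions of anonymized records, which is where a purpose-built analytic engine earns its keep.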
Evaluation leads to columnar database software
Harrison evaluated a slew of relational database products and related tools from Oracle, Teradata and ParAccel. He also re-evaluated Microsoft SQL Server offerings.
“Microsoft claimed that their new platform, properly configured, was thousands of times faster than the current implementation that we had,” he said. “But the only guys who would actually run a test query for me were [Teradata] and ParAccel.”
Harrison said Teradata was fast and easy to use, but he thought the company lacked a decent selection of development tools to go along with the product.
“At the same time, ParAccel has a lot of those same limitations. There aren’t a lot of tools yet,” he said. “But ParAccel makes up for that by having a very active support and a very active development community.”
Provisio chose the ParAccel Analytic Database and went live just under two months ago with the software running on five distributed nodes. As a result, the company was able to migrate its data from 230,000 tables down to 12, and the speed with which it runs queries has increased significantly.
“ParAccel really seems to excel at very straight-ahead queries,” Harrison explained. “Our queries don’t have a lot of joins and don’t have a lot of fancy table structures and things going on. They’re very straight-ahead queries, but they’re run against very, very big sets of data.”
Provisio has had success with the product so far, but Harrison said he would be happier if ParAccel offered more development tools. The ParAccel Analytic Database runs on Linux, and Harrison said he frequently finds himself seeking management and development tools from various open source software vendors.
“For writing raw SQL, I can either do it from the command line or I can use WinSQL or another open source utility,” he said. “All these tools I would like to see come from ParAccel.”
ParAccel is a columnar database that was designed to return queries much faster than traditional, row-based relational databases, said Bala Narasimhan, senior product manager at ParAccel. He added that ParAccel has many partnerships with both open source and proprietary database software tool vendors.
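The storage idea behind that speed claim can be illustrated with a toy sketch. In a row store, scanning one attribute means walking every field of every record; in a column store, each attribute sits in its own contiguous array, so an analytic scan touches only the data it needs. This is a conceptual example in plain Python, not how ParAccel itself is implemented:

```python
# Toy illustration of row-oriented vs. column-oriented layouts.
# A conceptual sketch only -- not ParAccel's actual implementation.

# Row store: each record's fields are stored together.
rows = [
    {"id": 1, "state": "OH", "age": 54},
    {"id": 2, "state": "PA", "age": 67},
    {"id": 3, "state": "OH", "age": 41},
]

# Column store: each attribute is stored contiguously.
columns = {
    "id": [1, 2, 3],
    "state": ["OH", "PA", "OH"],
    "age": [54, 67, 41],
}

# An aggregate over one attribute: the row store must walk whole
# records, while the column store reads a single contiguous array.
# That difference is why analytic scans over very large tables
# tend to favor the columnar layout.
avg_age_rows = sum(r["age"] for r in rows) / len(rows)
avg_age_cols = sum(columns["age"]) / len(columns["age"])

print(avg_age_rows == avg_age_cols)  # True: same answer, less data touched
```

For queries like Provisio’s, which aggregate a few columns across enormous tables, the columnar layout means the engine can skip the bulk of each record entirely.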
“We work today with all the existing BI tools and ETL tools,” Narasimhan said. “And in our upcoming release, you will see many more partnerships announced around analytic [software] and visualization vendors.”
Columnar database implementation presented some problems
Harrison said the hardware appliances for the project came preconfigured from HP based on ParAccel’s specifications, and that was one of the only parts of the implementation that went smoothly.
“There are a lot of hoops to jump through to be set up for this thing,” he said. “With only the five machines, we had to ask our electricians to come in and draw up two separate-circuit, single-phase, 208-volt services to run these things.”
Provisio’s IT crew also had to calculate how much heat the new hardware would generate and implement the proper cooling for the server room.
“The contractor actually screwed that up, and the cold water feed started pouring in the ceiling and rained down on the server room for a day,” Harrison said. “That was a little disappointing.”
Increased database speed leads to new revenue streams
The increase in speed provided by ParAccel has allowed Provisio to launch a new product line called iTrials Direct Connect. Harrison said the new product allows Provisio clients to log into the system, specify criteria and run queries themselves – and clients can expect the results of those queries to appear almost immediately.
“This is a product that wasn’t even possible before,” Harrison said. “I mean we never even thought of it.”
Looking ahead, he said Provisio might take advantage of the increased speed by offering more new services, such as support for class-action lawsuits related to medical issues.
The company would also be willing to expand the current distributed database environment if necessary. The initial investment was “expensive,” Harrison explained, but the costs associated with expanding are relatively low.
“If we reach a point where the data becomes slow, we’ll just throw a couple more nodes at it, and it’s going to be between six and 10 grand per node,” he said. “That’s really not very much money in the great scheme of things.”