Can you offer some tips on improving data warehouse performance? We’d like to speed up our data loading and query...
I’m delighted to give you some pointers – but please do bear in mind that any consultant worth his salt would look at the specific issues in a given company before suggesting changes to its data warehouse systems, and I know nothing about your company. Still, here are my top six tips for improving the performance of data warehouses:
- Before even thinking about performance optimization, make sure you’ve identified the existing bottlenecks. If your querying performance is CPU-bound, buying faster disks is a complete waste of money.
- Know your database engine. Performance is often compromised when uses don’t know the ins and outs of their particular database. For example, I’ve seen people using a SQL INSERT statement when the bulk load utility would have been more effective.
I’ve also seen people use DELETE * to empty a table. That can be painfully slow compared with DROP and CREATE, which in turn is much slower than TRUNCATE. But of course, your database software might not support TRUNCATE – which is why you need to know how it works. If you don’t have a good database administrator, it might be worth hiring one from the standpoint of performance optimization alone.
- In terms of querying, think about structuring the data as a MOLAP cube (i.e., a multidimensional online analytical processing one) and cranking up the aggregation until your query performance flies. That may burn up disk space and data processing time, but it can make a huge difference.
- Think about using solid state drives (SSDs). They can be unbelievably cost-effective for disk-bound speed issues. Recently, I was working with a client that had a 35GB OLAP cube that was performing slowly – both in terms of aggregation time and query response. I recommended that they try it on an SSD. As it happened, there was a brand-new PC with a 70GB SSD on the test bench. (It was being built up for a salesperson who was complaining about slow boot times.)
The machine was “appropriated,” the RDBMS was installed on the hard drive and a copy of the OLAP cube was created on the SSD. The cube aggregated much, much faster, but it was in querying that the most dramatic improvements were seen. Some of the queries ran 20 times faster with the same level of aggregation. The cost of the SSD was completely trivial (about $200) when compared with the improvement.
People think about SSDs in terms of laptops. (I do myself: I have a 250GB one in the laptop upon which I am typing these words.) But they also are incredibly applicable to disk-intensive applications.
- If possible, perform extract, transform and load (ETL) processing in memory. On a long ETL job, there may be virtue in caching to disk (in case the process fails), but try to keep the data in RAM during the transformations. And cache to an SSD, not a hard drive.
- Index your analytical structures for analysis, not for transactions. I know that sounds obvious, but I’ve seen countless relational OLAP star schemas where the indexing strategy clearly was based on long experience with transactional systems and little familiarity with analytical ones.
For example, by default many RDBMS engines will index the primary key of a table. That makes lots of sense in a transactional structure, where many of the queries will use those indices – but very little sense in a star schema, where it is quite common for no, none, zero, nada queries to use the primary key indices. On the other hand, all of the analytical columns in dimension tables are highly likely to be searched and yet are often entirely bereft of indices.
Dig Deeper on Data warehouse software
Related Q&A from Mark Whitehorn
Here's a guide to primary, super, foreign and candidate keys, what they're used for in relational database management systems and the differences ... Continue Reading
The unstructured data types common in big data systems are often better managed by a NoSQL database than relational software, Mark Whitehorn says. Continue Reading
Analytics expert Mark Whitehorn explains the strengths of R and how to determine if the open source programming language fits your analytics purposes. Continue Reading