Can you offer some tips on improving data warehouse performance? We’d like to speed up our data loading and query processing performance.
I’m delighted to give you some pointers – but please do bear in mind that any consultant worth his salt would look at the specific issues in a given company before suggesting changes to its data warehouse systems, and I know nothing about your company. Still, here are my top six tips for improving the performance of data warehouses:
- Before even thinking about performance optimization, make sure you’ve identified the existing bottlenecks. If your querying performance is CPU-bound, buying faster disks is a complete waste of money.
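One quick way to make that call is to compare wall-clock time with CPU time for a representative job: if almost all of the elapsed time was spent on the CPU, faster disks won’t help. This is an illustrative sketch (the function names and the 80% threshold are my own, not from any standard tool):

```python
import time

def classify_workload(fn):
    """Run fn and compare wall-clock time with CPU time.

    If most of the elapsed time was spent on the CPU, the job is
    CPU-bound; a large gap suggests it was waiting on I/O (or locks).
    The 0.8 cutoff is an arbitrary illustrative threshold.
    """
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    fn()
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return "cpu-bound" if wall > 0 and cpu / wall > 0.8 else "io-bound"

# A tight arithmetic loop burns CPU for its whole duration...
print(classify_workload(lambda: sum(i * i for i in range(2_000_000))))
# ...while sleeping models a process stalled on disk or network.
print(classify_workload(lambda: time.sleep(0.5)))
```

In practice you would wrap a real query or load job rather than a toy lambda, but the diagnosis logic is the same.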
- Know your database engine. Performance is often compromised when users don’t know the ins and outs of their particular database. For example, I’ve seen people using a SQL INSERT statement when the bulk load utility would have been more effective.
I’ve also seen people use an unqualified DELETE to empty a table. That can be painfully slow compared with DROP and CREATE, which in turn is much slower than TRUNCATE. But of course, your database software might not support TRUNCATE – which is why you need to know how it works. If you don’t have a good database administrator, it might be worth hiring one from the standpoint of performance optimization alone.
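The two contrasts above can be sketched with Python’s built-in sqlite3 module standing in for whatever RDBMS you use (the table and column names are illustrative; note that SQLite itself has no TRUNCATE, which rather proves the know-your-engine point):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging (id INTEGER, amount REAL)")

rows = [(i, i * 1.5) for i in range(10_000)]

# Row-by-row INSERTs pay per-statement overhead every time...
for r in rows[:100]:
    conn.execute("INSERT INTO staging VALUES (?, ?)", r)

# ...whereas a batched load (the closest sqlite3 gets to a bulk-load
# utility) reuses one prepared statement for the whole set.
conn.executemany("INSERT INTO staging VALUES (?, ?)", rows)
conn.commit()
loaded = conn.execute("SELECT COUNT(*) FROM staging").fetchone()[0]

# Emptying the table: DELETE visits and logs every row, while
# DROP + CREATE (or TRUNCATE, where supported) discards the storage outright.
conn.execute("DROP TABLE staging")
conn.execute("CREATE TABLE staging (id INTEGER, amount REAL)")
```

On a real engine, the dedicated bulk loader (and TRUNCATE, if available) will beat even the batched approach shown here.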
- In terms of querying, think about structuring the data as a MOLAP cube (i.e., a multidimensional online analytical processing one) and cranking up the aggregation until your query performance flies. That may burn up disk space and data processing time, but it can make a huge difference.
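The principle behind cranking up aggregation can be shown in miniature: pay the cost of rolling up the fact data once at load time, and queries become cheap lookups instead of scans. A toy sketch (the fact rows and dimension names are invented for illustration; a real MOLAP engine pre-computes these rollups for you):

```python
from collections import defaultdict

# Toy fact table: (region, product, year, sales).
facts = [
    ("EMEA", "widgets", 2010, 120.0),
    ("EMEA", "gadgets", 2010, 80.0),
    ("APAC", "widgets", 2011, 200.0),
    ("EMEA", "widgets", 2011, 150.0),
]

# Pre-aggregate once at load time (this is what costs the extra disk
# space and processing time)...
by_region_year = defaultdict(float)
for region, product, year, sales in facts:
    by_region_year[(region, year)] += sales

# ...so each query becomes a single lookup instead of a scan of the facts.
def query(region, year):
    return by_region_year[(region, year)]

print(query("EMEA", 2010))  # 200.0
```

The trade-off scales with the number of dimension combinations you pre-aggregate, which is exactly the disk-versus-speed decision the tip describes.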
- Think about using solid state drives (SSDs). They can be unbelievably cost-effective for disk-bound speed issues. Recently, I was working with a client that had a 35GB OLAP cube that was performing slowly – both in terms of aggregation time and query response. I recommended that they try it on an SSD. As it happened, there was a brand-new PC with a 70GB SSD on the test bench. (It was being built up for a salesperson who was complaining about slow boot times.)
The machine was “appropriated,” the RDBMS was installed on the hard drive and a copy of the OLAP cube was created on the SSD. The cube aggregated much, much faster, but it was in querying that the most dramatic improvements were seen. Some of the queries ran 20 times faster with the same level of aggregation. The cost of the SSD was completely trivial (about $200) when compared with the improvement.
People think about SSDs in terms of laptops. (I do myself: I have a 250GB one in the laptop upon which I am typing these words.) But they are also extremely well suited to disk-intensive applications.
- If possible, perform extract, transform and load (ETL) processing in memory. On a long ETL job, there may be virtue in caching to disk (in case the process fails), but try to keep the data in RAM during the transformations. And cache to an SSD, not a hard drive.
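The shape of that advice can be sketched as follows: hold the working set in RAM, touch disk only for occasional recovery checkpoints. Everything here is illustrative (the `clean_row` transform, the checkpoint interval and file layout are my inventions, not any particular ETL tool’s API):

```python
import json
import os
import tempfile

def clean_row(row):
    # Hypothetical transform: apply a 10% uplift, done entirely in memory.
    return {"id": row["id"], "amount": round(row["amount"] * 1.1, 2)}

def run_etl(rows, checkpoint_every=1000, checkpoint_dir=None):
    """Transform rows in RAM; spill a recovery checkpoint only every
    checkpoint_every rows (ideally onto an SSD, not a hard drive)."""
    checkpoint_dir = checkpoint_dir or tempfile.mkdtemp()
    out = []
    for i, row in enumerate(rows, 1):
        out.append(clean_row(row))          # in-memory transformation
        if i % checkpoint_every == 0:       # rare, batched disk touch
            path = os.path.join(checkpoint_dir, f"ckpt_{i}.json")
            with open(path, "w") as f:
                json.dump(out[-checkpoint_every:], f)
    return out

rows = [{"id": i, "amount": 10.0} for i in range(2500)]
result = run_etl(rows)
```

If the job dies, you restart from the last checkpoint rather than from scratch, without having paid per-row disk I/O along the way.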
- Index your analytical structures for analysis, not for transactions. I know that sounds obvious, but I’ve seen countless relational OLAP star schemas where the indexing strategy clearly was based on long experience with transactional systems and little familiarity with analytical ones.
For example, by default many RDBMS engines will index the primary key of a table. That makes lots of sense in a transactional structure, where many of the queries will use those indices – but very little sense in a star schema, where it is quite common for no, none, zero, nada queries to use the primary key indices. On the other hand, all of the analytical columns in dimension tables are highly likely to be searched and yet are often entirely bereft of indices.
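You can see the difference directly by asking the engine for its query plan. A minimal sketch using sqlite3 (the star-schema table and column names are invented; `EXPLAIN QUERY PLAN` is SQLite’s syntax, and your RDBMS will have its own equivalent, such as `EXPLAIN`):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Minimal star schema: one dimension table and one fact table.
conn.execute("CREATE TABLE dim_customer "
             "(customer_key INTEGER PRIMARY KEY, city TEXT, segment TEXT)")
conn.execute("CREATE TABLE fact_sales (customer_key INTEGER, amount REAL)")

# Analytical queries filter on dimension attributes, so index those columns.
conn.execute("CREATE INDEX idx_customer_city ON dim_customer (city)")

conn.executemany(
    "INSERT INTO dim_customer VALUES (?, ?, ?)",
    [(i, "London" if i % 2 else "Paris", "retail") for i in range(1000)])

# The plan shows the filter using the index rather than scanning the table.
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT customer_key FROM dim_customer WHERE city = ?",
    ("London",)).fetchall()
print(plan)
```

Without `idx_customer_city`, the same plan would report a full scan of `dim_customer` – which is precisely the fate of those un-indexed analytical columns in the star schemas described above.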
This was first published in May 2011