Can you offer some tips on improving
data warehouse performance? We’d like to speed up our data loading and query processing
performance.
I’m delighted to give you some pointers – but please do bear in mind that any consultant worth
their salt would look at the specific issues in a given company before suggesting changes to its data
warehouse systems, and I know nothing about your company. Still, here are my top six tips for
improving the performance of data warehouses:
- Before even thinking about performance optimization, make sure you’ve identified the existing
bottlenecks. If your querying performance is CPU-bound, buying faster disks is a complete waste of
money.
- Know your database
engine. Performance is often compromised when users don’t know the ins and outs of their
particular database. For example, I’ve seen people using a SQL INSERT statement when the bulk load
utility would have been more effective.
I’ve also seen people use an unqualified DELETE (no WHERE clause) to empty a table. That can be painfully slow compared with DROP
and CREATE, which in turn is much slower than TRUNCATE. But of course, your database software might
not support TRUNCATE – which is why you need to know how it works. If you don’t have a good
database administrator, it might be worth hiring one from the standpoint of performance
optimization alone.
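To make the bulk-load point concrete, here is a small sketch using Python's sqlite3 module as a stand-in for a real warehouse RDBMS (which would have a proper bulk loader, such as PostgreSQL's COPY or SQL Server's bcp). The table name and data are invented for illustration; the contrast is between committing one row at a time and loading the whole batch in a single statement and transaction:

```python
import sqlite3
import time

def load_row_by_row(conn, rows):
    # One INSERT per row, committed individually: the pattern to avoid.
    for r in rows:
        conn.execute("INSERT INTO fact_sales VALUES (?, ?, ?)", r)
        conn.commit()

def load_batched(conn, rows):
    # One statement, one transaction -- the closest sqlite3 gets
    # to a bulk-load utility.
    with conn:
        conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)", rows)

rows = [(i, i % 50, i * 1.5) for i in range(20_000)]

for loader in (load_row_by_row, load_batched):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE fact_sales (id INTEGER, store INTEGER, amount REAL)")
    start = time.perf_counter()
    loader(conn, rows)
    elapsed = time.perf_counter() - start
    count = conn.execute("SELECT COUNT(*) FROM fact_sales").fetchone()[0]
    print(f"{loader.__name__}: {count} rows in {elapsed:.3f}s")
    conn.close()
```

On a disk-backed database the gap is far larger than this in-memory toy suggests, because each per-row commit forces a write to durable storage.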
- In terms of querying, think about structuring the data
as a MOLAP cube (i.e., a multidimensional online analytical processing one) and cranking up the
aggregation until your query performance flies. That may burn up disk space and data processing
time, but it can make a huge difference.
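The pre-aggregation idea behind a MOLAP cube can be sketched in a few lines of Python (a toy stand-in for a real OLAP engine; the fact rows, dimensions and figures are invented for illustration). The work of summing happens once at load time, so each query is a single lookup rather than a scan:

```python
from collections import defaultdict

# Hypothetical raw fact rows: (region, product, month, revenue).
facts = [("EU", "widget", m, 100.0 + m) for m in range(1, 13)] + \
        [("US", "widget", m, 200.0 + m) for m in range(1, 13)]

# Pre-aggregate once, as a MOLAP engine would: one cell per
# (region, product), with the month dimension rolled up.
cube = defaultdict(float)
for region, product, month, revenue in facts:
    cube[(region, product)] += revenue

def revenue_for(region, product):
    # The query is answered from the aggregate -- no scan of raw facts.
    return cube[(region, product)]

print(revenue_for("EU", "widget"))  # -> 1278.0
```

Cranking up aggregation in a real cube means materializing many more such cells, across more dimension combinations, which is where the extra disk space and processing time go.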
- Think about using solid
state drives (SSDs). They can be unbelievably cost-effective for disk-bound speed issues.
Recently, I was working with a client that had a 35GB OLAP cube that was performing slowly – both
in terms of aggregation time and query response. I recommended that they try it on an SSD. As it
happened, there was a brand-new PC with a 70GB SSD on the test bench. (It was being built up for a
salesperson who was complaining about slow boot times.)
The machine was “appropriated,” the RDBMS was
installed on the hard drive and a copy of the OLAP cube was created on the SSD. The cube
aggregated much, much faster, but it was in querying that the most dramatic improvements were seen.
Some of the queries ran 20 times faster with the same level of aggregation. The cost of the SSD was
completely trivial (about $200) when compared with the improvement.
People think about SSDs in terms of laptops. (I do myself: I have a 250GB one in the laptop upon
which I am typing these words.) But they are also remarkably effective for disk-intensive
server applications.
- If possible, perform extract, transform and load (ETL) processing in memory. On a long ETL job,
there may be virtue in caching to disk (in case the process fails), but try to keep the data in RAM
during the transformations. And cache to an SSD, not a hard drive.
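A minimal sketch of that pattern, in Python with invented record shapes and a made-up transform: the working set stays in RAM throughout, and the job only touches disk to write periodic restart checkpoints (which, per the advice above, you would point at an SSD):

```python
import json
import os
import tempfile

def transform(record):
    # Hypothetical cleanup step: trim names, normalise amounts to cents.
    return {"name": record["name"].strip(), "cents": round(record["amount"] * 100)}

def run_etl(records, checkpoint_every=2):
    # All transformation happens on the in-memory list `staged`;
    # disk is used only for recovery checkpoints.
    staged, checkpoints = [], []
    for i, rec in enumerate(records, 1):
        staged.append(transform(rec))
        if i % checkpoint_every == 0:
            fd, path = tempfile.mkstemp(suffix=".json")  # SSD-backed in production
            with os.fdopen(fd, "w") as f:
                json.dump(staged, f)
            checkpoints.append(path)
    return staged, checkpoints

raw = [{"name": " Ann ", "amount": 9.99}, {"name": "Bo", "amount": 1.5}]
loaded, ckpts = run_etl(raw)
print(loaded)
```

The checkpoint interval is the tuning knob: checkpoint too often and you are back to being disk-bound; too rarely and a failure costs you a long re-run.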
- Index your analytical structures for analysis, not for transactions. I know that sounds
obvious, but I’ve seen countless relational OLAP star schemas where the indexing strategy clearly
was based on long experience with transactional systems and little familiarity with analytical
ones.
For example, by default many RDBMS engines will index
the primary key of a table. That makes lots of sense in a transactional structure, where many of
the queries will use those indices – but very little sense in a star schema, where it is quite
common for no, none, zero, nada queries to use the primary key indices.
On the other hand, all of the analytical columns in dimension tables are highly likely to be
searched and yet are often entirely bereft of indices.
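The point shows up clearly in a query plan. Here is a sketch using sqlite3 with a hypothetical dim_customer dimension table: the surrogate primary key gets its index automatically, but it is the index on the analytical column (city) that the filtering query actually uses:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- A small dimension table in a hypothetical star schema.
    CREATE TABLE dim_customer (
        customer_key INTEGER PRIMARY KEY,  -- surrogate key, rarely queried
        city TEXT,                         -- analytical column users filter on
        segment TEXT
    );
    -- Index the column analysts actually search:
    CREATE INDEX ix_customer_city ON dim_customer (city);
""")
conn.executemany(
    "INSERT INTO dim_customer VALUES (?, ?, ?)",
    [(i, f"city{i % 10}", "retail") for i in range(1000)],
)

# Ask the engine how it would answer a typical analytical filter.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM dim_customer WHERE city = 'city3'"
).fetchall()
print(plan)  # the plan reports a search using ix_customer_city
```

Drop the CREATE INDEX line and the same plan degrades to a full table scan, which is exactly the fate of those "bereft of indices" dimension columns.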