This article originally appeared on the BeyeNETWORK.
The most common challenges that constantly hit the user and management teams are as follows:
- "My query that runs for 10 minutes in 100,000 records runs for 10 hours in 1,000,000 records."
- "My ETL processes are slow or have failed due to large data volumes to process into the data warehouse."
- “My CFO wants to know why we need another $500,000 for infrastructure when we invested a similar amount just 6 months ago.”
- “My database administration team is exhausted trying to keep the tuning process online for the databases to achieve speed.”
The key factors that drive effective data warehouse performance and challenge IT departments are data loading, availability, volume, storage and operational issues.
Loading data into a data warehouse is one of the longest processes in terms of time. The process of extracting various data feeds, processing them through data quality, and then data profiling and loading them with or without transformations to a final destination is time-consuming. This can be especially challenging when input volumes are low with smaller bursts of data and speed is impacted by the volume of data in the data warehouse.
Data availability service level agreements (SLAs) have a profound impact on the need to have a high-performance environment. End user requirements must be clearly documented for data to be pristine, integrated and seamlessly available for downstream applications like reporting and analytics. Additionally, organizations often fall short on data growth projections, data demand projections, data retention cycles and associated SLAs that have not been documented.
Data volumes in the average data warehouse have been exploding by gigabytes every day. Growth rates for capturing and retaining granular details have been increasing over the past 3 years. A few reasons for this data volume explosion could include:
- Compliance requirements
- Legal mandates
- Business mergers and acquisitions
Disk and storage systems have consistently improved over the years, both in terms of speed and performance, while costs have been relatively stable and, in some cases, less expensive. In a traditional architecture, storage is shared across all areas of a data warehouse, making it a highly constrained area in terms of availability and performance. ETL and business intelligence (BI) queries produce large amounts of traffic and consuming a lot of space to compute the complex result sets. Shared disk architecture has never been an answer for the data warehouse.
The cost of maintaining a data warehouse has become monumental in many organizations. With the need for granularity of the data growing and history retention growing, a two-way explosion has resulted in an unmanageable amount of information being processed by the data warehouse. In addition to this data volume, multiple kinds of related activities such as data mining, predictive analysis and other historical trending queries have left IT feeling numb and cold rather than energized and happy with the task of managing this infrastructure. Another indirect effect is heavy demand on resources, both in hardware and IT administration (DBA, system administrator, network administrator roles), which often leaves other tasks incomplete from a holistic perspective.
We have so far outlined the various pain points, the issues and challenges that are faced in managing and maintaining a “high performance” data warehouse. In my next article, we will start looking at why the data warehouse appliance can alleviate a lot of this pain.