A buyer's guide to selecting the best data warehouse product
A collection of articles that takes you from defining technology needs to purchasing options
Numerous options exist for buying a data warehouse platform. Although evaluating them need not be a complicated process, taking the appropriate steps will help to ensure that you invest in the best technology possible for your organization's particular business needs.
As we've addressed in this series, there are multiple types of data warehouse platforms, as well as different deployment options and various use cases for data warehousing. Once you've decided to invest in a data warehouse platform, the next step is to create a process for evaluating the available products and then find the one that best fits your requirements. To do this, let's first identify important platform features that enable effective data warehouse development. Then we can pinpoint the deployment option that best matches your requirements.
Features for effective data warehouse development
As we examine the following data warehouse platform features, keep in mind that your implementation and usage requirements will dictate which are most important to you. Not every data warehouse project requires all of the features discussed. As you evaluate specific products (which we'll cover in a separate article), you can use the following to help you dig deeper into the specific features supported by each vendor. With that caveat in mind, a data warehouse provides the following key capabilities:
It offers a consistent view of data. To effectively support business intelligence (BI) applications for analyzing and reporting on historical business activities, a data warehouse platform must be capable of pulling data from multiple source systems and making it look like a single pool of information. Data needed for BI uses is extracted from operational systems and typically transformed to make it consistent, and then loaded into the data warehouse for analysis.
It enables your organization to model and create database designs for data warehousing. A common requirement for a data warehouse is for the database schema to be partially denormalized in order to optimize query and analytical performance. By contrast, online transaction processing systems typically rely on fully normalized schemas to guarantee data consistency and integrity.
In practice, that means the data warehouse is usually designed around a dimensional model, with a central fact of interest and multiple dimensions across which that fact is analyzed. For example, let's say we're interested in analyzing company sales -- there are many dimensions that can shape how this information is analyzed, such as product, territory, store and time. In a simple case, we may ask for total sales (fact) by product (dimension) for January 2016 (dimension) in the Southeast territory (dimension).
To enable such analysis, data warehouses use dimensional models known as star schema and snowflake schema. For a star schema design, multiple dimension tables are associated with a single fact table using a one-to-many relationship. The snowflake schema is similar to the star schema, but dimensions can be stored in multiple, normalized tables instead of a single dimensional table. When reviewing data warehouse platforms, be sure to factor in their built-in support for star schema and snowflake schema database design and query optimization.
It supports OLAP functions to enable your data warehouse to process BI queries. Examples of online analytical processing functions include the ability to drill down, roll up, pivot and rank data. The benefit of OLAP functions is that they enable developers and end users to code less complex queries. Additionally, OLAP functions typically outperform more complex queries conducting the same tasks. If the query and SQL capabilities of the data warehouse platform don't support built-in OLAP functionality, you will likely need to purchase add-on query tools that offer such features.
It provides key performance and query optimization. Being an analytical platform, the data warehouse has different requirements for optimizing queries from an operational or transactional database management system (DBMS). Features useful for maximizing data warehousing performance include support for star join optimization, bitmap indexes and zone maps.
The ability to optimize a star query, in which a fact table is joined to a number of different dimension tables, is an important function of a data warehouse platform. But each platform implements star joins differently. For example, while bitmap indexes are useful for optimizing joins, support for them varies from product to product. Some platforms enable explicit creation of a bitmap index, whereas others generate a bitmap as part of the star join optimization process.
Another performance-related feature to investigate is support for zone maps. A zone is a set of contiguous data blocks or pages on disk. A zone map is a database structure that stores information about the data stored in table zones. Using a zone map, queries can be optimized by pruning data blocks that can't help answer a query, so they aren't accessed.
In-memory functionality. By using memory instead of disk for data storage and processing, performance can be improved. Options include using an in-memory DBMS or adopting a data warehouse platform that delivers in-memory features.
Data movement capabilities. The data warehouse is separate from the operational database systems that run daily business transactions. As such, data needs to be regularly moved from one environment to the other. There are several methods and technologies for moving data, including:
- Simple load and unload utilities.
- ETL functionality for extracting, transforming and loading data.
- Replication technology that captures changed data from source databases and sends only the changes to the target data warehouse.
All of these data movement technologies can be acquired separately from a data warehouse platform. Indeed, if you have in-depth requirements for complex transformations or high-speed replication, an add-on tool is probably your best option because it will usually provide more features and higher functionality. Of course, many data warehouse platforms have built-in data movement capabilities that can fulfill data warehouse transfer needs.
Which type of data warehouse platform is right for you?
Now let's look at the type of data warehouse platform that makes the most sense given your specific project requirements.
Any of the data warehouse platforms -- relational database management system (RDBMS), analytical DBMS, data warehouse appliance or data warehouse as a service (DWaaS) -- can work, regardless of whether you're deploying an enterprise data warehouse (EDW) or a data mart.
For data marts that are being developed and maintained at the department level, it's wise to investigate options that minimize administration and management. The plug-and-play, simple implementation of an appliance lends itself to departmental computing more than the assemble-your-own data warehousing environment required when you implement an analytical or relational database.
DWaaS can be another viable option for a departmental data mart. If much of the data that will go into the data mart is generated in the cloud, DWaaS makes even more sense. On the other hand, if the data mart will be built using large amounts of data that's generated and stored on premises, moving the data to a DWaaS platform can become problematic.
For organizations implementing multiple data marts, management and administration is often centralized, and many advantages of the appliance and DWaaS approaches are mitigated, making the relational and analytical DBMS options more attractive.
When implementing a traditional EDW, a data warehouse appliance is a good option because it delivers a prepackaged combination of hardware and software ready to support data warehousing. For a more extensive data warehouse architecture that includes an EDW, data marts and an operational data store (ODS), a combination of data warehouse platforms can be a compelling option. For example, you may choose to implement the ODS using an RDBMS, while using data warehouse appliances and DWaaS for data marts and an analytical DBMS for the EDW. Such a decision is usually warranted for only the largest organizations with sufficient staffing and expertise in database administration and systems management. Cost is also a factor, as purchasing and supporting multiple data warehouse platforms can be quite expensive.
When extending or expanding your existing data warehouse, the best approach generally is to use your current platform rather than to complicate things by converting to something else. Of course, your team will have added new features to your data warehouse platform since its initial implementation. Be sure to educate yourself on all of these new features before considering a different platform to ensure that you're using it to the fullest extent possible. Embracing a new feature or two, like improved optimization or new OLAP functions, can help you to avoid prematurely jettisoning your existing platform for a shiny new data warehouse appliance or cloud service. However, a conversion to a new data warehouse platform can make sense if your organization's business needs and BI requirements have changed significantly since the deployment of your data warehouse.
For organizations that want to offload support as well as deployment, DWaaS is a better option because the data warehouse architecture is supported by the vendor in the cloud. Also, if your organization does a lot of transactional data processing in the cloud, then DWaaS may be your best option. Keeping data that's generated and stored in the cloud there for data warehousing is a logical approach.
The hybrid transaction/analytical processing (HTAP) approach is becoming more popular because a single platform can be used for multiple purposes. HTAP can minimize the learning curve by reducing the number of different technologies that need to be mastered. Of course, some amount of training still is needed, as this approach typically requires additional new technology and options. Larger and midsize organizations looking to reduce the number of supported technologies and increase flexibility should consider the HTAP approach.
Today's largest data warehouses contain upwards of 10 petabytes of raw data. Supporting the very largest data warehouses demands a flexible approach with customized hardware and software. This usually means a combination of RDBMS and analytical DBMS software running on the fastest servers, storage and networking devices available.
If you need to integrate big data analytics with BI requirements in your data warehouse, you should look at approaches that deliver polyglot data warehousing. The term polyglot is borrowed from the NoSQL database movement, which espouses polyglot persistence -- meaning that data is stored on the most suitable type of DBMS for its intended use.
In a data warehousing environment, which enables traditional BI data to be managed and accessed along with newer types of big data, the polyglot approach involves multiple data platform types. These range from relational and analytical databases to NoSQL DBMSs and new platforms such as Spark and Hadoop. While this adds complexity, it also provides data warehouse users with the ability to couple historical BI with more forward-looking predictive analytics and data mining.
Now that we understand the important features to look for in a data warehouse platform and which support your deployment requirements, the next step is to examine products from the leading vendors. This will be covered in a separate article.
Augmenting traditional EDW with big data technology
Where's the logic in logical data warehouses?
Big data breathes new life into data warehouse software