A buyer's guide to selecting the best data warehouse product
A collection of articles that takes you from defining technology needs to purchasing options
Data warehouses offer a window into an organization's historical performance and ongoing operations, providing data analysts and business users with information on things such as customer behavior, business trends, and quarterly and annual sales. Despite the emergence of Hadoop and other big data technologies, the growing need for companies to capture and analyze data from various sources is keeping the data warehouse as relevant as ever, if not more so. But before investing in a data warehouse platform, the first step is to examine whether your organization really needs one and what business benefits it can receive.
To accomplish this, you must consider the two data warehouse deployment options -- enterprise-wide or departmental. You also need to determine if unstructured big data will be a component of the data warehouse environment and decide whether to integrate traditional data warehousing for online analytical processing (OLAP) uses with data processing and management for big data analytics. Finally, you must be able to match the various use cases for data warehousing to the most appropriate data warehouse platform types.
Why does your organization need a data warehouse?
The general concept of data warehousing is quite simple: Data is regularly extracted from the operational systems that support the business and copied to a specialized system -- a data warehouse -- for analysis and reporting via dashboards, portals and business intelligence, reporting, and analytics tools. The following conditions may indicate that your organization could benefit from a data warehouse:
- You're struggling to report effectively on business activities within your company because required data isn't readily available.
- Data is being copied separately by different departments and groups for analysis in spreadsheets that aren't necessarily consistent with one another.
- Uncertainties about the accuracy of data are causing corporate executives and business managers to question the veracity of reports.
- BI reporting against production databases causes nightly or monthly processing of transaction data to be extended.
- Running ad hoc queries against production databases slows down operational systems.
With a properly implemented data warehouse, you can help your organization accurately answer questions about your business, such as what happened and why. Data warehousing improves data availability because it collects data from disparate locations and sources into a central repository. Once the data is in the warehouse instead of in production databases, operational workflows become more efficient because analytical activity has been moved to a separate system. As it moves through to the data warehouse, the data is evaluated, cleansed and transformed; this means that the quality of the information in reports generated from the data warehouse should be improved.
Data warehouse deployment options
The data warehouse environment can differ greatly across organizations, however. Deployments can follow one of two paths -- an enterprise data warehouse (EDW) or a data mart or a combination of these.
An EDW is architected to contain all of the pertinent data from an enterprise's operational systems -- and perhaps external data sources -- and is used across all departments. The data is manipulated for query purposes, transforming and aggregating it for BI gathering (see Figure 1).
Some organizations have implemented an operational data store (ODS) as an interim step between the operational systems and the data warehouse. Operational data is copied to the ODS and then extracted for use in the data warehouse. The ODS offers a single source for pretransformed data and is available should analytical queries against the data warehouse require more details than are available in the data warehouse.
Another popular approach is to create data marts, which are smaller-in-scope data warehouses that focus on individual business and subject areas. Organizations often opt to build data marts when a specific departmental need for reporting and BI can deliver a quick ROI. Instead of undertaking an expansive project that encompasses the entire enterprise, a data mart can be more focused and provide business benefits more quickly.
One or more data marts may be created, depending on the size and focus of your organization. The data marts can be integrated with other data marts to form a virtual EDW or used to physically populate an EDW in organizations that combine the two approaches (see Figure 2). Adopting the data mart approach enables an organization to take an iterative approach, tackling individual subject areas one at a time instead of the monolithic approach of an EDW.
Both of these deployment options are possible with each of the types of data warehouse platforms available: traditional database management systems, usually relational; specialized analytical DBMSes; data warehouse appliances; and cloud-based data warehouse as a service (DWaaS) technologies.
Big data versus data warehouse
The marketplace is awash in confusion regarding big data technologies and whether they're on the verge of replacing data warehousing. But in most organizations, that isn't the case. Instead, big data systems usually will augment the data warehouse.
A data warehouse platform is typically based on a relational DBMS, and the data in it is structured and generally originates from an organization's operational and transactional systems. Data warehouses are accessed by business executives and analysts using BI dashboards, OLAP and reporting tools, and ad hoc SQL queries.
Big data analytics, on the other hand, is typically supported by nonrelational technologies such as Hadoop, Spark and NoSQL DBMSes. The data can be both structured and unstructured, and can originate from every type of internal system plus external data sources, such as social media. Analytics are performed on big data for discovery and insight -- for example, to predict future patterns and behavior based on past activities. These tasks are accomplished using in-depth analytical models developed by data scientists and statisticians in languages such as Python, R and Scala -- perhaps with some rudimentary SQL as well.
Big data technologies also enable the concept of the data lake -- a repository of raw data from various sources that can be kept in its native formats and then filtered and prepared for analysis as needed.
In the end, though, big data systems and data lakes aren't direct replacements for data warehouses and data marts. In fact, there's a synergy between them. For example, sets of structured data collected in a data lake can be moved to a data warehouse for conventional analysis. Vendors also have begun to integrate big data access into their data warehouse platforms, enabling data stored in different formats to be accessed from a single place.
Business use cases for data warehouses
Now let's examine the different use cases and scenarios whereby your organization will approach its data warehouse projects.
First time vs. long-time users of data warehousing: Companies with an existing data warehouse implementation that are looking to expand typically would do well to extend the usage of the data warehouse platform that's already in place. For organizations implementing their first data warehouse, a data warehouse appliance can be a compelling option. The chosen vendor will work with you to determine the appropriate size and configuration of the appliance, which is delivered ready-to-go. Of course, all data warehouse platforms are viable options for first-time users, and you should examine the additional use cases in this section to develop your specific requirements.
Company size: Larger companies might choose one of these approaches: relational DBMS, analytical database or hybrid transaction/analytical processing (HTAP). These organizations have the infrastructure and staff to successfully use such technologies. In the HTAP approach, the same DBMS can be used for both operational transactions and analytical modeling and BI. To support HTAP, a relational engine may offer built-in columnar access or an add-on appliance for analytical processing.
On the other hand, smaller organizations should consider a data warehouse appliance or DWaaS setup. Both options minimize administrative needs: The appliance comes preconfigured, and the data warehouse in the cloud provided by DWaaS is set up and managed by the cloud service provider.
Cloud usage: If your organization runs a lot of applications in the cloud, DWaaS may be your best option. Keeping business data that's generated and stored in the cloud there for data warehousing and analysis is a logical approach.
Size of IT department: Organizations with smaller IT departments tend to have fewer administrators who can set up and tune the data warehouse environment. For them, it also makes sense to consider a data warehouse appliance or DWaaS.
Data latency requirements: For near-real-time data warehouses requiring minimal latency, optimized data warehouse appliances likely make the most sense. It isn't impossible to achieve near-real-time latency with other approaches, but you may need to augment your data warehouse environment with additional high-speed hardware and software to meet your latency objectives.
Need to integrate with big data: When integration with big data analytics is important, consider data warehouse appliances and DWaaS offerings. They tend to have the greatest integration with big data analytics capabilities.
Now that you understand the general use cases for data warehouses, the next step is to investigate the available options for buying a data warehouse platform. That will be examined in the next article in this series.
Note to big data vendors: Stop bashing data warehouses
Check out TechTarget's 2015 salary survey for BI and big data professionals
Keeping your head above water with Hadoop data lakes