Buyer’s Guide: Selecting a data warehouse platform, software or tool

With companies using more data every day, selecting the right data warehouse software platform or data warehouse tool is vital. This guide will help readers through the evaluation and selection process.

Data warehouse software buyer's guide Although data warehouses have become the norm for most companies, preparing yourself to evaluate and purchase a data warehouse platform or appliance is no less important now than it was before. This guide provides an overview of data warehouse concepts and how to evaluate data warehouse software, plus an explanation of why companies should build data warehouses and tips on keeping your data warehouse platform running smoothly after an implementation.

In this section of the Data Warehouse Buyer's Guide, get expert advice on how to select the right data warehouse platform to fit your business needs, including data warehouse software criteria and architecture considerations.

Data Warehouse Buyer's Guide Table of Contents:
Data warehouse guide Selecting the right data warehouse platform for your organization
Data warehouse guide How to evaluate data warehouse software in five steps
Data warehouse guide Commodity hardware aiding data warehouse appliance performance, costs
Data warehouse guide Why you should build a data warehouse
Data warehouse guide Workload management tools key to running busy data warehouses

Selecting the right data warehouse platform for your organization
By William McKnight, Contributor

As companies consider their ability to manage the asset that is information, the data warehouse platform (and its database management system, or DBMS) is the most important decision point. The platform is the foundational component of the tool selections, the consultancy hires, the architecture, etc. In short, it defines your information culture.

However, in selecting the platform to support the data warehouse, organizations are faced with an exponentially higher number of variations and distinct departures from the traditional online transactional processing (OLTP) database management systems than ever before.

Over time, data warehouse data volumes will continue to soar as organizational history accumulates, syndicated data is collected and new sources with more detailed data are added. Furthermore, the community consuming the data continues to grow, expanding well beyond company boundaries to customers, supply chain partners and even users on the Internet. Companies need to make sure they choose a proven platform not just for the initial known requirements but also one with the ability to scale to future, to-be-determined requirements.

Data warehouses, to be successful, need to provide:

  • Quality data that is available to users when they need it.
  • An architecture that enables low long-term total cost of ownership (TCO).
  • Good query performance that results in increased interactive usage.
  • The ability to integrate real-time data feeds.
  • A platform to support mixed and unpredictable workloads.
  • A scalable path forward as data needs grow.

Criteria for data warehouse platform selection
The decision process for selecting a data warehouse platform should go well beyond the usual consideration of the operational DBMS vendor. Today, that decision can be nuanced along several potential requirements, including:

  • Active loading of data and immediate access to loaded data.
  • A mixed processing workload against the data.
  • Cross-functional complexity.
  • The desired level of query concurrency.
  • The organization's platform scalability needs.
  • The required DBMS functionality.

Given the state of the marketplace, the technical architecture for a data warehouse platform should be:

Scalable. Your data warehouse platform should be able to scale in both performance capacity and incremental data volume growth. Make sure the proposed platform scales in a near-linear fashion and behaves consistently with growth in database size as well as the number of concurrent users and the complexity of queries. Understand the additional hardware and software required for each of the incremental uses.

Powerful. The platform should be designed for complex decision-support activity in a multiuser, mixed-workload environment. Check on the maturity of the query optimizer for supporting all types of queries with good performance and to determine the best execution plan based on changing data demographics. Check on conditional parallelism and the causes of any variations in the parallelism deployed. Check on dynamic and controllable prioritization of resources for queries.

Manageable. The technology you choose should need only minimal support tasks requiring database administrator (DBA) or systems administrator intervention. It should provide a single point of control to simplify system administration, and you should be able to create and implement new tables and indexes at will. In addition, it should include a robust set of mnagement features and functions, including DBA productivity tools, monitoring utilities, locking schemes and other security mechanisms, remote maintenance capabilities and user chargeback functionality.

Extensible. The data warehouse platform should provide a flexible database design and system architecture that can keep pace with evolving business requirements and that leverages existing investments in hardware and applications. Ask such questions as: What is required to add and delete columns? What is the impact of repartitioning tables?

Available. The platform should support mission-critical business applications with minimal downtime. Check on “hot pluggable” components, and understand your system downtime requirements and any technical issues that might deny or degrade service to end users. That can include batch load times, software/ hardware upgrades, severe system performance issues and system maintenance outages.

Interoperable. You should choose a data warehouse platform that allows for integrated access to data on the Web, internal networks and corporate mainframes.

Affordable. The proposed technology – hardware and software, plus services and required customer support fees – should provide a low TCO over a multiyear period.

Flexible. The platform should be able to provide optimal performance across the full range of normalized, star and hybrid data schemas with large numbers of tables. Look for a proven ability to support multiple applications from different business units, leveraging data that is integrated across business functions and subject areas.

Referenceable. There may not be one single customer reference that matches your environment exactly, but you should be able to see a consistent trend across a wide range of references that reinforces what you're looking for in a data warehouse platform.

Organizations should also consider vendor viability, especially in these days of marketplace consolidation. The financial stability of vendors, the importance of data warehousing to their overall business strategies and their continued spending on research and development in these areas toward a well-developed and relevant technology vision are all key components of this critical decision.

Data warehouse architecture considerations
To enable the above criteria, companies have some new options in the types of data warehouses that are available, many of which have been created in recent years owing to the high TCO and potential limitations of traditional data warehousing approaches.

Typically, data warehouse environments have several restrictions that organizations have "learned to live with” in seeking acceptable analytical performance. Indexes, summary tables, cubes, utility executions (i.e., reorganizations) and various de-normalizations are often introduced into the environment to help boost performance. There's a point at which overcoming these challenges ceases to be "easy" and affordable and therefore ceases to get done. But perhaps more important than these very real restrictions are the restrictions that they can lead to in "possibilities thinking" and information exploitation. The higher the untuned performance of a data warehouse environment is, the better.

The following technologies may help organizations avoid some of the performance restrictions:

Columnar databases. A row-oriented database is an implementation of relational theory. Data is stored as bytes, with all of the columns in a row stored in order. These bytes are grouped by the several thousand – from 4,000 to 64,000 – into data blocks. This is the unit of input/output, an exception being the “pre-fetch” capability of a row-oriented DBMS to sense a pattern in the data reads. A columnar DBMS is also an implementation of relational theory, but with a twist: The data storage layer contains a grouping of columns. For example, all of the Column 1 values are physically stored together, followed by all of the Column 2 values, etc. The data is stored in record order, so the 100th entry for Column 1 and the 100th entry for Column 2 belong to the same input record.

In a columnar database, all of the same data – your data – is there. It’s just organized differently (automatically, by the DBMS). The main benefits include the ability to highly compress the data. A byproduct of the increased compression is the ability to perform columnar operations – like MIN, MAX, SUM, COUNT and AVG – very rapidly.

Data warehouse appliances. Data warehouse appliances have emerged as viable short-list solutions for new deployments or refurbished data warehousing implementations. Appliance vendors now dot the landscape (with representation in analyst "quadrants" and "spectrums" plus hundreds of millions of dollars in venture backing), and the top data warehousing vendors have jumped into the appliance business through internal development or acquisition. The devices are worthy of any data warehouse professional’s understanding, attention and consideration.

The data warehouse appliance is a combination of hardware, software, operating system, DBMS and storage pre-configured for data management requirements and uses. Many utilize commodity components, and some include open source DBMS software. These open source DBMSs provide a starting point for basic database functionality, and appliance vendors focus on necessary functionality enhancements. Query performance, especially against large volumes of data, is distinctively impressive thanks to the automatic parallelism that many, though not all, appliances provide. Low TCO for a mixed-workload data warehouse environment is also possible, and consequential, with appliances.

Parallel architecture. Parallelism is as important as columnar orientation to query performance success, and massively parallel processing (MPP) systems provide the most parallelism possible. Sharing disk and/or memory across nodes creates overhead. Sharing nothing minimizes disk access bottlenecks and is preferred in MPP, which also features an exponential uptake from symmetric multiprocessing systems and clusters in the number of processors supported and the sophistication of the interconnect. MPP is the generally acknowledged ideal parallel architecture for analytic querying and other data warehousing and business intelligence (BI) applications.

Packaged analytics. Finally, most large enterprise software vendors now offer some type of pre-built reports or dashboards centered on typical business functions such as finance, sales and marketing, and supply chain management. In addition, a number of smaller companies have developed packaged analytics built on licensed platforms and targeted to specific industries, applications or specialized functions, or to supplement other third-party services.

While a true “data warehouse in a box” has remained somewhat elusive, analytic packages bridge the gap between off-the-shelf canned reports and heavily customized BI software. Such packages simplify the development process and deliver functionality with more cost certainty. Organizations must complete a careful evaluation process to determine whether a package is right for them, and they must understand the various costs and benefits of different options. With all of this information at hand, decision makers will be better positioned to make the best choice for their organizations.

Data warehouse selection process
Once the possibilities have been narrowed down according to the evaluation of the available technologies and options, a proof of concept (POC) project should be done with your data. Data warehouse vendors all have the ability to take in your data for an off-site POC. These off-site data centers are well worth using. You can “visit your data” in person during the process, if you wish.

You need to locate and then gather representative data as well as determine the data access paths that you wish to explore in a POC. The proof of concept should comprise these phases:

  1. Prepare and gather data.
  2. Set up environment(s) at off-site location(s).
  3. Validate data-access tool connectivity, performance and load capabilities, with an emphasis on concurrency.
  4. Compile qualitative information on each offering being considered, using published information, consulting help and customer references.
  5. Weigh quantitative and qualitative recommendations on each vendor.
  6. Make sure each important criterion is covered in the POC.

These days, the best answer to your data warehouse platform selection needs may not be as evident as it was in the past. Given the high relevance of data and information, investment and innovation continue in this area and there are a surprisingly large number of options. Many platforms may work but at high short- and long-term prices. Making a selection consistent with your goals as an organization and with a broad and detailed view of the landscape is paramount to success.

William McKnightAbout the author: William McKnight is the president of McKnight Consulting Group. He works as a strategist, lead enterprise information architect and program manager for complex, high-volume and full-lifecycle implementations worldwide, within the disciplines of data warehousing, master data management (MDM), business intelligence (BI), data quality and operational BI. Implementations by his teams, managed from both IT and consulting positions, have won IT best-practices awards. McKnight is a Southwest Entrepreneur of the Year award finalist and a frequent best-practices judge; in addition, he has authored more than 150 articles and white papers and given over 150 keynotes and public seminars. He holds an MBA and is a former VP of IT at a Fortune 50 company and a former DB2 engineer at IBM. McKnight is also the author of the book 90 Days to Success in Consulting. He can be reached at [email protected].

Data Warehouse Buyer's Guide Table of Contents:
Data warehouse guide Selecting the right data warehouse platform for your organization
Data warehouse guide How to evaluate data warehouse software in five steps
Data warehouse guide Commodity hardware aiding data warehouse appliance performance, costs
Data warehouse guide Why you should build a data warehouse
Data warehouse guide Workload management tools key to running busy data warehouses

Dig Deeper on Data warehouse software