Distributed data management architecture trumps EDW on agility

As data sources continue to increase, organizations are reconsidering enterprise data warehouses for a more flexible approach to data management.

A well-functioning enterprise data warehouse combines information from different subject areas in a central repository, providing senior executives, business managers and operational workers with easy access to clean and consistent data to support the decision-making process. The EDW traditionally has been the only way to provide that kind of data access. But now, the need for increased agility and flexibility is leading many organizations to rethink their strategies and move toward a distributed data management architecture for storing, integrating and managing business intelligence and analytics data.

Without doubt, we're seeing a significant change in the data landscape. It's said that 90% of the world's total data was generated over the past three years. This unprecedented, large-scale surge has made it extremely difficult and expensive for organizations to integrate and maintain data from disparate sources in a central data warehouse. The challenges are further heightened by the increasing focus on unstructured and semi-structured data types, and the exponential growth of new data sources -- for example, syndicated data services, mobile applications and social networks.

In addition, many end users now expect real-time or on-demand access to data. To top it off, data replication and consolidation processes are becoming more complicated as sources multiply, which is adding to maintenance overhead and creating more data quality concerns.

EDW law of diminishing returns

Integrating data from across an organization in an EDW has its advantages -- doing so gives users a comprehensive view of all aspects of the business. But the reality is that most of the time, data from only a subset of subject areas is analyzed together. As a result, the ROI of centrally integrating everything in an EDW starts to diminish as more and more data sources are added.

For example, in a manufacturing company, business executives frequently need to access and analyze a combination of sales, inventory and forecast data. Similarly, having access to a mix of data on raw materials, packaging and procurement contracts can provide a lot of insights to manufacturing managers. But sales and procurement are two different points in a supply chain, and integrating and maintaining data about them in a central repository might not provide enough of a financial return to justify the cost.

That raises a pair of questions: What really needs to be stored in the persistent layer of a physical data warehouse? And how can IT teams best offer access to integrated views of data? At a growing number of organizations, answering those questions is pointing the way to distributed architectures designed to simplify data management and better serve end users.

Distributed data management puzzle

A distributed architecture lets IT manage data in separate systems and create a logical data model that can be used to integrate information for analysis without moving it to a single location. But you need more than relational databases and extract, transform and load tools to make the distributed data management approach work. Such architectures must include the following components as well:

Data virtualization tools. These technologies enable the development of "virtual data warehouses" that provide access to data without first having to extract it from source systems and load it into an EDW. Data virtualization abstracts data from multiple sources to create a unified view for information delivery, analysis and reporting. By avoiding the need to physically move data to a central repository, virtualization makes it easier to add new data sources as information needs evolve; it also provides better access to real-time data and reduces the need to maintain data at multiple layers of an architecture.

Centralized master data management processes. MDM is crucial in a distributed data warehouse architecture because the logical data model only works if the sources agree with one another. Since data is being integrated on the fly from various source systems or siloed data marts that hold subsets of information, it's imperative that the underlying master data for all the different sources conforms to common specifications and formats. Otherwise, data inconsistencies could hamper BI efforts.

Metadata management. Metadata is data about data. In the data warehouse context, there are three main categories: technical metadata that defines tables, fields, partitions and other data structures; business metadata that defines business rules and calculation logic; and process metadata that catalogs what data is available, where it comes from and how different data sets are related to one another. Metadata must be properly maintained in a distributed architecture to help IT teams identify the lineage of information, avoid the introduction of redundant data and optimize the flow and use of data.

Hadoop systems and NoSQL databases. The advent of new data sources in unstructured and semi-structured formats requires organizations to look beyond relational databases geared to structured transaction data. Hadoop clusters and NoSQL database systems can fill that need: They handle a wide variety of data types and can scale to large volumes.
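
To make the three metadata categories described above more concrete, here is a minimal sketch of a catalog that records them for each data set and traces lineage across stores. Everything in it is an assumption made for illustration: the `DatasetMetadata` class, the `lineage` helper, and the data set and store names are hypothetical and do not come from any particular metadata management product.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetMetadata:
    """One catalog entry (hypothetical structure, for illustration only)."""
    name: str
    store: str                # process metadata: where the data lives
    columns: dict             # technical metadata: column name -> type
    business_rule: str = ""   # business metadata: rule or calculation logic
    derived_from: list = field(default_factory=list)  # process metadata: upstream data sets


def lineage(catalog, name):
    """Return every upstream data set of `name`, nearest first, without repeats."""
    seen = []
    queue = list(catalog[name].derived_from)
    while queue:
        current = queue.pop(0)
        if current in seen:
            continue
        seen.append(current)
        queue.extend(catalog[current].derived_from)
    return seen


# Hypothetical entries spanning relational, MDM and Hadoop stores.
entries = [
    DatasetMetadata("crm_orders", "relational",
                    {"sku": "text", "units": "int"}),
    DatasetMetadata("product_master", "mdm_hub",
                    {"sku": "text", "name": "text"}),
    DatasetMetadata("web_clickstream", "hadoop",
                    {"sku": "text", "clicks": "int"}),
    DatasetMetadata("sales_by_product", "virtual_view",
                    {"sku": "text", "units": "int"},
                    business_rule="units summed per conformed SKU",
                    derived_from=["crm_orders", "product_master"]),
    DatasetMetadata("conversion_report", "virtual_view",
                    {"sku": "text", "conversion": "float"},
                    business_rule="units divided by clicks",
                    derived_from=["sales_by_product", "web_clickstream"]),
]
catalog = {entry.name: entry for entry in entries}

print(lineage(catalog, "conversion_report"))
```

Asking for the lineage of `conversion_report` walks back through the virtual view to the physical sources, which is the kind of lookup an IT team needs before changing a source system or adding a data set that may already exist elsewhere.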

Properly set up, a distributed data management architecture will be able to house relational and NoSQL databases, Hadoop systems, and other types of technologies under the same virtual roof. Data stored in any of them can be accessed as required, with no constraints from the end-user perspective. The distributed approach has the potential to save significant amounts of money on data integration, replication, storage and management processes. It also provides more agile integration capabilities, enabling organizations to respond faster to ever-changing analytics requirements -- and gain more insight to help drive better business results.
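
As a rough illustration of querying data in place, the sketch below uses Python's built-in `sqlite3` module, with two attached in-memory databases standing in for separate sales and inventory systems. It is a toy under stated assumptions, not how a commercial data virtualization tool works: the table names, the sample rows and the `conform_sku` rule (a stand-in for a shared master data specification) are all hypothetical.

```python
import sqlite3


def conform_sku(raw):
    """Hypothetical master data rule: SKUs are upper case with no padding."""
    return raw.strip().upper()


def build_virtual_layer():
    """Open one connection that can see two separate (simulated) source systems."""
    conn = sqlite3.connect(":memory:")
    # Expose the master data rule to SQL so it can be applied at query time.
    conn.create_function("conform_sku", 1, conform_sku)
    # Each ATTACH creates its own independent in-memory database.
    conn.execute("ATTACH DATABASE ':memory:' AS sales")
    conn.execute("ATTACH DATABASE ':memory:' AS inventory")
    conn.execute("CREATE TABLE sales.orders (sku TEXT, units_sold INTEGER)")
    conn.execute("CREATE TABLE inventory.stock (sku TEXT, units_on_hand INTEGER)")
    # Sample rows: the sales system formats SKUs inconsistently.
    conn.executemany(
        "INSERT INTO sales.orders VALUES (?, ?)",
        [("sku-001 ", 40), ("SKU-002", 15), ("sku-001", 10)],
    )
    conn.executemany(
        "INSERT INTO inventory.stock VALUES (?, ?)",
        [("SKU-001", 120), ("SKU-002", 5)],
    )
    return conn


def sales_vs_stock(conn):
    """Join the two sources at query time; no rows are copied to a central store."""
    query = """
        SELECT i.sku, SUM(o.units_sold), i.units_on_hand
        FROM inventory.stock AS i
        JOIN sales.orders AS o
          ON conform_sku(o.sku) = conform_sku(i.sku)
        GROUP BY i.sku, i.units_on_hand
        ORDER BY i.sku
    """
    return conn.execute(query).fetchall()


conn = build_virtual_layer()
for row in sales_vs_stock(conn):
    print(row)
```

The join succeeds only because both sides are passed through the same conformance rule, which is why consistent master data matters so much when integration happens on the fly rather than in a load process.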

About the author:
Saurabh Jain is senior director of Mindtree Ltd.'s Data and Analytics Solutions consulting services practice. Based in Wayne, N.J., Jain has 15 years of industry experience and has worked in a variety of roles on a wide range of business intelligence, analytics and data warehouse initiatives.
