Polyglot persistence must be the ugliest term we have ever come up with in the IT industry. It means that different persistence technologies (read: data storage technologies) are used to store data. For example, when an organization has distributed data over Oracle databases, Hadoop, MongoDB, DB2, and Riak, it operates a polyglot persistent environment.
For a long time, the goal of many IT departments was to avoid polyglot persistence and to have a single default database server for storing most of the data. This had several practical benefits relating to management costs, license fees, training costs, and so on.
The rise of NoSQL products, such as MongoDB, Apache HBase, and Cassandra, has changed our views. These products have become so successful because they can reach transaction levels or offer analytical processing speeds that cannot be matched by most SQL products, and because their cost makes them very attractive.
However, most of them are not generic database servers that can be used for any type of application. They are what you could call specialized database servers. For example, MongoDB is very strong in transactions but not ideal for complex analytics, and Hadoop HDFS with MapReduce is great for analytics but not good for interactive reporting. It’s this specialization that makes them excel in a few application areas.
It’s because of their price/performance ratio that organizations acquire them, but only for a limited number of applications. For example, MongoDB may be used to support a highly interactive web application, Hadoop may be used to offload some of the cold data stored in a SQL-based data warehouse, and MarkLogic may be used for storing and analyzing more textual information.
The result of this approach is clear: an organization ends up with a polyglot persistent environment.
So, eventually, polyglot persistence comes with a price. Multiple data storage products mean multiple developer skills, multiple DBA skills, and multiple analyst skills. The programmer has to learn new languages and APIs, the DBA has to learn new backup/recovery utilities and new optimization techniques, and the analyst has to learn the new database concepts and how best to model with them.
But we all know that eventually all that data has to come together; it has to be integrated. Even when the data produced by an application is initially used by only that one application, eventually the data is needed by other applications and other users. For example, the data may have to be integrated for analytical purposes, used in operational applications, or needed to create 360-degree views of customers or products. Databases never end up being islands. Applications may come and go, but data always remains, and over time it has to support an ever-changing set of applications.
Technologies exist for integrating data from different data storage technologies, such as data virtualization, ETL, and ESB. The integration requirements determine which one is best. When organizations intend to invest in a polyglot persistent environment, their study should not restrict itself to the costs of installing and operating the new product; it must also include an evaluation of the somewhat hidden future costs involved in integrating the data stored in that environment. In other words, besides looking at the current requirements of one application, also include future integration requirements. What is needed to integrate the data stored in this new data storage technology with the existing environment?
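To make the data-virtualization idea concrete, here is a minimal sketch of integration-on-read across two stores. All names are hypothetical, and the stores are stand-ins: a plain list of dicts represents documents from a document store such as MongoDB, and an in-memory SQLite table represents a SQL-based warehouse. A "virtualized" customer view combines both sources at query time, without copying the data into one system first:

```python
import sqlite3

# Hypothetical stand-in for documents held in a document store
# (in a real environment these would come from, e.g., MongoDB).
customer_docs = [
    {"customer_id": 1, "name": "Acme Corp", "segment": "enterprise"},
    {"customer_id": 2, "name": "Bitwise Ltd", "segment": "smb"},
]

# Stand-in for a SQL-based data warehouse: an in-memory SQLite table.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
warehouse.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 250.0), (1, 120.0), (2, 75.0)],
)

def customer_view(customer_id):
    """Combine profile data from the document store with order totals
    from the SQL warehouse into one integrated record at read time."""
    profile = next(d for d in customer_docs if d["customer_id"] == customer_id)
    total, = warehouse.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM orders WHERE customer_id = ?",
        (customer_id,),
    ).fetchone()
    return {**profile, "order_total": total}

print(customer_view(1))
# → {'customer_id': 1, 'name': 'Acme Corp', 'segment': 'enterprise', 'order_total': 370.0}
```

An ETL approach would instead copy the documents into the warehouse on a schedule; the sketch above corresponds to the virtualization style, where the cost of integration is paid at query time rather than at load time.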