Surging volumes of both structured and unstructured data -- what we've come to know as big data -- are putting IT and data management teams under the gun. Information of all types is engulfing computer systems in many organizations, complicating efforts to pull valuable business insights out of it through big data analytics initiatives. At the same time, a cavalcade of new technologies has arrived to help companies cope with the data influx -- but sorting through those technologies is often an intimidating task in itself.
In addition, IT managers must assess whether Hadoop clusters, NoSQL databases and other big data tools can fit comfortably into existing systems architectures or if architectural modifications are needed to accommodate them. The answer varies based on factors such as planned uses, organizational structures and IT maturity.
And the burgeoning business-side interest in deriving value and competitive advantages from vaults of big data means that there isn't a lot of time to make those assessments and choose between the available technology options. In more and more companies, big data is viewed as a valuable resource that business leaders and data scientists want to sift through like prospectors looking for precious metals. This "big data gold rush" puts added pressure on IT and data management strategists to quickly deliver systems that can handle the growing amounts, and increasing variety, of incoming data.
One of the biggest issues in planning a big data strategy is where to put all the data for processing and analysis. It wasn't long ago that transactional data was the primary concern and that the options for managing it tended to boil down to a handful of relational databases. Multidimensional databases, columnar software and other specialized analytical engines added some choices for warehousing data from transaction systems for analysis. Even so, in many companies the big decision was: enterprise data warehouse (EDW) or collection of independent data marts?
Big data technology menu has lots to choose from
But things have changed -- considerably. Collecting and analyzing data from social media sites, sensors, system logs and other non-transactional sources has become a priority for many organizations. And big data technologies that can support those initiatives have proliferated to such an extent that the number of different, and disparate, options is dizzying.
Matthew Aslett, an enterprise software analyst at research and advisory company The 451 Group, has depicted the plethora of data storage and management choices now available in the form of a London Underground subway map, arraying the available technologies as stations along color-coded lines representing different product categories. In addition to conventional databases, a sampling of those categories includes Hadoop file system implementations as well as schema-less NoSQL databases and "NewSQL" hybrids that use SQL-based relational data models but aim to provide NoSQL-like levels of data scalability. Heightening the potential for buyer bewilderment even more, some categories house technologies of widely varying stripes. In particular, NoSQL is an umbrella term that encompasses a diverse mix of graph databases; document, column and key-value stores; and other types of repositories.
Initially, many big data applications were "greenfield" projects that didn't face some of the issues of typical application development initiatives, such as the need to integrate with legacy systems or structured data sources. Often, technology-savvy data analysts and other business users took a first hack at doing something with unstructured or semi-structured data under the radar of IT and business intelligence managers, taking advantage of the open source nature of Hadoop and many NoSQL tools.
But big data is definitely on the corporate radar now, and the drive to incorporate non-transactional forms of data into mainstream analytics processes is making effective deployment and management of big data systems by IT teams a necessity.
Saddle up the right big data tools
The key is to pick the right technology for the job at hand, in the same way bettors at a race track try to choose "the horse for the course," a phrase that refers to the ability of some thoroughbreds to run better on dirt or grass, or on a dry or muddy track. But multiple database horses might be required for different courses within a big data environment.
ThoughtWorks Inc., a Chicago-based software development services company that also sells application lifecycle management tools, has created a hypothetical online retail application framework to illustrate the concept of polyglot persistence, or using a variety of database technologies to handle different types of data based on which technology is the best fit in each individual case. For example, a key-value NoSQL data store might be best for managing website user-session data as part of the retail framework, according to the ThoughtWorks model. But it envisions the use of four other flavors of NoSQL databases for tasks such as processing online shopping-cart data, powering the site's recommendation engine and storing user activity logs.
And SQL-based relational databases still have their place in this new polyglot world. In the online retail framework, relational technology is depicted as a good fit for financial data that requires transactional updates and is best served by a table-based structure. Reporting also could be the province of a relational database with SQL interfaces at the ready for exchanging data with enterprise reporting tools.
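The workload-to-store assignments described above can be sketched as a simple lookup table in Python. This is only an illustration: the dictionary keys and the `pick_store` function are hypothetical names, not part of the ThoughtWorks framework. Only the key-value and relational assignments are spelled out here, so the remaining NoSQL entries are deliberately left generic.

```python
# Illustrative sketch of polyglot persistence as a routing table.
# All names are hypothetical; store categories follow the description above.
STORE_FOR_WORKLOAD = {
    "user_sessions": "key-value NoSQL store",
    "financial_data": "relational database",
    "reporting": "relational database",
    # The specific NoSQL flavor for these three is not stated in the text,
    # so no particular type is assumed.
    "shopping_cart": "NoSQL store (type unspecified)",
    "recommendations": "NoSQL store (type unspecified)",
    "user_activity_logs": "NoSQL store (type unspecified)",
}


def pick_store(workload: str) -> str:
    """Return the store category assigned to a workload, or raise if unknown."""
    try:
        return STORE_FOR_WORKLOAD[workload]
    except KeyError:
        raise ValueError(f"no store assigned for workload: {workload!r}") from None


if __name__ == "__main__":
    for name in ("user_sessions", "financial_data", "shopping_cart"):
        print(f"{name} -> {pick_store(name)}")
```

In a real system each entry would map to a client or connection for that store rather than a label, but the principle is the same: the choice of database is made per type of data, not once for the whole application.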
Relational databases are efficient at processing transactions, and through their support for characteristics such as transactional atomicity and consistency, they offer reliability and data recovery capabilities that NoSQL technologies typically can't match. But relational software often isn't suited to text and other unstructured forms of big data. And it requires "a lot of maintenance on the back end," including the need to carefully construct data schemas and modify them when business requirements change, said Pramod Sadalage, a principal consultant at ThoughtWorks. Those issues are minimized with NoSQL and Hadoop offerings.
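To make the atomicity point concrete, here is a minimal sketch using SQLite through Python's standard `sqlite3` module. The `accounts` table, the `transfer` function and the balances are assumptions made up for illustration; they do not come from the ThoughtWorks framework. The transfer credits one account and then debits another, and if the debit would violate the table's constraint, the whole transaction is rolled back so the earlier credit does not persist.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Credit dst, then debit src, as one transaction. Returns True on success."""
    try:
        # Used as a context manager, the connection commits if the block
        # succeeds and rolls back if it raises.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected a negative balance; the credit
        # applied a moment earlier has been rolled back as well.
        return False


def balances(conn):
    """Return {account id: balance} for all rows."""
    return dict(conn.execute("SELECT id, balance FROM accounts"))


# Hypothetical schema and data for the demonstration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "id TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("a", 100), ("b", 50)])
conn.commit()

ok_small = transfer(conn, "a", "b", 30)   # within a's balance
ok_large = transfer(conn, "a", "b", 500)  # would overdraw a, so it is undone
final = balances(conn)
```

The schema also shows the maintenance trade-off mentioned above: the table structure and its constraint have to be declared up front and altered when requirements change.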
"What we're saying is, 'Give the things that belong to a certain task to a certain database,' " Sadalage said. "If you have, for example, a [product] catalog, put it in a database that is well suited for that -- then searches go faster."