IT shops continue to be challenged by the unbridled growth of their organization's data stores. Big data specialists need to capture, analyze and present an ever-increasing amount of data to their end users.
For this reason, many organizations are moving their data management infrastructure to the cloud, which offers the benefits of scalability and managed services. But with so many options from different vendors available, it's important to know what you're getting into before making the database cloud migration leap.
Data volumes that were rare a few years ago are now commonplace. Day-to-day operational systems are storing such large amounts of data that they rival data warehouses in disk storage and processing complexity. Advancements in IoT, machine learning and artificial intelligence are contributing to an astonishing amount of data generation.
An IDC study sponsored by Seagate states that the volume of data we store will grow from 33 zettabytes in 2018 to 175 zettabytes by 2025. The study also shows that almost 30% of the world's data will require real-time processing, and IDC estimates that the IoT segment alone will create 79.4 zettabytes of new information during this time.
Vendors of all sizes are capitalizing on this rapid growth by offering a wealth of big data products to the IT community. Evaluating the endless array of cloud data pipeline offerings can challenge even the savviest big data experts. IDC also forecasts double-digit annual growth for the big data tools market, which it expects to reach $274 billion by 2022.
Shops evaluating an on-premises to cloud big data pipeline migration need to begin their analysis by answering these fundamental questions about their cloud database architecture:
- Do we migrate our system as it currently exists, or should we rethink the architecture?
- Is our current architecture meeting our needs?
- Is it flexible enough to satisfy our organization's future big data analysis requirements?
- Will our current system allow us to easily leverage new data pipeline products and advancements in big data technologies?
- Do we design, install and administer our own architecture, or do we rent the platform from a cloud provider?
- How will cloud data transfers affect our environment?
- What modifications will we need to make to our current systems and network architecture to enable high-performance data transfers to the cloud? (The analysis should include separate evaluations for batch and streaming data transfers.)
- How will the cloud impact our adherence to governmental, industry-specific or organizational regulatory compliance frameworks like PCI, HIPAA, NIST, NERC and GDPR?
- What additional costs will we incur? (Examples include staff training, organizational changes and infrastructure enhancements.)
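When weighing the data transfer questions above, a back-of-the-envelope calculation is a useful starting point. The sketch below is illustrative only; the function name, default utilization factor, and example figures are assumptions, not values from any vendor's documentation.

```python
def transfer_days(volume_tb: float, bandwidth_gbps: float,
                  utilization: float = 0.7) -> float:
    """Estimate the days needed to move volume_tb terabytes over a link
    of bandwidth_gbps gigabits per second at a sustained utilization."""
    volume_bits = volume_tb * 1e12 * 8            # terabytes -> bits
    effective_bps = bandwidth_gbps * 1e9 * utilization
    seconds = volume_bits / effective_bps
    return seconds / 86_400                       # seconds -> days

# Hypothetical example: 500 TB of batch data over a 10 Gbps link
# sustained at 70% utilization.
print(round(transfer_days(500, 10), 1))           # ~6.6 days
```

Even a rough estimate like this can reveal whether a batch migration fits the available window or whether network upgrades, physical transfer appliances or a phased approach are needed; streaming workloads require a separate throughput analysis.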
Cloud data pipeline platform offerings
While you're considering your database cloud migration options, let's review the big data offerings from the three largest cloud providers. This will be a brief introduction to the data pipeline products available.
Microsoft's HDInsight provides a robust suite of Apache products, including Hadoop, Spark, Hive LLAP, Kafka, HBase and Storm. In addition, the HDInsight platform offers Microsoft Machine Learning (ML) Server for R-based analytics.
Microsoft offers Event Hubs for real-time streaming data ingestion and Azure Databricks for Apache Spark-based analytics.
AWS offers a wide range of big data storage and data pipeline products. Amazon EMR provides customers with access to Amazon EC2 and S3, the Hadoop Distributed File System (HDFS) and Amazon DynamoDB. Amazon EMR big data products include Apache Spark, Hive, HBase, Flink and Hudi.
Google also offers a robust product suite that covers every aspect of big data analytics. Product offerings include Cloud Dataflow for batch and stream processing, Cloud Pub/Sub for event ingestion and Cloud Dataproc, which is a fully managed Apache Hadoop and Spark service.
In addition to the examples above, a few of the more popular competitors include industry favorite Cloudera as well as Qubole and IBM.
Performing your data pipeline platform evaluation
Chris Foot, Senior Strategist, RadixBay
Shops evaluating data pipeline products should follow a standardized product evaluation methodology to facilitate the selection process. Evaluation best practices include selecting the appropriate evaluation team, performing a thorough needs analysis and creating a robust set of weighted evaluation metrics.
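One way to make the weighted evaluation metrics concrete is a simple scoring matrix. The criteria, weights and vendor scores below are hypothetical placeholders to show the mechanics; a real evaluation team would define its own criteria and weights during the needs analysis.

```python
# Illustrative weighted scoring matrix for comparing data pipeline
# platforms. All criteria, weights and scores are made-up examples.
CRITERIA_WEIGHTS = {
    "scalability": 0.25,
    "managed_services": 0.20,
    "compliance_support": 0.20,
    "streaming_ingestion": 0.20,
    "cost": 0.15,
}

def weighted_score(scores: dict) -> float:
    """Combine per-criterion scores (1-5 scale) into one weighted total."""
    return sum(CRITERIA_WEIGHTS[c] * s for c, s in scores.items())

vendor_a = {"scalability": 4, "managed_services": 5,
            "compliance_support": 3, "streaming_ingestion": 4, "cost": 2}
vendor_b = {"scalability": 5, "managed_services": 3,
            "compliance_support": 4, "streaming_ingestion": 3, "cost": 4}

print(weighted_score(vendor_a))   # 3.7
print(weighted_score(vendor_b))   # 3.85
```

Making the weights explicit forces the evaluation team to agree on priorities up front, and the same matrix can be re-scored as vendors update their offerings.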
Enterprises should also educate themselves on available Hadoop distribution vendors. Each offers distinct features and options that provide customizations best suited for different compute environments.
Organizations now have more database cloud migration options available to them than ever before. To correctly design and implement the most appropriate cloud database architecture for their organization, big data specialists must evaluate and compare large data store ecosystems as well as individual products.