As it tries to extend its cloud computing lead, Amazon Web Services continues to fill out its data infrastructure -- this time with a new query service.
Athena, a data engine that performs SQL queries on data inside the Amazon Simple Storage Service, or S3, is the latest addition to an ever-growing cloud data lineup. Along with some competing packages, the software heralds more interactive querying of data on the cloud.
The pricing for Athena is simple: $5 per terabyte of data scanned in a query. Such pay-by-query pricing may become the norm on the cloud. Google, for example, has stated pricing of $5 per terabyte for its BigQuery analytics data warehouse service, an increasingly popular offering that provides capabilities similar to Athena's on the Google Cloud Platform.
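To make the pay-by-query arithmetic concrete, here is a minimal sketch in Python. It assumes a flat rate of $5 per terabyte scanned, as quoted above, and a decimal terabyte (10^12 bytes). The function name and the byte convention are illustrative assumptions, and the sketch ignores any minimum charge or rounding rules a provider might apply.

```python
PRICE_PER_TB_USD = 5.0   # rate quoted in the article
BYTES_PER_TB = 10 ** 12  # assumption: decimal terabyte


def query_cost_usd(bytes_scanned: int, price_per_tb: float = PRICE_PER_TB_USD) -> float:
    """Estimate the charge for one query from the bytes it scanned.

    Hypothetical helper: a flat per-terabyte rate with no minimums
    or rounding, which real billing may not match exactly.
    """
    if bytes_scanned < 0:
        raise ValueError("bytes_scanned must be non-negative")
    return bytes_scanned / BYTES_PER_TB * price_per_tb


if __name__ == "__main__":
    # A query that scans 200 GB of data in storage.
    print(f"${query_cost_usd(200 * 10 ** 9):.2f}")
```

Under these assumptions the cost depends only on how much data a query reads, so anything that reduces the bytes scanned reduces the bill in proportion.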
Athena works on data in place in S3, including CSV, JSON, ORC and Parquet formats. Like BigQuery, it is "serverless," existing only at runtime as a service and not requiring long-running infrastructure or ongoing management, according to Amazon Web Services (AWS).
Because it has little footprint, Athena's work could be described as spin-ups -- or transient jobs. It's there only when needed.
"With Athena, users will pay only for the queries, rather than for the underlying infrastructure or data integration services," said Matt Aslett, a research director at 451 Research.
Aslett cited potential advantages to querying data in cloud storage, in comparison with querying data stored in Hadoop cloud services or Amazon Elastic Compute Cloud. These advantages include a lower cost for storing data in S3 and ease of scaling.
Fit for fast ad hoc analysis
Amazon leaders described Athena as a complement to Redshift, the company's data warehouse in the cloud, and Elastic MapReduce (EMR), which is its clustered service for Hadoop- and Spark-style data processing. Both are intended to handle large analytics workloads, according to AWS CEO Andy Jassy.
Jassy spoke at this week's AWS re:Invent 2016 conference in Las Vegas. Two years ago at re:Invent, Amazon added the MySQL-compatible Aurora relational database to its product mix, aiming squarely at IBM's DB2, Microsoft's SQL Server and Oracle's 12c database. In addition to Athena, the company this week released a preview version of Aurora that's compatible with PostgreSQL, another open source database.
But the data management and analytics spotlight was primarily on Athena. "Redshift and EMR have made petabyte-scale analytics available to companies big and small. But some customers have to do ad hoc analytics jobs -- smaller jobs -- for data they want to query quickly," Jassy said. Thus, tactical or data discovery work may be Athena's best target.
Athena is generally available in the company's East and West regions, although the rollout will be staged. It will become available in other regions in the coming months, according to the company.
Underlying Athena is Presto, an open source distributed SQL query engine that originally rose out of Facebook's engineering operations. The software has also seen light at Netflix, Airbnb -- which took some part in helping Amazon forge Athena -- and other organizations. Presto is covered by an Apache Software Foundation license.
Athena and Presto could fall under the general heading of SQL-on-Hadoop tools, although such tools have come to support in-memory queries on data that may have never entered the realm of Hadoop at all.
Presto got an enterprise software steward last year, when data warehouse mainstay Teradata pledged Presto support, and it has a growing list of adherents. Aslett pointed to Qubole and Treasure Data as other supporters of the Presto approach.
Amazon's choice of Presto as the basis for the Athena data engine is "a significant endorsement of its suitability for standard SQL-based analysis of multiple data sources," Aslett said. Such tools are important, he added, because SQL skills are widely available within enterprises.
Also handling SQL-on-Hadoop queries is Drill from MapR Technologies. A new version was released this week, with improvements to address interactive query latency. Drill is an open source version of Dremel, a query technology that Google created and outlined in a 2010 research paper and that also underlies the BigQuery cloud service offered by Google itself. In the Microsoft Azure camp, there are distributed SQL query abilities the company gained last year when it purchased startup Metanautix, which offered software similarly inspired by Dremel.
Support for SQL on S3 has been on the rise. Presto has often been linked competitively with Impala, another open source query technology that was created by Hadoop vendor Cloudera, which recently released distributed Impala software that can run directly against data in Amazon S3.
Another Hadoop player, Hortonworks, which has emphasized improvements to Hive for faster SQL queries, this month released Hortonworks Data Cloud for AWS. It has improved integration with Amazon S3 and better support for what Hortonworks calls "ephemeral workloads" -- one-off jobs in which some data does not need to persist.
Just say 'no ETL'
Behind a general industry move to highly distributed SQL tools like Presto is a drive toward analyzing data where it resides, without first having to extract and load it into a database or data warehouse, according to Aslett and others.
What Amazon is offering with Athena is "no ETL," said Jake Stein, CEO and co-founder of Philadelphia-based Stitch Inc., an extract, transform and load (ETL) services provider formed earlier this year as a spin-off of RJMetrics when that firm was acquired by e-commerce provider Magento.
"No ETL," according to Stein, means the ETL process is supplanted by extract, load and transform (ELT), where data transformation happens in SQL as needed for downstream use instead of upfront during the loading stage. He admitted it might be surprising for an ETL firm like his to promote the notion of ELT, but explained the latter's benefits.
"With Athena, you extract the data from the sources, and then load it with no or minimal preprocessing. This style of ELT is a superior model for most use cases, because it results in a simpler architecture and gives analysts more visibility into how the raw data becomes transformed," said Stein, who spoke from re:Invent.
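The ELT pattern Stein describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the standard library's sqlite3 module stands in for a SQL engine, an in-memory CSV string stands in for raw files in storage, and the table, column and function names are hypothetical. It is not Athena's API. The raw rows are loaded untouched as text, and cleanup (type casting, case normalization, dropping blanks) happens in SQL at query time.

```python
import csv
import io
import sqlite3

# Hypothetical raw extract: inconsistent country casing and a blank amount.
RAW_CSV = """\
order_id,amount,country
1,19.99,us
2,5.00,DE
3,,us
"""


def load_raw(conn: sqlite3.Connection, csv_text: str) -> None:
    """Load step: store every field as text, with no preprocessing."""
    conn.execute(
        "CREATE TABLE raw_orders (order_id TEXT, amount TEXT, country TEXT)"
    )
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    conn.executemany(
        "INSERT INTO raw_orders VALUES (:order_id, :amount, :country)", rows
    )


def revenue_by_country(conn: sqlite3.Connection) -> list:
    """Transform step: clean and aggregate in SQL when the question is asked."""
    query = """
        SELECT UPPER(country) AS country_code,
               ROUND(SUM(CAST(amount AS REAL)), 2) AS revenue
        FROM raw_orders
        WHERE amount <> ''
        GROUP BY UPPER(country)
        ORDER BY country_code
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    load_raw(conn, RAW_CSV)
    for country_code, revenue in revenue_by_country(conn):
        print(country_code, revenue)
```

Because the raw table is preserved, an analyst can read the query to see exactly how the raw values become the reported figures, which is the visibility benefit Stein points to.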
As another sign that Amazon is intent on filling any gaps in its data-related offerings, the company also introduced managed ETL services at re:Invent. Known as AWS Glue, the service will crawl users' data sources, create a catalog and handle data transformation and scheduling. At present, interested users can request to take part in a controlled beta.
Beyond ETL, there is more in the way of disruption that products like the Athena data engine could bring to the status quo.
Its ability to use cloud storage, rather than Hadoop data stores, may lead some to see Athena as a threat to Hadoop in the cloud -- a move that has recently gained attention as Hadoop software vendors, with roots in on-premises computing, have moved to support S3, transient workloads and pay-as-you-go pricing.
But Aslett disagreed, saying both Hadoop and relational data warehouses can still offer throughput and latency advantages over approaches that analyze data in cloud storage.
"The launch of Athena doesn't mean the end of Hadoop on the AWS cloud," he said. "For longer-term and larger projects with complex query requirements, Redshift or EMR may be the logical choice."