Vendors of database, data warehouse and big data platforms are rushing to bulk up their cloud services offerings -- the same goes for developers of business intelligence and analytics tools. And a growing number of user organizations are signing on to store, manage and analyze data in cloud-based systems. But there's a mixed reaction when it comes to actually doing big data analytics and BI in the cloud. Some businesses are jumping in totally; many others are keeping one or both feet on the ground.
First Tech Federal Credit Union is gravitating toward the full cloud camp to help drive an analytics program it has been building up over the past two years. First Tech, the eighth-largest credit union in the U.S. with about 400,000 members and $8 billion in managed assets, planned in February to begin production use of a Hadoop cluster running in the Microsoft Azure cloud. The cluster, built around the Hadoop distribution from Hortonworks, will function as a data lake for collecting raw information from various internal and external sources and as a data warehouse for feeding refined data sets to departmental data marts, said Naveen Jain, director of digital analytics at the credit union.
The data marts are also in the cloud, running on a configuration of Microsoft's SQL Server database in Azure virtual machines -- a virtualization technology designed to give corporate IT teams increased control of their Azure environments. In the second half of this year, Jain expects to move a report server now located on premises into the cloud to make it easier for First Tech's branch managers to access analytical reports generated with Tableau's BI and data visualization tools. The cloud-based Tableau server also will be used to give customers online access to annual reports about their accounts, he explained after speaking at the 2016 TDWI Executive Summit in Las Vegas. In addition, the credit union is looking to tap Microsoft's Azure Machine Learning cloud analytics application to do predictive modeling aimed at helping it sell additional financial products to customers.
Data analysts in First Tech's business operations will continue to use Tableau's desktop BI software to analyze information in the data marts and build reports. But Jain said the credit union, which is based in Mountain View, Calif., and primarily serves employees of IT companies, hopes that doing analytics and BI in the cloud with the Hadoop platform and related technologies will lower its IT costs while providing increased scalability and flexibility. The cloud "improves your ability to react to business needs more quickly," he said, adding that he's confident Microsoft has built solid data security protections into Azure.
Electronic payments processor PayPal Holdings Inc. is also taking a cloud-heavy approach to data analytics, although it opted not to go the public cloud route. The San Jose, Calif., company runs the majority of its 1,500 front-end applications, including analytics and reporting programs, in a homegrown private cloud built around OpenStack, an open source cloud computing platform.
PayPal set up the private cloud five years ago. Jigar Desai, vice president of cloud and platforms at the company, said the OpenStack architecture has greatly improved IT efficiency, particularly when it comes to provisioning infrastructure resources for new applications. Previously, the process could take six to eight weeks, Desai said. He added that now, even as the cloud infrastructure is strained by the exploding demands of big data, his team typically can implement required computing resources in a single day.
But not all of PayPal's analytics applications are deployed in the cloud. Some of the company's most intensive data science work -- predictive modeling designed to identify potentially fraudulent payment transactions -- is done outside of the OpenStack setup to meet a business need for analytics speed.
Hui Wang, PayPal's senior director of global risk sciences, said cloud-based applications couldn't deliver the extremely fast performance her team requires. The risk sciences unit is primarily responsible for developing machine learning algorithms that analyze large volumes of historical payment data to build a picture of likely fraud; the fraud models are then applied to transactions in real time to try to stop scammers before payments go through. PayPal risks losing customers if transactions take too long to complete, so Wang said the entire analytics process has to wrap up in less than a second.
"We need very short turnaround times," she said, noting that her team analyzes "hundreds of thousands of data attributes" in each transaction and runs the results against a risk score generated by the machine learning algorithms. To meet the performance requirements, Wang and her colleagues use big data platforms such as Hadoop and the Spark processing engine, plus a mix of data management technologies and analytics tools from Teradata, Oracle and SAS Institute.
Cloud analytics hangs low
The overall cloud services market has become a big and rapidly growing business. In January, consulting and market research firm Gartner predicted that worldwide public-cloud revenue in the infrastructure-as-a-service segment will increase 38% this year to $22.4 billion. Sales of software-as-a-service (SaaS) applications will rise 20% to $37.7 billion, it forecast.
But data analytics and BI in the cloud still aren't mainstream, according to a survey of IT and business professionals conducted by TDWI in May 2015. Only 35% of 309 respondents said their organizations were using the cloud for data management or analytics purposes, and another 35% said they were considering it, according to a report on the survey that was released last October (see "Slowly Gathering in the Cloud").
Meanwhile, SaaS tools and business analytics/BI in the cloud capabilities ranked next-to-last and last, respectively, in importance to self-service BI initiatives among a group of 12 technologies that survey respondents were asked about. Fifty percent and 46%, respectively, cited those two items as being either very or somewhat important to their organizations; by comparison, the top three choices -- self-service data discovery, data visualization and self-service dashboard authoring -- all topped the 75% mark. TDWI analysts wrote in the report that the results reflect the immaturity of cloud analytics technologies and "continuing concern about security and governance of sensitive data and analytics [applications]" in the cloud.
Data security, privacy and governance issues have been some of the biggest roadblocks to public cloud adoption, particularly when customer records and other analytics data crucial to business success are involved. Other hurdles include the complexity and cost of moving data from on-premises systems to the cloud and the availability of excess processing capacity in corporate data centers.
Benefits outweigh risks
Goutham Belliappa, big data, integration and reporting practice leader at consultancy Capgemini, said that security shouldn't be as big a concern for prospective users as it was in the past, thanks to improved data protections added by cloud platform vendors over the years. He pointed out that the Central Intelligence Agency and the National Security Agency are big users of the Amazon Web Services (AWS) cloud, albeit in a private cloud region set up by the CIA for use by the federal intelligence community.
Belliappa said data governance in the cloud can be more of a challenge, especially for companies that do business in Europe, where data privacy laws are particularly stringent. On the whole, though, he thinks the challenges facing users are manageable: "While there are considerations for going to the cloud that need to be kept in mind, I think there are no reasons left for organizations not to go to the cloud."
And for companies that already do much or all of their data processing in the cloud, analyzing data there is a natural fit -- especially if they haven't first invested in on-premises data management and analytics systems.
"I can't think of a reason for me to go on-prem," said Gal Barnea, CTO at Eyeview Inc., a video advertising startup in New York. Eyeview, which serves up personalized video ads to consumers via the Web and mobile devices for corporate clients, implemented a cloud-based Spark platform from Databricks late last year to power machine learning applications and other types of analytics.
The ability to quickly scale up or down available system resources was the primary benefit that attracted Barnea to the cloud. That's important to him because the processing capacity his team needs can change dramatically from month to month and even throughout the course of a week.
On average, Eyeview crunches about 1.5 terabytes of website data, retail-store purchase records and other info each day to evaluate 15 to 20 billion ad opportunities for its clients. But given the seasonal nature of the company's business, certain times of the year are much busier than others. For example, processing volumes in the lead-up to Black Friday and the traditional holiday sales season are huge; they tend to be much lighter in the summer months. Storing and analyzing the data in the AWS cloud, which is where the Spark platform runs, lets Barnea add processing resources when things pick up and shut them down to save money when they're no longer needed.
"The cloud allows my costs to be very much tied to the actual business," he said. "If Q4 is large for the company, my engineering costs are also large. But in Q2, which is not strong for advertising, my costs are also small."
View from the ground
By comparison, analytics processes are much more grounded in microprocessor giant Intel Corp.'s sales and marketing operations. They use an on-premises Hadoop cluster, based on Cloudera's distribution of the technology, to pull together a variety of internal and external data for analysis by business analysts and other end users via self-service BI tools. David Schaefer, chief BI architect in the sales and marketing IT group at Intel, said a couple of the business units his group serves are just starting to embrace cloud BI software.
Using the cloud could give Intel's sales teams an easier way to access BI data when they're on the road meeting with customers, said Schaefer, who also spoke at the TDWI conference in Las Vegas. He added that some BI and analytics vendors are enhancing their SaaS tools more quickly than their on-premises software. Like other IT managers, he also cited the cloud's potential for reducing resource requirements and simplifying technology upgrades.
But the Intel IT group needs to be "very rigorous" about vetting the security protections in cloud tools, Schaefer said. He also noted that there could be costs associated with moving data to the cloud for analysis if it becomes necessary to do so. For now, at least, the sales and marketing organization is only pursuing some "small, focused efforts," he said. "It's not like we're shifting everything to the cloud."
Executive editor Craig Stedman also contributed to this story.