Big data systems, for some companies, aren't just platforms for new types of data processing and analytics applications -- they're the driving force behind entirely new business strategies.
That's the case at iPass Inc., which is using a big data environment to fuel a strategic shift from pay-for-use Wi-Fi access to tools for managing and optimizing mobile connectivity for corporate users. Introduced in late 2015, the company's iPass SmartConnect software includes algorithms that identify Wi-Fi access points and rank them on performance so mobile users can connect to the fastest and most reliable hotspots available. That marks a big change from when iPass gave users a static list of hotspots.
And it wouldn't be possible without the underlying data management platform built around the Spark processing engine, said Tomasz Magdanski, director of big data and analytics at iPass. "We do need the big data architecture, 100%," he explained. "There's no way we could crunch all this data in real time and do all the ranking and measurements without it."
The path to deploying the architecture wasn't completely smooth. The Redwood Shores, Calif., company initially used an on-premises Hadoop and Spark cluster, but Magdanski said it ran into scalability and system maintenance problems that held back the production launch of the cloud-based SmartConnect offering. To get things back on track, iPass switched in mid-2016 to a managed set of Spark clusters from Databricks running in the Amazon Web Services (AWS) cloud. That enabled it to start putting the components of SmartConnect into production early this year, according to Magdanski.
One Spark cluster runs extract, transform and load jobs on data collected from wireless hotspots around the world, currently averaging 25 to 30 million records daily. The various SmartConnect algorithms, including ones that analyze access speeds and quality of service at hotspots, run on their own small clusters -- part of a strategy to separate processing jobs to avoid dependencies or conflicts, Magdanski said. The processed data is stored in the Amazon Simple Storage Service (S3), with an open source Cassandra database at the front end to provide information to network administrators.
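To make the ranking idea concrete, here is a minimal Python sketch of how hotspots might be ordered by speed and reliability. It is not iPass's algorithm, which the article does not describe in detail. The `Measurement` fields, the equal weighting and the sample hotspot names are all assumptions made for illustration.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Measurement:
    """One hypothetical observation reported for a hotspot."""
    hotspot_id: str
    throughput_mbps: float  # measured access speed
    connected: bool         # whether the connection attempt succeeded


def rank_hotspots(measurements, speed_weight=0.5):
    """Return hotspot IDs ordered from best to worst.

    The score blends normalized average speed with the share of
    successful connections. The 50/50 weighting is an assumption.
    """
    by_hotspot = {}
    for m in measurements:
        by_hotspot.setdefault(m.hotspot_id, []).append(m)

    # Normalize speeds against the fastest observation; avoid dividing by zero.
    max_speed = max((m.throughput_mbps for m in measurements), default=0.0) or 1.0

    scores = {}
    for hotspot_id, items in by_hotspot.items():
        speed = mean(m.throughput_mbps for m in items) / max_speed
        reliability = mean(1.0 if m.connected else 0.0 for m in items)
        scores[hotspot_id] = speed_weight * speed + (1 - speed_weight) * reliability

    return sorted(scores, key=lambda h: scores[h], reverse=True)


sample = [
    Measurement("cafe", 50.0, True),
    Measurement("cafe", 40.0, True),
    Measurement("airport", 100.0, True),
    Measurement("airport", 0.0, False),
    Measurement("lobby", 10.0, True),
]
print(rank_hotspots(sample))
```

In this sample the fastest single reading belongs to the airport hotspot, but its failed connection pulls it below the slower, more dependable ones. That is the kind of trade-off a static hotspot list cannot capture.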
SmartConnect hasn't immediately boosted business at iPass. Indeed, the company reported an 18% year-to-year revenue decline in the second quarter of 2017, partly due to delays in expected deployments of the new software. But there are deals in the pipeline, Magdanski noted. In addition, his team is developing data products, fed by the big data systems, to sell as a new revenue stream to mobile-network operators, ad platform vendors and owners of hotels and other venues that host access points.
Stepping beyond crawling
RiskIQ Inc. also has tapped a big data architecture to broaden its business horizons. Adam Hunt, RiskIQ's chief data scientist, said that when he joined the San Francisco company in early 2014, it primarily did real-time web crawling to look for websites at imminent risk of being attacked. But within a year, the company built a data warehouse and analytics environment on top of an existing Hadoop cluster, enabling the development of new security products that take advantage of all the data collected during the crawls.
"We're able to leverage our passive data sets to a much greater degree than we could before," Hunt said. "It has really changed everyone's perspective on what we're capable of doing with our data." And that, he noted, "has changed the direction of our business."
RiskIQ had deployed the cluster, based on the Hadoop distribution from MapR Technologies, in 2012. But the system initially was used only to store files of raw data that could be delivered to customers looking to analyze it themselves, Hunt said. Now, the Hadoop platform underpins a set of analytics applications designed to help organizations inventory and monitor their websites, mobile apps and other internet-connected assets; identify external security threats; and investigate and respond to attacks.
On a daily basis, RiskIQ crawls as many as 20 million webpages and pulls up to 25 TB of data from them into the cluster, which is housed in a colocation facility. To save on storage space, the company converts the raw data into Apache Parquet files that are one-tenth the size of the originals, Hunt said. Even so, he added, the cluster holds about 500 TB of data in MapR's proprietary file system and the HBase database; the jobs running in it also feed indexes with 3 TB of data in a separate cluster that runs the Solr search server. In addition, 2 petabytes of older data sets are archived in an S3 repository in the AWS cloud.
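As a back-of-the-envelope illustration of that storage saving, the short Python sketch below applies the one-tenth ratio to the upper daily crawl volume. The function name is hypothetical, and the result is an estimate only. Actual Parquet sizes depend on the data and on compression settings.

```python
def parquet_footprint_tb(raw_tb: float, size_ratio: float = 0.1) -> float:
    """Estimate stored size after conversion, assuming a fixed size ratio."""
    return raw_tb * size_ratio


raw_daily_tb = 25.0  # upper bound of the daily crawl volume cited above
stored_daily_tb = parquet_footprint_tb(raw_daily_tb)
saved_daily_tb = raw_daily_tb - stored_daily_tb

print(f"Stored per day: about {stored_daily_tb:.1f} TB")
print(f"Saved per day: about {saved_daily_tb:.1f} TB")
```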
Instructions not included
Deployments of big data systems often stall due to their complexity, Gartner analyst Merv Adrian said, pointing to 2017 survey results showing that only 17% of Hadoop projects had reached active use. Inquiries about Hadoop and data lakes from corporate IT teams are getting more specific about implementation issues, Adrian added during a presentation at the 2017 Pacific Northwest BI & Analytics Summit. He likened building a big data architecture to putting together a jigsaw puzzle, "but the problem is you don't have the box to see how it's supposed to look."
TMW Systems Inc. felt some of that pain due to a lack of big data skills when it started deploying a Hadoop environment in 2015. "We had to start from scratch," said Timothy Leonard, who was brought in to lead the project as TMW's executive vice president of operations and technology. "The talent wasn't there when I got here. Early on, I spent most of my time just teaching people about [big data] concepts."
But TMW, a vendor of transportation management software for trucking companies, began using the platform in early 2016 to support a new set of analytics applications that it's offering to customers. Previously, carriers could only access their own data for analysis, Leonard said, adding that now they can see aggregated industry data on pricing, fuel use and other parameters.
Operational data from carriers that use TMW's software is pulled out of the Mayfield Heights, Ohio, company's ERP systems into the big data architecture, which is based on the Hortonworks distribution of Hadoop and split between an on-premises installation and Microsoft's Azure cloud. Leonard said the data, currently amounting to several hundred terabytes, is stored in a set of HBase tables with up to 9,000 columns each. The wide tables enable users of the analytics applications to "ask really any question they want to" about the data, he explained. Multiple tables can also be joined together for querying -- for example, to combine data on fuel consumption and road conditions.
A need for processing speed
Network security startup ProtectWise Inc. built its business from the get-go around a group of big data systems that currently collect about 10 billion records of operational data from corporate networks daily. The data is used for both real-time and historical analysis of security threats, and the Denver company commits in its customer contracts that it won't miss or lose any records.
Capturing, processing and analyzing all of the network data as it's generated 24/7 "would be physically impossible without this architecture," ProtectWise co-founder and CTO Gene Stevens said. "We would never have gotten off the ground."
The cloud-based big data architecture is centered on the DataStax Enterprise (DSE) implementation of Cassandra, which powers the real-time analytics routines aimed at detecting attacks in progress against networks. The NoSQL database also stores indexes pointing to petabytes of historical data kept in S3, typically for a year. ProtectWise uses Spark's Structured Streaming module to analyze that data for insight on previous security events and attack indicators, tapping into the DSE indexes to find relevant data sets. Also, Solr is linked to DSE for use by customers running their own analytical queries.
Big data ingestion tough to swallow
The most challenging part of big data applications might not be running analytics algorithms against voluminous data sets. Instead, it could be the more prosaic task of ingesting the data in the first place.
That's the case at TMW Systems Inc. The transportation management software vendor pulls a mix of structured and unstructured data into its Hadoop cluster, including "tons of sensor data," said Timothy Leonard, TMW's executive vice president of operations and technology. Other types of data platforms could perhaps handle analytical processing, he added, "but there was nothing else out there I could find that could do the ingestions."
To reduce the risk of performance problems, big data teams should "think well ahead" on deployments, particularly in the case of extract, transform and load jobs for data ingestion, said Tomasz Magdanski, director of big data and analytics at mobile connectivity company iPass Inc. "Always build for scale," he recommended.
That was seconded by Gene Stevens, co-founder and CTO at ProtectWise Inc. Being able to pull growing amounts of data into the network security vendor's big data architecture without any missteps was a top priority. "We knew we had to get this hardened," Stevens explained. "We couldn't fail to ingest data."
At peak operation levels, the DSE system processes about 6 million transactions per second, Stevens said. ProtectWise also uses a homegrown processing engine written in Scala with the Akka toolkit to feed data requiring high throughput rates into DSE and then pull it back out for analysis; that system processes another 1 billion transactions daily, he added.
Putting the architecture in place took some doing. ProtectWise, which was founded in 2013, began working with the open source version of Cassandra but switched to DSE to speed up the deployment and get built-in Solr integration. Even then, it ran into functionality issues in the database, particularly with the Solr ties, Stevens said. The DSE and S3 setup went live in mid-2014 after the issues were resolved, but Spark couldn't meet the company's processing needs, requiring development of the homegrown engine. Spark was finally added to the mix in early 2016.
Now, data ingestion throughput "is only bounded by network latency," Stevens said. And he thinks the big data systems will have staying power as ProtectWise looks to grow both its business and the volumes of data it processes. "Not that we don't have to watch it or manage it," he acknowledged, "but we're pretty convinced that this technology will continue to scale for us into the future."