Big data is often characterized by the 3Vs: the extreme volume of data, the wide variety of data types and the velocity at which the data must be processed. Although big data doesn't equate to any specific volume of data, the term is often used to describe terabytes, petabytes and even exabytes of data captured over time.
Breaking down the 3Vs of big data
Such voluminous data can come from myriad different sources, such as business sales records, the collected results of scientific experiments or real-time sensors used in the internet of things. Data may be raw or preprocessed using separate software tools before analytics are applied.
Data may also exist in a wide variety of formats, including structured data, such as SQL database stores; unstructured data, such as document files; or streaming data from sensors. Further, big data may involve multiple, simultaneous data sources, which may not otherwise be integrated. For example, a big data analytics project may attempt to gauge a product's success and future sales by correlating past sales data, return data and online buyer review data for that product.
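The product example above can be sketched in a few lines of Python. This is only an illustration of combining sources that are not otherwise integrated: the record layouts, the sample values and the summarize function are hypothetical, and a real project would read from separate systems at far larger scale.

```python
from collections import defaultdict

# Hypothetical sample records standing in for three separate sources.
sales = [("widget", 120), ("widget", 80), ("gadget", 50)]   # (product, units sold)
returns = [("widget", 10), ("gadget", 20)]                  # (product, units returned)
reviews = [("widget", 5), ("widget", 4), ("gadget", 2)]     # (product, star rating)


def summarize(sales, returns, reviews):
    """Combine three independent sources into one per-product summary."""
    summary = defaultdict(lambda: {"sold": 0, "returned": 0, "ratings": []})
    for product, units in sales:
        summary[product]["sold"] += units
    for product, units in returns:
        summary[product]["returned"] += units
    for product, stars in reviews:
        summary[product]["ratings"].append(stars)

    result = {}
    for product, s in summary.items():
        ratings = s["ratings"]
        result[product] = {
            "sold": s["sold"],
            # Guard against products with returns or reviews but no sales.
            "return_rate": s["returned"] / s["sold"] if s["sold"] else None,
            "avg_rating": sum(ratings) / len(ratings) if ratings else None,
        }
    return result


product_summary = summarize(sales, returns, reviews)
```

Here product_summary holds one entry per product with units sold, return rate and average rating, which is the kind of joined view an analyst would then use to judge a product's success.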
Finally, velocity refers to the speed at which big data must be analyzed. Every big data analytics project will ingest, correlate and analyze the data sources, and then render an answer or result based on an overarching query. This means human analysts must have a detailed understanding of the available data and possess some sense of what answer they're looking for.
Velocity also matters as big data analysis expands into fields like machine learning and artificial intelligence, where analytical processes mimic perception by finding and using patterns in the collected data.
Big data infrastructure demands
The need for big data velocity imposes unique demands on the underlying compute infrastructure. The computing power required to quickly process huge volumes and varieties of data can overwhelm a single server or server cluster. Organizations must apply adequate compute power to big data tasks to achieve the desired velocity. This can potentially demand hundreds or thousands of servers that can distribute the work and operate collaboratively.
Achieving such velocity in a cost-effective manner is also a headache. Many enterprise leaders are reluctant to invest in an extensive server and storage infrastructure that might only be used occasionally to complete big data tasks. As a result, public cloud computing has emerged as a primary vehicle for hosting big data analytics projects. A public cloud provider can store petabytes of data and spin up thousands of servers just long enough to accomplish the big data project. The business only pays for the storage and compute time actually used, and the cloud instances can be turned off until they're needed again.
To improve service levels even further, some public cloud providers offer big data capabilities, such as highly distributed Hadoop compute instances, data warehouses, databases and other related cloud services. Amazon Web Services' Elastic MapReduce (Amazon EMR) is one example of a big data service in a public cloud.
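The split-and-combine pattern behind MapReduce-style services can be illustrated with a small word count. This is a minimal sketch, not how Hadoop or any cloud service is implemented: threads on one machine stand in for separate servers, and the function names map_chunk, reduce_counts and word_count are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def map_chunk(lines):
    """Map step: count the words in one chunk of the input."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def reduce_counts(partials):
    """Reduce step: merge the per-chunk counts into one total."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def word_count(lines, workers=4):
    # Split the input into roughly equal chunks, one per worker.
    chunk_size = max(1, -(-len(lines) // workers))  # ceiling division
    chunks = [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]
    # Threads stand in for servers here; a real cluster would ship each
    # chunk to a different machine.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    return reduce_counts(partials)


sample = [
    "big data is big",
    "small data is for people",
    "big data is for machines",
]
totals = word_count(sample, workers=2)
```

Each worker processes its chunk independently, and only the small per-chunk summaries are merged at the end. That independence is what lets the same job spread across hundreds or thousands of servers.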
The human side of big data analytics
Ultimately, the value and effectiveness of big data depend on the human operators tasked with understanding the data and formulating the proper queries to direct big data projects. Some big data tools meet specialized niches and allow less technical users to make various predictions from everyday business data. Still other tools, such as Hadoop appliances, are appearing to help businesses implement a suitable compute infrastructure to tackle big data projects, while minimizing the need for hardware and distributed compute software know-how.
But these tools only address limited use cases. Many other big data tasks, such as determining the effectiveness of a new drug, can require substantial scientific and computational expertise from the analytical staff. There is currently a shortage of data scientists and other analysts who have experience working with big data in a distributed, open source environment.
Big data can be contrasted with small data, another evolving term that's often used to describe data whose volume and format can be easily used for self-service analytics. A commonly quoted axiom is that "big data is for machines; small data is for people."