The 5 V's of big data (velocity, volume, value, variety and veracity) are the five main and innate characteristics of big data. Knowing the 5 V's allows data scientists to derive more value from their data while also allowing their organizations to become more customer-centric.
In the early part of this century, big data was only talked about in terms of the three V's -- volume, velocity and variety. Over time, two more V's (value and veracity) have been added to help data scientists be more effective in articulating and communicating the important characteristics of big data. The number five mirrors the five basic questions every news article should answer. However, organizations are not required to follow one of these guidelines over the other.
What is big data?
Big data is a combination of unstructured, semi-structured or structured data collected by organizations. This data can be mined to gain insights and used in machine learning projects, predictive modeling and other advanced analytics applications.
Big data can be used to improve operations, provide better customer service and create personalized marketing campaigns -- all of which increase value. As an example, big data can provide companies with valuable insights into their customers that can then be used to refine marketing techniques to increase customer engagement and conversion rates.
Organizations in fields such as medicine and energy use big data. In medicine, big data may be used to identify disease risk factors or to help doctors diagnose illnesses in patients. Energy companies might use big data to track electrical grids, manage risk or analyze market data in real time.
Organizations that use big data have a potential competitive advantage over those that don't since they can make faster and more informed business decisions based on that data.
Volume, the first of the 5 V's of big data, refers to the amount of data that exists. Volume is like the base of big data, as it is the initial size and amount of data that is collected. If the volume of data is large enough, it can be considered big data. What is considered to be big data is relative, though, and will change depending on the available computing power that's on the market.
The next of the 5 V's of big data is velocity. It refers to how quickly data is generated and how quickly that data moves. This is an important aspect for companies that need their data to flow quickly, so it's available at the right times to make the best business decisions possible.
An organization that uses big data will have a large and continuous flow of data that is being created and sent to its end destination. Data could flow from sources such as machines, networks, smartphones or social media. This data needs to be digested and analyzed quickly, and sometimes in near real time.
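One rough way to put a number on velocity is to measure how fast records arrive. The sketch below is only an illustration: the function name events_per_second and the sample readings are hypothetical, and it assumes each record carries an arrival timestamp.

```python
from datetime import datetime, timedelta


def events_per_second(timestamps):
    """Average arrival rate, measured as intervals between events per second."""
    if len(timestamps) < 2:
        return 0.0
    span = (max(timestamps) - min(timestamps)).total_seconds()
    if span == 0:
        return 0.0  # all events share one timestamp; no rate can be derived
    return (len(timestamps) - 1) / span


# Hypothetical sample: a device that reports a reading every 250 milliseconds.
start = datetime(2024, 1, 1, 12, 0, 0)
readings = [start + timedelta(milliseconds=250 * i) for i in range(9)]

rate = events_per_second(readings)
print(f"{rate} events per second")
```

A real pipeline would more likely compute this over a sliding window so that bursts and slowdowns show up as they happen.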
As an example, in healthcare, there are many medical devices made today to monitor patients and collect data. From in-hospital medical equipment to wearable devices, collected data needs to be sent to its destination and analyzed quickly.
In some cases, however, it may be better to have a limited set of collected data than to collect more data than an organization can handle -- since this can lead to slower data velocities.
The third V in the 5 V's of big data is variety. Variety refers to the diversity of data types. An organization might obtain data from a number of different data sources, which may vary in value. Data can come from sources both inside and outside an enterprise. The challenge in variety concerns the standardization and distribution of all data being collected.
Collected data can be unstructured, semi-structured or structured in nature. Unstructured data is data that is unorganized and comes in different files or formats. Typically, unstructured data is not a good fit for a mainstream relational database because it doesn't fit into conventional data models. Semi-structured data is data that has not been organized into a specialized repository but has associated information, such as metadata. This makes it easier to process than unstructured data. Structured data, meanwhile, is data that has been organized into a formatted repository. This means the data is made more addressable for effective data processing and analysis.
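The three categories can be illustrated with a small sketch. It assumes CSV as a stand-in for structured data, JSON for semi-structured data and free text for unstructured data; the sample values and the helper field_names are hypothetical.

```python
import csv
import io
import json

# Structured: fixed columns, and every row has the same fields.
structured = "patient_id,heart_rate\n101,72\n102,88\n"
rows = list(csv.DictReader(io.StringIO(structured)))

# Semi-structured: keys describe the data, but fields can differ per record.
semi_structured = (
    '[{"patient_id": 101, "device": "watch"},'
    ' {"patient_id": 102, "notes": {"allergy": "penicillin"}}]'
)
records = json.loads(semi_structured)

# Unstructured: free text with no schema, so only crude processing
# (here, a word count) is possible without further tooling.
unstructured = "Patient reports mild headache after the morning dose."
word_count = len(unstructured.split())


def field_names(items):
    """Collect the union of top-level keys seen across records."""
    keys = set()
    for item in items:
        keys.update(item)
    return sorted(keys)


print(rows[0]["heart_rate"])   # values read from CSV arrive as strings
print(field_names(records))    # the fields vary from record to record
print(word_count)
```

The structured rows can be queried by column name right away, while the semi-structured records first need their differing fields reconciled. That reconciliation is the standardization challenge described above.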
Veracity is the fourth V in the 5 V's of big data. It refers to the quality and accuracy of data. Gathered data could have missing pieces, may be inaccurate or may not be able to provide real, valuable insight. Veracity, overall, refers to the level of trust there is in the collected data.
Data can sometimes become messy and difficult to use. A large amount of data can cause more confusion than insight if it's incomplete. For example, in the medical field, if data about what drugs a patient is taking is incomplete, the patient's life may be endangered.
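A simple completeness check is one way to put a number on this kind of problem. The following is a minimal sketch: the completeness function, the field names and the medication records are all hypothetical, and completeness is only one facet of veracity.

```python
def completeness(records, required_fields):
    """Fraction of records with a non-empty value for every required field."""
    if not records:
        return 0.0
    complete = sum(
        1
        for record in records
        if all(record.get(field) not in (None, "") for field in required_fields)
    )
    return complete / len(records)


# Hypothetical medication records; two of the four are incomplete.
medications = [
    {"patient_id": 101, "drug": "aspirin", "dose_mg": 100},
    {"patient_id": 102, "drug": "", "dose_mg": 50},        # drug name is blank
    {"patient_id": 103, "drug": "ibuprofen"},              # dose is missing
    {"patient_id": 104, "drug": "metformin", "dose_mg": 500},
]

score = completeness(medications, ["patient_id", "drug", "dose_mg"])
print(f"{score:.0%} of records are complete")
```

A check like this only shows that values are present, not that they are accurate. Judging accuracy would require comparing the data against a trusted source.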
Both value and veracity help define the quality and insights gathered from data.
The last V in the 5 V's of big data is value. This refers to the value that big data can provide, and it relates directly to what organizations can do with that collected data. Being able to pull value from big data is a requirement, as the value of big data increases significantly depending on the insights that can be gained from it.
Organizations can use the same big data tools to gather and analyze the data, but how they derive value from that data should be unique to them.