The answers you get to the question “What is big data?” will typically depend on the perspective of whoever you’re asking. During the late 1990s and early 2000s, when the term first came to prominence, a quantitative definition of big data might have described it as any piece or set of information greater than a gigabyte (1 GB) in size. These days, that amount of information could comfortably sit on a memory chip the size of your thumbnail, in an age where big data is reckoned in terms of petabytes, exabytes, and zettabytes of information.
A more subjective definition might describe it in terms of the huge volume of information being continuously generated by people, technology, and transactions, the velocity with which it’s appearing (along with the speed with which it needs to be processed and analyzed), and the vast variety of sources that contribute to it.
Looking at big data from a qualitative perspective – and taking into account that structured, unstructured, and semi-structured information sources contribute to the world’s data store – it’s possible to define it as information that’s so extensive, vast, or complex that it’s difficult or impossible to process using traditional methods and technology.
What Is Big Data in Simple Terms?
Big data refers to huge volumes of information that are so varied and unpredictable in form that they can only be managed and analyzed using specialist techniques and technology. Data sets can run to petabytes or more in size.
The core characteristics that you’ll find in most of the literature are the three “Vs”: volume, velocity, and variety.
Volume relates to the sheer size of the data sets involved. With information coming from business transactions, smart devices (IoT or Internet of Things), industrial equipment, videos, social media, and other streams, it’s now commonplace to measure big data in terms of petabytes (1,024 terabytes) or exabytes (1,024 petabytes) of information and even larger denominations, running to billions or even trillions of records.
How Many GB Is Big Data?
“Big data” is a term relative to the available computing and storage power on the market — so in 1999, one gigabyte (1 GB) was considered big data. Today, it may consist of petabytes (1,024 terabytes) or exabytes (1,024 petabytes) of information, including billions or even trillions of records from millions of people.
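To make the scale concrete, the unit arithmetic above can be sketched in a few lines of Python. This is a minimal illustration that assumes the binary convention used in this article (1,024 of each unit per step); the function names are made up for the example.

```python
# Binary storage units, following the article's convention of 1,024 per step.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def to_bytes(value: float, unit: str) -> float:
    """Convert a quantity in the given unit to bytes (1,024 per step)."""
    return value * 1024 ** UNITS.index(unit)

def convert(value: float, from_unit: str, to_unit: str) -> float:
    """Convert a quantity between any two units listed in UNITS."""
    return to_bytes(value, from_unit) / to_bytes(1, to_unit)

print(convert(1, "PB", "GB"))  # 1048576.0
print(convert(1, "EB", "GB"))  # 1073741824.0
```

In other words, under this convention a single petabyte is a little over a million gigabytes, and an exabyte is a little over a billion.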
Velocity refers both to the rate at which new information is being generated and to the speed desired or necessary for it to be processed for timely and relevant insights to become available. With mission-critical data coming from RFID (Radio Frequency Identification) tags, connected sensors, smart meters, and the like, the velocity of processing and analysis of the data often needs to be real-time.
Variety is an indication of the multitude and diversity of information sources that make up big data. This runs the gamut from numerical data in traditional databases, through multimedia streams of audio and video, to financial transactions, text documents, emails, and metadata (information about information).
Structured data takes a standard format capable of representation as entries in a table of columns and rows. This kind of information requires little or no preparation before processing and includes well-defined fields like age, contact names, addresses, and debit or credit card numbers.
Unstructured data is more difficult to quantify and generally needs to be translated into some form of structured data for applications to understand and extract meaning from it. This typically involves methods like text parsing, natural language processing, and developing content hierarchies via taxonomy. Audio and video streams are common examples.
Semi-structured data falls somewhere between the two extremes and often consists of unstructured data with metadata attached to it, such as timestamps, location, device IDs, or email addresses.
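As a rough illustration of the text parsing mentioned above, the following Python sketch turns free-text notes into structured rows. The sample notes, the field names, and the pattern are all invented for the example; real unstructured sources usually need far more robust techniques, such as natural language processing.

```python
import re

# Hypothetical free-text notes; wording and fields are invented for illustration.
notes = [
    "Order 1042 shipped to Lisbon on 2021-03-04",
    "Order 1043 shipped to Osaka on 2021-03-05",
    "Customer called about a late delivery",
]

# Named groups become the columns of the structured output.
PATTERN = re.compile(
    r"Order (?P<order_id>\d+) shipped to (?P<city>\w+) on (?P<date>\d{4}-\d{2}-\d{2})"
)

def parse(lines):
    """Return (rows, unparsed): structured rows plus lines the pattern could not handle."""
    rows, unparsed = [], []
    for line in lines:
        match = PATTERN.search(line)
        if match:
            row = match.groupdict()
            row["order_id"] = int(row["order_id"])
            rows.append(row)
        else:
            unparsed.append(line)
    return rows, unparsed

rows, unparsed = parse(notes)
print(rows)
print(unparsed)
```

The third note does not fit the pattern and is returned separately, which mirrors the point that unstructured data often needs additional work before an application can use it.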
Besides the characteristics described above, data analysts also need to consider the ingestion (taking in), harmonization (cleaning and balancing), analysis, visualization, and democratization (making readily available to all relevant stakeholders) of data sources and of the results of big data analysis.
Why Is Big Data Important?
Information is an integral component of our daily lives, from the answers we get to our queries and searches online, to the databases underlying the operations of the essential services and businesses we deal with, to the algorithms helping to regulate our transport, social media, and the delivery of utilities.
In all cases, data management helps to move big data efficiently to where it’s needed. Data analytics enables scientists to build statistical models and visualizations to help make sense of it, and domain knowledge provides the expertise to interpret the results from data analysis.
The impact of big data on society may be assessed in terms of the numerous areas where it affects communities, economies, and the environment: weather prediction, forecasting natural disasters, urban and community planning, traffic management, logistics, and healthcare are just some of the areas that determine why it is important.
The importance of big data to commercial organizations is now synonymous with the importance of business analytics: the science of collecting and analyzing business intelligence (BI) and other relevant information to extract insights that can inform decision-making processes, boost operational and cost efficiencies, create a better understanding of markets and consumers, and ultimately improve the bottom line.
Organizations that have figured out how to implement data analytics successfully follow a number of best practices, including:
- Making an assessment of how the data and analytics can best serve the enterprise.
- Deciding on key metrics that are meaningful to the business, and which will simplify data visualizations while allowing executives to focus on relevant values.
- Adopting data models in which calculated fields are consistent, creating hierarchies that allow users to drill down into the data for specifics.
- Creating relevant visualization dashboards, logically ordered, and simple enough so that users aren’t overwhelmed by deluges of information.
- Choosing appropriate data management tools, expertise, and/or technical partners.
Though the term has been part of our general vocabulary since at least 2005, when Roger Mougalas from O’Reilly Media is said to have coined it (other sources credit John R. Mashey of Silicon Graphics, in the early 1990s), the history of big data actually has its roots in antiquity. For example, in the third century BC, the Ptolemaic rulers of Egypt, successors to Alexander the Great, tried to capture all existing stores of information in the Library of Alexandria. Later, military scientists of the Roman Empire would analyze field statistics to determine the optimal distribution for their armies.
Both were attempts to aggregate and organize huge repositories of information relevant to the business of the day so that experts and scholars could analyze this data and apply it for practical purposes.
In the modern era, the evolution can be roughly subdivided into three main phases.
During Phase One, emerging database management systems (DBMS) gave rise to data warehousing and the first generation of data analytics applications, which employed techniques such as database queries, online analytical processing, and standard reporting tools. This first phase was heavily reliant on the kind of storage, extraction, and optimization techniques that are common to Relational Database Management Systems (RDBMS).
Phase Two began in the early 2000s, fueled by the data collection and data analysis opportunities offered by the evolving internet and World Wide Web. HTTP-based web traffic generated a massive increase in semi-structured and unstructured data, requiring organizations to find new approaches and storage solutions to deal with these new information types to analyze them effectively. Companies like Yahoo, Amazon, and eBay started to draw insights from customer behavior by analyzing click-rates, IP-specific location data, and search logs. Meanwhile, the proliferation of social media platforms presented new challenges to the extraction and analysis of their unique forms of unstructured data.
Phase Three is being largely driven by the spread of mobile and connected technologies. Behavioral and biometric data, together with location-based information sources such as GPS and movement tracking, are opening up new possibilities and creating fresh challenges for effective data gathering, analysis, and usage. At the same time, the sensor-based and internet-enabled devices and components of the Internet of Things (IoT) are generating enormous volumes of data, and fueling innovations in the race to extract meaningful and valuable information from these new data sources.
What Are the 7 “Vs” of Big Data?
In addition to the more commonly known 3 “Vs” (Volume, Velocity, Variety), there are 4 more “Vs” that are equally important (Variability, Veracity, Visualization, Value). The seven “Vs” summarize the concepts underlying the immense amounts of information that organizations now routinely have to deal with and illustrate why it’s necessary to capture, store, and analyze this complex resource. They are:
Volume: The amount of data available. Once expressed in megabytes (MB) or gigabytes (GB), it is now typically measured in petabytes (PB), zettabytes (ZB), or even yottabytes (YB). The Internet of Things (IoT) is now contributing immense amounts of data through connected technologies and smart sensors, and the volume of data in the world is projected to double every two years.
Velocity: The speed with which data becomes accessible can mean the difference between success and failure. In today’s economy, data velocity has to be as close to real-time as possible to fuel analytics and instantaneous or near-instantaneous responses to market conditions.
Variety: Big data consists of structured, semi-structured, and unstructured information, with the last category including diverse sources such as audio, video, and SMS text messages. It’s estimated that 90% of today’s information is generated in an unstructured manner – and these different kinds of data require different types of analysis.
Variability: Differences in perception and relevance can give the same data set a different meaning when viewed from different perspectives. This variability requires algorithms to understand the context and decode the exact meaning of every record in its specific environment.
Veracity: This refers to the reliability and accuracy of the information available. Besides allowing for greater utilization due to their higher quality, data sets with high veracity are particularly important to organizations whose business centers on information interpretation, usage, and analysis.
Visualization: Any data that’s collected and analyzed has to be understandable and easy to read if it’s to be of any use to all stakeholders in an organization. Visualization of data through charts, graphs, and other media makes information more accessible and comprehensible.
Value: This is a measure of the return resulting from data management or analysis. Handling big data requires a considerable investment in time, energy, and resources – but if it’s done properly, the resulting value can yield considerable profit and competitive advantages for the enterprise.
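The doubling projection quoted under Volume implies compound growth, which is easy to underestimate. As a small worked sketch, assuming a fixed two-year doubling period and an arbitrary starting volume of 1 unit (both the function name and the starting value are illustrative):

```python
def projected_volume(current: float, years: float, doubling_period: float = 2.0) -> float:
    """Project a data volume forward, assuming it doubles every `doubling_period` years."""
    return current * 2 ** (years / doubling_period)

# Starting from 1 unit of data (any unit), project 2, 4, and 10 years ahead.
for years in (2, 4, 10):
    print(years, projected_volume(1.0, years))
# 2 2.0
# 4 4.0
# 10 32.0
```

Under that assumption, a data store grows 32-fold in a decade, which helps explain why storage and processing approaches need to be revisited regularly.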
Examples of data usage in the media and entertainment industries abound – ranging from the digitization and repackaging of content for different platforms, to the collection and analysis of viewing figures, audience behavior characteristics, and feedback, to informing decisions concerning program content, scheduling, and promotion. It is contributing to a media and entertainment market that analysts predict will generate an estimated $2.2 trillion in revenue in 2021.
Predictive analytics and demand forecasting are data analytics examples that enable Amazon and other retailers to accurately predict what consumers are likely to purchase in the future, based on indicators from their past buying behavior, market fluctuations, and other factors. For instance, retailers like Walmart and Walgreens regularly analyze changes in the weather to detect patterns in product demand.
These examples rely on drawing inferences from past and current observations to predict or prescribe courses of action for the future. But some examples of usage are rooted very much in the management of ongoing events. In product recalls, for example, big data helps retailers and manufacturers identify who purchased a product, and allows them to reach out to the affected parties.
Diagnostics, predictive medicine, and population health management are some of the big data examples in healthcare. Expansive databases and data analytics tools are empowering healthcare institutions to provide better clinical support, at-risk patient population management, and cost-of-care measurement. Insights from analytics can enable care providers to pinpoint how variations among patients and treatments will influence health outcomes.
With the cost of genome sequencing coming down, analysis of genomics data is enabling healthcare providers to more accurately predict how illnesses like cancer will progress. At the institutional level, some big data systems are able to collect information from revenue cycle software and billing systems to aggregate cost-related data and identify areas for cost reduction.
Examples in computer applications include the “maps to apps” approach that’s transforming the nature of transportation planning and navigation. Big data collected from government agencies, satellite images, traffic patterns, agents, and other sources can be incorporated into software and platforms that put the latest travel information in the palm of your hand, as mobile apps and portals.
General Electric’s Flight Efficiency Services, recently adopted by Southwest Airlines and used by other airlines worldwide, is an example that’s assisting air carriers in optimizing their fuel usage and planning for safety by analyzing the massive volumes of data that airplanes generate.
Data analytics examples extend to all sectors of the economy. Skupos, a PC-based platform that pulls transaction data from 7,000 convenience stores nationwide, is an example of it at work in the retail industry. Each year, billions of transactions studied using the platform’s business analytics tools are made available to store owners, enabling them to determine location-by-location bestsellers and set up predictive ordering.
The Salesforce customer relationship management (CRM) platform integrates data from various aspects of a business, such as marketing, sales, and services, pulling the information into a comprehensive, single-screen overview. The platform’s Einstein Analytics feature automatically provides insights and predictions on metrics like sales and customer churn, using Artificial Intelligence. Salesforce is one of the big data marketing examples that enable users to connect and integrate with outside data management tools.
What Is the Big Data Concept?
A number of concepts surround the treatment and analysis of the diverse range of structured, semi-structured, and unstructured information sources contributing to the estimated 1.7 MB of data created by each person on the planet in each second of 2020.
Before it can be analyzed, data typically goes through a process called extract, transform, load (ETL). Here, data is harvested, formatted to make it readable by an application, then stored for use. The ETL process varies for each type of data.
For structured data, the ETL process stores the finished product in a data warehouse, whose database applications are highly structured and filtered for specific analytics purposes. In the case of unstructured data, the raw format of the data and all of the information it holds are preserved in data lakes.
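The three ETL stages can be sketched with Python’s standard library alone. This is a toy example under stated assumptions: the CSV content, the column names, and the `orders` table are invented, and an in-memory SQLite database stands in for a real data warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; column names and values are invented for illustration.
RAW_CSV = """order_id,amount,currency
1, 19.99 ,usd
2,5.00,USD
3,not-a-number,usd
"""

def extract(text):
    """Extract: read raw rows from a CSV source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: normalize types and formats, dropping rows that cannot be parsed."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"].strip())
        except ValueError:
            continue  # skip rows whose amount is not numeric
        clean.append((int(row["order_id"]), amount, row["currency"].strip().upper()))
    return clean

def load(rows, conn):
    """Load: store the cleaned rows in a structured table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, amount REAL, currency TEXT)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")  # stand-in for a data warehouse
load(transform(extract(RAW_CSV)), conn)
print(conn.execute("SELECT order_id, amount, currency FROM orders").fetchall())
```

Keeping each stage as its own function makes it straightforward to swap the source or the destination without touching the cleaning logic.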
Since the compute, storage, and networking requirements for working with large data sets are beyond the limits of a single computer, big data tools must process information across clusters of computers in a distributed manner. Frameworks like Hadoop run on clusters of servers, allowing data to be stored and analyzed on a massive scale.
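The idea of splitting work across a cluster can be sketched in miniature with a map-and-reduce style word count. This is not Hadoop’s API: in this illustration, worker threads on one machine stand in for cluster nodes, and the partitions and function names are invented.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

# Hypothetical partitions; on a real cluster each would live on a different node.
partitions = [
    ["big data big insights"],
    ["data lakes and data warehouses"],
    ["big clusters"],
]

def map_partition(lines):
    """Map step: count words within one partition, independently of the others."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

def merge(left, right):
    """Reduce step: combine two partial results into one."""
    return left + right

# Each partition is processed by its own worker, then the partial counts are merged.
with ThreadPoolExecutor(max_workers=3) as pool:
    partials = list(pool.map(map_partition, partitions))

totals = reduce(merge, partials, Counter())
print(totals["big"], totals["data"])  # 3 3
```

Because each partition is counted independently, adding more partitions and more workers scales the map step without changing the logic, which is the property distributed frameworks rely on.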
Particularly when dealing with unstructured and semi-structured information, the concept must also take context into account. So, for example, if a query on an unstructured data set yields the number 31, some form of context must be applied to determine whether this is the number of days in a month, an identification tag, or the number of items sold this morning. Merging internal data with external context makes it more meaningful and leads to more accurate data modeling and analysis.
Types of Data
As we’ve already observed, there are three main types: structured, semi-structured, and unstructured.
Structured data is highly organized and easy to define and work with, having dimensions that are defined by set parameters. This is the kind of information that can be represented by spreadsheets or tabular relational database management systems having rows and columns.
Structured data follows schemas, which define the conditions and paths leading to specific data points. For example, the schema for a payroll database will lay out employee identification information, pay rates, number of hours worked, and how payment is delivered. The schema defines each one of these dimensions, for whatever application is using the database.
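A payroll schema along these lines might look like the following sketch, which uses Python’s built-in `sqlite3` module. The table name, column names, and allowed payment methods are hypothetical and chosen only to mirror the dimensions listed above.

```python
import sqlite3

# Hypothetical payroll schema; table and column names are illustrative only.
SCHEMA = """
CREATE TABLE payroll (
    employee_id    INTEGER PRIMARY KEY,
    full_name      TEXT NOT NULL,
    hourly_rate    REAL NOT NULL CHECK (hourly_rate >= 0),
    hours_worked   REAL NOT NULL CHECK (hours_worked >= 0),
    payment_method TEXT NOT NULL CHECK (payment_method IN ('direct_deposit', 'check'))
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO payroll VALUES (1, 'A. Example', 25.0, 40.0, 'direct_deposit')")

# A row that violates the schema is rejected before it reaches the table.
try:
    conn.execute("INSERT INTO payroll VALUES (2, 'B. Example', 30.0, 38.0, 'cash')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

gross = conn.execute(
    "SELECT hourly_rate * hours_worked FROM payroll WHERE employee_id = 1"
).fetchone()[0]
print(rejected, gross)  # True 1000.0
```

Because every row is guaranteed to match the schema, queries such as the gross pay calculation can run without any preparation of the data.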
It’s reckoned that no more than 20% of all data is structured.
Unstructured data is more free-form and less quantifiable, like the kind of information contained in emails, videos, and text documents. More sophisticated techniques must be applied to this type of big data before it becomes accessible to systems and analysts who can extract useful insights from it. This often involves translating it into some form of structured data.
Unstructured data that’s associated with metadata (information about information, such as timestamps or device IDs) is known as semi-structured. Often, while the actual content of the data (the characters making up an email message, or the pixels in an image, for example) is not structured, there are components that allow the data to be grouped, based on certain characteristics.
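To illustrate how metadata allows otherwise unstructured content to be grouped, here is a short Python sketch. The records, device IDs, and field names are invented for the example; JSON is used simply because it is a common carrier for semi-structured data.

```python
import json
from collections import defaultdict

# Hypothetical records: free-form content plus structured metadata fields.
RAW = """
[
  {"device_id": "sensor-a", "timestamp": "2021-03-04T10:00:00", "content": "door opened, nobody there"},
  {"device_id": "sensor-b", "timestamp": "2021-03-04T10:05:00", "content": "humidity seems high"},
  {"device_id": "sensor-a", "timestamp": "2021-03-04T11:30:00", "content": "door opened again"}
]
"""

records = json.loads(RAW)

# Group the free-form content by a metadata field, without interpreting the content itself.
by_device = defaultdict(list)
for record in records:
    by_device[record["device_id"]].append(record["content"])

for device, contents in sorted(by_device.items()):
    print(device, len(contents))
# sensor-a 2
# sensor-b 1
```

The content strings are never parsed here; the grouping relies entirely on the attached metadata.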
Benefits of Big Data
Among the principal benefits for organizations is its potential to improve operational efficiency and reduce costs. This may be due to analytical techniques which identify ways of streamlining or optimizing processes, or predictive or prescriptive analysis that reveals possible problems in time to mitigate or avoid them entirely, or plots courses of action that aid decision-makers in improving business methods and operations.
For commercial organizations, big data analytics using AI and machine learning technologies aids in understanding consumers, improving fulfillment options for product and service delivery, and in crafting and maintaining personalized customer experiences across multiple platforms. These methods help organizations bolster their sales and marketing operations, customer acquisition, and customer retention.
As well as identifying potential risks to the enterprise, analytics also provides an avenue for innovation by establishing opportunities to develop new products and services to fill existing gaps or niches in the market.
Other benefits include supply chain optimization, with data analytics making high-level collaboration between the members of a supplier network possible. This can not only lead to improved logistics and distribution but also open opportunities for innovation and joint ventures between stakeholders in the supply chain.
Future of Big Data
Automation is expected to play a huge role in the future of data analytics, with intelligent and pre-programmed solutions enabling organizations to streamline and simplify the analytics process. Augmented analytics will be one of the facilitating mechanisms for this by integrating machine learning and natural language processing into data analytics and business intelligence.
Augmented analytics can scrub raw data to identify valuable records for analysis, automating certain parts of the process and making the data preparation process easier for data scientists, who habitually spend around 80% of their time cleaning and preparing data for analysis using traditional methods. This will free up more time for data scientists to spend on strategic tasks and special projects.
Relationship analytics is also set to play a part in the future by empowering organizations to make connections between disparate data sources that would appear to have no common ground on the surface. Relationship analytics uses several techniques to transform data collection and analysis methods, allowing businesses to optimize several functions at once.
By bringing together social science, managerial science, and data science into a single field, decision intelligence will put a more human spin on the future of business analytics by using social science to understand the relationships between variables better. Decision intelligence draws from different disciplines such as business, economics, sociology, neuropsychology, and education to optimize the decision-making process.
As IoT devices become more widespread in critical applications, data analysts will expect platforms to generate insights in as close to real-time as possible. Continuous analytics will help accomplish this by studying streaming data, shortening the window for data capture and analysis. By combining big data development, DevOps, and continuous integration, this methodology also provides proactive alerts to end-users or continuous real-time updates.
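A very small version of this idea is a sliding-window check over a stream of readings that raises an alert as soon as a recent average crosses a limit. The sketch below makes several assumptions for illustration: the readings, the window size of three, and the threshold of 80 are all invented, and a Python list stands in for a live stream.

```python
from collections import deque

def rolling_alerts(readings, window=3, threshold=80.0):
    """Yield (index, mean) whenever the mean of the last `window` readings exceeds `threshold`."""
    recent = deque(maxlen=window)  # oldest reading is dropped automatically
    for index, value in enumerate(readings):
        recent.append(value)
        if len(recent) == window:
            mean = sum(recent) / window
            if mean > threshold:
                yield index, mean

# Hypothetical sensor readings arriving one at a time.
stream = [70, 72, 75, 85, 90, 95, 60]
alerts = list(rolling_alerts(stream))
print([index for index, _ in alerts])  # [4, 5, 6]
```

Because the function is a generator, each alert is produced as soon as the reading that triggers it arrives, rather than after the whole data set has been collected.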
Machine learning technology is facilitating the augmentation and streamlining of data profiling, modeling, enrichment, data cataloging, and metadata development, making data preparation processes more flexible. This augmented data preparation and discovery automatically adapts fresh data, especially outlier variables, and augmented discovery machine learning algorithms allow data analysts to visualize and narrate their findings more easily.
Predictive and prescriptive analytics are charting a course for a future in which organizations will be able to improve efficiencies and optimize performance using information not only on what will happen in a particular circumstance, but also on how the outcome could be improved by performing actions X, Y, or Z. In this field, AI and machine learning are already automating processes to assist businesses with guided marketing, guided selling, and guided pricing.
Looking at the future of data itself, the majority of experts agree that the amount of information available will continue to grow exponentially, with some estimates predicting that the global data store will reach 175 zettabytes by 2025. This will be largely due to increasing numbers of transactions and interactions on the internet, and the proliferation of connected technologies.
In response to this, organizations are expected to migrate more of their data load to the cloud, in mainly hybrid or multi-cloud deployments. As legislative frameworks continue to evolve and adapt to changing circumstances, data privacy and governance will remain high on the agenda for both governments and individual citizens.
The growing landscape will also require solutions to the ongoing skills shortage, which is increasing the demand for data scientists, artificial intelligence engineers, Chief Data Officers (CDOs), and other professionals with the relevant skills for managing and manipulating big data.