What is Big Data?
Big Data is a collection of data that is huge in volume and growing exponentially with time. It is data so large and complex that traditional data management tools cannot store or process it efficiently.
Examples of big data
Big data can be found everywhere. Within the mobile telecommunication industry, billions of mobile users all over the world generate telecommunication data abundantly. Using big data analytics, service providers can choose an area to strengthen a network, or offer discounts based on historical usage and geospatial data. Healthcare also generates a huge amount of data via millions of users with personalised devices, such as smartwatches and smartphones. Hospital records also provide a rich source of information about patients’ conditions – analysis of which allows care providers to provide better services to patients.
Types of big data
Generally speaking, raw data is unstructured, meaning that it does not have a known form or structure. Large amounts of unstructured data pose multiple processing challenges when deriving value from it. A typical example of unstructured data is a heterogeneous data source containing a combination of simple text files, images, videos, etc. After going through a process called ‘pre-processing’, unstructured data can become semi-structured or structured. Structured data has a dedicated data model and a well-defined structure, making it easy to store in a database and access over time. Semi-structured data inherits certain properties of structured data, but does not have a definite structure in its major parts.
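As a small illustration of this kind of pre-processing, the sketch below (a hypothetical Python example; the log format and field names are invented for illustration) turns unstructured text lines into structured records with named fields:

```python
import re

# Hypothetical raw log lines (unstructured text) that we want to
# turn into structured records with fixed, named fields.
raw_lines = [
    "2024-01-05 user=alice action=login",
    "2024-01-05 user=bob action=purchase item=book",
]

# A pattern describing the structure we expect to recover.
pattern = re.compile(r"(?P<date>\S+)\s+user=(?P<user>\S+)\s+action=(?P<action>\S+)")

records = []
for line in raw_lines:
    match = pattern.match(line)
    if match:
        # Structured result: a dict with a known schema (date, user, action).
        records.append(match.groupdict())
```

Real pre-processing pipelines deal with far messier inputs (images, free text, malformed rows), but the goal is the same: impose a known structure so the data can be stored in a database and queried.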
Characteristics of big data
The main characteristics of big data include three dimensions: volume, velocity, and variety. Two additional characteristics of big data can be considered to form the five V’s, namely veracity and value.
(i) Volume – The term ‘big data’ itself refers to an enormous size. For example, according to Brandwatch, as of 2014 Facebook stored 300 petabytes of data. That equates to hundreds of millions of gigabytes, several orders of magnitude more than a home computer can store.
(ii) Velocity – The term “velocity” refers to the speed of data generation. How fast data is generated and processed to meet demand determines its real potential. Again taking Facebook as an example, 350 million new photos are uploaded every day. Storing those photos is already challenging, but they must also be processed quickly and made easily accessible.
(iii) Variety – The third aspect of big data is its variety. It refers to the heterogeneous sources and nature of data, both structured and unstructured. Nowadays, data comes in the form of emails, photos, videos, PDFs, audio, etc. All of these can be evaluated in the analysis process, but this variety of unstructured data poses certain issues for storing, mining, and analysing the data.
(iv) Veracity – This covers the quality of data. As data comes in various forms, there is no guarantee that it is clean (meaning, for example, that there are no accidental repetitions or incomplete or otherwise irrelevant items), accurate, and valuable. Therefore, a processing pipeline is required to transform the raw data into a clean, usable format.
(v) Value – Finally, this refers to the ability to transform the huge amount of data into valuable insights. A company that uses big data correctly learns more about its customers and can monetise those insights. For example, online retail businesses can harness customers’ purchase history to predict which customer segments are likely to return and buy in the future, allowing the business to target those customers with more relevant deals.
Extracting insights from data
Understanding data of any size or format can be challenging. It takes years of experience to gain in-depth knowledge about different forms of data. However, many modern data platforms help users store data and extract simple, yet useful, insights from it. Generally, when we encounter a certain behaviour or define a target outcome, we want to understand why that behaviour happens or how to direct actions towards our desired outcome. For example, an online retail business such as Amazon may want to provide coupons for a certain group of users, or it may want to know why revenue for a product line decreased in the latest month. To find the answers, we need a systematic analysis, described below.
The first stage is descriptive analysis, which begins with data cleaning and pre-processing. Descriptive analysis creates simple reports, graphs, and other visualisations that allow users to understand what happened at a particular point by summarising past data. For a single quantitative variable, we can create a histogram or a box-and-whisker plot to grasp its distribution. We can also calculate the minimum and maximum values, which helps identify outliers. For a pair of variables, we can draw a scatter plot if both are quantitative, or a contingency table if both are qualitative. If only one of the variables is qualitative, we can calculate summary statistics of the quantitative variable grouped by the qualitative one. These statistics are easy to extract using modern data platforms, such as Microsoft Azure or Google Cloud Platform.
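To make this concrete, here is a minimal descriptive-analysis sketch using only the Python standard library; the revenue figures, categories, and amounts are made up for illustration:

```python
import statistics
from collections import defaultdict

# Hypothetical daily revenue figures (a single quantitative variable).
revenue = [120.0, 135.5, 99.0, 410.0, 128.5, 131.0, 125.0]

summary = {
    "min": min(revenue),   # extremes help flag outliers (410.0 stands out here)
    "max": max(revenue),
    "mean": round(statistics.mean(revenue), 2),
    "median": statistics.median(revenue),
}

# Summary statistics of a quantitative variable (sale amount)
# grouped by a qualitative one (product category).
sales = [("electronics", 200), ("books", 30), ("electronics", 180), ("books", 25)]
by_category = defaultdict(list)
for category, amount in sales:
    by_category[category].append(amount)
mean_by_category = {c: statistics.mean(v) for c, v in by_category.items()}
```

Note how the mean (164.14) sits well above the median (128.5): the single outlier pulls it up, which is exactly the kind of pattern a histogram or box plot would reveal at a glance.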
The next stage, diagnostic analysis, gives deeper insight into a specific problem, whereas the previously mentioned descriptive analysis is more of an overview. This form of analysis focuses on answering why a behaviour happens. To do that, we look for correlations between variables and our desired outcome. For instance, with an online retail business, when revenue increases or decreases, we may want to know which feature leads to that behaviour. Did something happen on social media that led to a sudden change, or is it a regular behaviour of customers at that time of year or within specific locations? Incorporating domain knowledge into this step allows us to focus on the most relevant variables and drill down to explain the behaviour. It also empowers the retail business with the relevant data insight to make changes to its product profile.
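A very small diagnostic-analysis sketch of this idea, with invented numbers: compute the Pearson correlation between a candidate driver (here, hypothetical daily advertising spend) and the outcome we care about (revenue):

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical daily ad spend and the corresponding revenue.
ad_spend = [10, 20, 30, 40, 50]
revenue = [100, 180, 260, 340, 420]  # moves perfectly in step with ad_spend

r = pearson(ad_spend, revenue)  # close to +1: strong positive association
```

A correlation like this only flags a candidate explanation; it does not prove causation, which is why the domain knowledge mentioned above matters when deciding which variables to drill into.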
Up until this point, we have been looking back at past data to build a good picture of our preferred behaviour and our data. Now, in the third stage, we build a predictive model to project what may happen next. This step requires a strong background in machine learning and statistical modelling to apply models correctly to historical data and obtain reliable estimates about the future. However, no statistical algorithm can predict the future perfectly, so we also need to provide a confidence score for every prediction.
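As a toy illustration of predictive analysis (the monthly figures are invented, and real pipelines would use far richer models), we can fit a least-squares line to past revenue and project the next month:

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit: return (slope, intercept)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    intercept = my - slope * mx
    return slope, intercept

# Hypothetical revenue (in thousands) for the past five months.
months = [1, 2, 3, 4, 5]
revenue = [100, 110, 125, 130, 145]

slope, intercept = fit_line(months, revenue)
forecast = slope * 6 + intercept  # point estimate for month 6

# A production model would also attach a confidence score or interval
# to this point estimate, as the text above notes.
```

The fitted slope (about 11 per month) projects roughly 155 for month 6; the residual spread around the line is what a real model would use to quantify how much to trust that number.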
The last stage, prescriptive analysis, is relatively new. It generalises predictive analysis by quantifying the effect of future decisions, predicting possible outcomes before those decisions are made. It also makes use of statistical and machine learning models, albeit in a more complex way, to explain why an outcome will happen and to recommend corresponding actions.
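A minimal prescriptive-analysis sketch, under a deliberately simple assumed demand model (all numbers and the uplift rule are hypothetical): score each candidate discount level by its expected profit, then recommend the best action:

```python
def expected_profit(discount, base_demand=100, price=50.0, cost=30.0, uplift=5.0):
    """Expected profit for a given discount percentage.

    Assumed model (hypothetical): each percentage point of discount
    adds `uplift` units of demand, while shrinking the per-unit margin.
    """
    demand = base_demand + uplift * discount
    margin = price * (1 - discount / 100) - cost
    return demand * margin

# Candidate decisions to evaluate before committing to any of them.
candidates = [0, 5, 10, 15, 20]
best = max(candidates, key=expected_profit)  # recommended discount level
```

Under these assumptions a 10% discount maximises expected profit: deeper discounts attract more buyers but erode the margin faster than demand grows. Real prescriptive systems replace this hand-written model with ones learned from data, but the decision loop — simulate each action, compare outcomes, recommend — is the same.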
Extracting useful insights from big data is not easy. The key is to ask appropriate questions and try to answer them with the help of suitable big data analytics techniques. Modern data platforms readily provide various tools to explore data and extract low-level information, while extracting high-level information requires deeper domain knowledge and technical skills. In both cases, big data provides the base from which we can obtain useful insights for a variety of purposes across the full range of people’s and entities’ behaviour.
Duc Xuan Nguyen
ARC Industrial Transformation Research Hub for Digital Enhanced Living PhD scholarship recipient
Applied Artificial Intelligence Institute (A2I2), Deakin University
NB: The author reserves the right to showcase/publish this blog piece elsewhere and/or in a different medium.
Editorial review by:
Ms Sharon Grocott, Partner Investigator
Ms Kate Olivieri, Talented Writer working with Sharon.
Kevin Hoon, Hub Manager