Does More Data Equal Better Analytics?
- John Morrell
- July 1, 2021
In our modern world of everything digital and big data, organizations are flush with available data assets that can be used for analytics. We will set aside engineering all this data for the moment and look at a different problem (or benefit, if you’re a glass-half-full type of person): how does more data equal better analytics?
Google’s Research Director Peter Norvig has been quoted many times on the value of more data in analytics:
“We don’t have better algorithms. We have more data.”
“More data beats clever algorithms, but better data beats more data.”
“Simple models and a lot of data trump more elaborate models based on fewer data.”
Once you hear statements such as these from a world-renowned expert, the logical next two questions are “where?” and “how?” The comments from Peter Norvig are heavily slanted towards data science as that was his main job and area of expertise. But more data can deliver better results in BOTH data science and traditional analytics. Let’s explore these.
More Data = More Features
Let’s start in the world of data science. The first and perhaps most obvious way in which more data delivers better results in data science is the ability to expose more features to feed your data science models. In this case, accessing and using more data assets can lead to “wider datasets” containing more variables.
Uniting more datasets into one helps the feature engineering process in two ways. First, it gives you more raw variables that can be used as features. Second, it gives you more fields that you can combine to make derived variables.
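As a minimal sketch of both effects, the snippet below joins two small datasets and then derives a new variable from the combined fields. The records, the field names (customer_id, total_spend, order_count, avg_order_value) and the widen helper are all hypothetical, chosen only for illustration.

```python
# Hypothetical records; every field name here is an illustrative assumption.
customers = [
    {"customer_id": 1, "signup_year": 2019},
    {"customer_id": 2, "signup_year": 2020},
]
orders = [
    {"customer_id": 1, "order_count": 12, "total_spend": 600.0},
    {"customer_id": 2, "order_count": 3, "total_spend": 90.0},
]


def widen(customers, orders):
    """Join the two datasets on customer_id and add one derived variable."""
    orders_by_id = {row["customer_id"]: row for row in orders}
    wide = []
    for customer in customers:
        match = orders_by_id.get(customer["customer_id"])
        if match is None:
            continue  # inner join: skip customers with no order summary
        row = {**customer, **match}  # more raw variables per record
        # Derived variable built by combining two raw fields.
        if row["order_count"]:
            row["avg_order_value"] = row["total_spend"] / row["order_count"]
        else:
            row["avg_order_value"] = 0.0
        wide.append(row)
    return wide


wide_rows = widen(customers, orders)
```

Each row in wide_rows now carries the raw fields from both sources plus the derived avg_order_value, any of which can be evaluated as a candidate feature.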
It is important to note that the brute force approach of throwing more features at a model is NOT the objective. That would be over-engineering the model. The aim is to explore as many candidate features as possible, assess how well each fits the problem at hand, and keep the best ones.
More Data = Better Training
AI and machine learning models are only as good as the data you use to train them. To most people, the natural conclusion is that the larger the volume of data (“longer datasets”) you throw at a model, the better the model will “learn.” While that is a good goal, one needs to be careful with it and explore two areas: variance and bias.
A situation called high variance can occur when we have added too many features to the model (over-engineered it, as discussed above) and don’t have enough data to train the model well. This situation can be fixed by simplifying the model, training it on larger data volumes, or both.
Another case is high bias, where the model is too simple with not enough variables or relevant features. In this situation, throwing more data at the model will not make things better. The better approach is to do as prescribed above – explore more data to find the right features, and then throw more data at the model.
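The high-bias case can be illustrated with a small sketch. It assumes noise-free toy data where y = 3x, a deliberately too-simple model that always predicts the mean, and a second model that uses x as a feature. The data, the function names and the choice of mean squared error are assumptions made for illustration, and the sketch does not cover the high-variance case.

```python
def make_data(n):
    """Noise-free toy data (an assumption): y = 3x on an even grid over [0, 1]."""
    xs = [i / (n - 1) for i in range(n)]
    ys = [3.0 * x for x in xs]
    return xs, ys


def mse(preds, ys):
    """Mean squared error between predictions and targets."""
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)


def constant_model_error(n):
    """High-bias model: ignores x and always predicts the mean of y."""
    _, ys = make_data(n)
    mean_y = sum(ys) / n
    return mse([mean_y] * n, ys)


def linear_model_error(n):
    """Model with the relevant feature: ordinary least squares on x."""
    xs, ys = make_data(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return mse([intercept + slope * x for x in xs], ys)


# More rows do not rescue the too-simple model; the right feature does.
for n in (10, 100, 1000):
    print(n, round(constant_model_error(n), 3), linear_model_error(n))
```

In this toy setup the constant model's error stays high however many rows are added, while the model that includes the relevant feature fits almost perfectly even on ten rows.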
More Data = More Dimensions and Measures to Explore
In traditional analytics, more data can help as well. In ad-hoc analytics, you are trying to answer new questions that the business asks, or to re-ask questions whose answers vary widely depending on the situation. Bringing more data together allows you to explore it in much more depth to find the right answer.
By uniting more datasets and creating more comprehensive data, you have more dimensions to explore and more measures that can be rolled up. More data can also give you a larger number of values in particular fields that can also be explored. This combination lets you “fail fast,” meaning you explore various analysis paths rapidly. If one path doesn’t produce the answer, you quickly explore another direction until you find the best solution.
Be careful, however, as some BI tools have limitations on the size of data sets and the number of variables one can explore. Excel, the most popular analysis tool in the world, certainly has its limitations. Large-scale data exploration requires a robust data infrastructure that can handle the volume of data.
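A rough sketch of what rolling up a measure along different dimensions looks like is shown here. The rows, the field names (region, channel, revenue) and the rollup helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical combined dataset: two dimensions and one measure.
rows = [
    {"region": "East", "channel": "web", "revenue": 100.0},
    {"region": "East", "channel": "store", "revenue": 50.0},
    {"region": "West", "channel": "web", "revenue": 70.0},
]


def rollup(rows, dimension, measure):
    """Sum a measure for each distinct value of the chosen dimension."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[dimension]] += row[measure]
    return dict(totals)


# Trying a different analysis path is a one-argument change.
by_region = rollup(rows, "region", "revenue")
by_channel = rollup(rows, "channel", "revenue")
```

Every extra dimension in the combined dataset becomes another value that can be passed to rollup, which is what makes it cheap to abandon one path and try the next.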
More Data = Wider Purview
Adding more data to an analysis can also help you gain a broader and more complete perspective on a business problem. The more data I add from different aspects of a problem, the fuller the view I have. It can help create what is often referred to in the analytics world as a 360-degree view.
Prime examples of this are in customer analytics: customer experience, customer behavior, customer retention. For instance, in each of these use cases, if I only have data from some channels but not all, I have blind spots that may keep me from getting the most accurate answers. The more data added, the broader the purview of the problem, creating increased accuracy and trust in the results.
More Data = More Detailed Results
Many of the business’s new analytic questions are “why” and “how” questions. Perhaps a dashboard showed metrics that varied greatly from the norm. Immediately, the business wants answers that explain why or how the situation is happening. It also wants “actionable results” telling it what to do about the situation, which requires adding a great deal of detailed data to the analytics to dig deep and find the in-depth answers the business is seeking. In this case, we are creating a broader dataset to explore more variables and find the right set of variables that influence the situation, creating actionable results that explain not just the why and how but, more importantly, the “what”: what to do.
A prime example of this comes up in marketing analytics. A dashboard may show which marketing campaigns are performing better than others and which are performing poorly. Making adjustments is not as simple as continuing the good ones and shutting down the bad ones.
In this case, the business wants the detailed aspects of the campaigns analyzed to determine the best course of action. Are there aspects of the marketing channel that are making campaigns succeed or fail? Demographic characteristics of the targets? Features of the offers?
With these details, the business can make the proper adjustments and action plans to adjust the marketing mix. Getting swift answers, within hours, can also eliminate the wasted costs incurred on low-performing marketing campaigns while the business waits for answers.
More Data = Better Segmentation
Related to the problem above, adding more data to the mix helps create better segmentation models in general. This comes from both wider data and a greater volume of data.
Creating broader data will add more variables to the equation that can be used for segmentation. Teams can explore segments algorithmically (e.g., clustering, decision trees) or visually. And using a greater volume of data gives the analysis more records to work with, which can improve segmentation accuracy.
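To make the algorithmic route concrete, here is a minimal k-means clustering sketch in plain Python. The customer points (orders per year, average order value), the choice of two segments and the first-k initialization are all assumptions made for illustration. A real project would normally scale the features first and use an established library rather than this hand-rolled loop.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=10):
    """Tiny k-means: assign points to the nearest centroid, then re-average."""
    centroids = [points[i] for i in range(k)]  # simplistic init: first k points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(
                    sum(dim) / len(members) for dim in zip(*members)
                )
    return labels, centroids


# Hypothetical customers: (orders per year, average order value).
customers = [(1, 20), (2, 25), (1, 22), (10, 200), (11, 210), (12, 190)]
labels, centroids = kmeans(customers, k=2)
```

Each extra variable joined into the dataset becomes another coordinate in these points, and each extra record becomes another point, which is how wider and longer data both feed into the segments.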
As we have seen, adding more data to your analysis will help you produce better results. The benefit comes not just from broadly adding more data, but also from finding the right data to fit your problem and build a trusted product. In data science, adding more data will help improve accuracy. In traditional analytics, it will help you explore detailed why and how questions, produce actionable results, and gain a broader purview on various analytic situations.
Datameer SaaS Data Transformation platform gives data engineers, data analysts, and data scientists the ability to easily transform and combine raw data into deeper, wider, and more actionable analytics datasets. The multi-persona UI, with no-code, low-code, and code (SQL) tools, brings together your entire team – data engineers, analytics engineers, analysts, and data scientists – on a single platform to collaboratively transform and model data. Catalog-like data documentation and knowledge sharing facilitate trust in the data and crowd-sourced data governance. Direct integration into Snowflake keeps data secure and lowers costs by leveraging Snowflake’s scalable compute and storage.