Big data: can overemphasis on data scalability compromise data quality?

Data Science   |   Published July 18, 2019

I first heard the term “big data” five years ago. The concept has really changed our lives in spectacular ways.
Unfortunately, the term itself might be leading decision-makers astray: many believe that the value of big data is predicated almost entirely on its volume.
The people who have sensationalized the concept of big data deserve some of the blame. They have focused overwhelmingly on the fact that storing massive amounts of data makes certain objectives more attainable. While this is technically true, there are other factors to consider besides the volume of the data under an organization's control.
A couple of years ago, CMSWire tried to draw attention to this issue, pointing out several reasons that data quality is more important than data volume. We decided to build on their message and focus on ways that an emphasis on data scalability can compromise the integrity of a big data campaign. Brands that understand these challenges can account for them and gain an edge with their big data strategy.

Data scalability pitfalls that must be controlled

Organizations have minimum data volume requirements for any project that relies on machine learning or AI. However, data volume is not usually the most important constraint. In fact, organizations that focus too heavily on amassing more data can compromise data integrity in several ways. Here are a few examples of how this fallacy plays out.

Procuring data from too many heterogeneous sources without controlling for consistency

Data can be accumulated from many different places. Don’t get me wrong – there is nothing wrong with pulling data from multiple sources. However, it is important to control for discrepancies between them. Here are some data compatibility issues that have to be monitored:

  • Data from website polls with different types of biases built into their polling models
  • Data from various sources without trying to control for demographic discrepancies
  • Data from CSV files with different fields

It is important to develop a methodology for reconciling these incompatibilities across your data sets.
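As a rough illustration, here is a minimal Python sketch of one such methodology, assuming pandas and two hypothetical survey files (survey_a.csv and survey_b.csv) whose columns describe the same attributes under different names. The file names, column mappings, and required fields are assumptions for illustration, not a prescription:

```python
import pandas as pd

# Hypothetical column mappings: each source names the same underlying
# attributes differently (placeholder files and fields for illustration).
COLUMN_MAPS = {
    "survey_a.csv": {"resp_age": "age", "zip": "postal_code", "answer": "response"},
    "survey_b.csv": {"age_years": "age", "postcode": "postal_code", "resp": "response"},
}

REQUIRED_FIELDS = ["age", "postal_code", "response"]

def load_and_harmonize(path: str, column_map: dict) -> pd.DataFrame:
    """Load one CSV, rename its fields to a shared schema, and drop rows
    missing required values so downstream models see consistent columns."""
    df = pd.read_csv(path)
    df = df.rename(columns=column_map)
    df["source"] = path  # keep provenance so source-specific bias can be modeled later
    return df.reindex(columns=REQUIRED_FIELDS + ["source"]).dropna(subset=REQUIRED_FIELDS)

frames = [load_and_harmonize(path, cmap) for path, cmap in COLUMN_MAPS.items()]
combined = pd.concat(frames, ignore_index=True)
```

Keeping a provenance column alongside the shared schema is what later makes it possible to weight or audit each source separately.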

Neglecting to institute any data quality scoring model

Most decision-makers are probably familiar with the phrase “garbage in, garbage out.” The concept is highly applicable to big data projects. Intel AI has discussed the dangers of building AI solutions on biased or dirty data.
There is a tremendous difference in quality between various types of data. You can’t develop a coherent data strategy without taking this into account.
A recent article on online polling for the 2020 Democratic presidential primary addresses this. The poll in question was brigaded by members of 4Chan and other troll forums, and the point of the article was that online polls are easily subject to manipulation.
Any organization relying on big data must assess the likelihood that its data has been manipulated or is erroneous. It doesn’t necessarily need to eliminate lower-quality data entirely, but it does need to account for data quality in its models. One practical approach is to apply some kind of dampening effect to data that is less reliable.
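As a sketch of what such a dampening effect might look like, the Python snippet below assigns each record a weight based on a hypothetical per-source reliability score, so less trustworthy data still contributes but with reduced influence. The source names and score values are assumptions for illustration only:

```python
# Hypothetical reliability scores per source, in (0, 1].
# The names and values are assumptions for illustration only.
SOURCE_RELIABILITY = {
    "verified_survey": 1.0,
    "partner_feed": 0.8,
    "open_web_poll": 0.3,  # easy to brigade, so heavily dampened
}

def sample_weight(source: str, default: float = 0.5) -> float:
    """Return a dampening weight for a record based on where it came from."""
    return SOURCE_RELIABILITY.get(source, default)

def weighted_estimate(values: list[float], sources: list[str]) -> float:
    """Combine observations into one estimate, down-weighting unreliable sources."""
    weights = [sample_weight(s) for s in sources]
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# The open web poll reports 70%, but its low weight keeps it from dominating.
print(weighted_estimate([0.52, 0.47, 0.70], ["verified_survey", "partner_feed", "open_web_poll"]))
```

The same idea carries over to model training: many learning libraries accept per-record weights (for example, the sample_weight parameter on many scikit-learn estimators), so low-quality data informs the model without driving it.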

Neglecting to consider context with older data sets

All data has a life expectancy. Unfortunately, there aren’t any universal rules on how relevant data remains over time; the value of older data depends on the specific project.
Consider a political party using big data to forecast outcomes on the electoral map. It might retain election data going back decades. But is that older data still valuable?
The answer depends on the specific objectives of the campaign. If the goal is to track evolving positions on issues over time, then that data might be useful. However, if the goal is to project the outcomes of electoral contests in specific regions in the upcoming election, then certain data will have a shelf life. Data from recent polls and the previous election might be relevant, while older data might need to be omitted from the model.
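One common way to encode that shelf life is an explicit recency weight, as in the minimal Python sketch below. The two-year half-life is a hypothetical tuning knob, not a recommendation; a trend-tracking project might choose a far longer one than a near-term election forecast:

```python
from datetime import date

def recency_weight(observed: date, today: date, half_life_days: float = 730.0) -> float:
    """Exponential decay: a record loses half its weight every `half_life_days`.
    The default two-year half-life is an assumption for illustration."""
    age_days = (today - observed).days
    return 0.5 ** (age_days / half_life_days)

# Older election data still contributes, but recent polls dominate the forecast.
print(recency_weight(date(2016, 11, 8), date(2019, 7, 18)))  # roughly 0.39
print(recency_weight(date(2019, 6, 1), date(2019, 7, 18)))   # roughly 0.96
```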

Data quality over data volume: always

Many factors come into play when crafting a data strategy, but you must pay attention to both data volume and data quality. Any model that neglects either one is doomed to fail from the start.