Why big data collection needs to broaden beyond social media mining

Data Science
Published July 10, 2020

When we talk about using “Big Data”, we normally mean using “Social Media Data”. In fact, it’s so common to talk about big data as narrowly referring to data generated by social media platforms that the two terms have become almost synonymous in both tech and the social sciences.
The tools available for working with big data also reflect this problematic elision. When we talk about gathering big data sets, we talk about the best social media scraping tools. Comparing the security of big data platforms typically means comparing social media security protocols. And the marketing insights that big data analysis is (rightly) valued for are often social-media specific as well, such as the observation that visual content and social media are well matched.
The problem, as we will explain in this article, is that social media is not “real life.” Yet because social media data is so easy to work with, it has become the almost exclusive source of data for big data analysts. As a result, when we think we are dealing with information about our customers, too often we are actually dealing with meaningless data points.

  • Our Online Lives

There is a cliché that our lives are now lived entirely online. Like most clichés, it contains a grain of truth: our communication, and the way in which we interact with brands, has been hugely affected by the digital revolution. Nonetheless, there are many reasons to believe that the “people” on social media are not real people at all.
One is that, as much recent research has found, the views that we express on social media are not representative of the spectrum of opinions we actually hold. Related to this fact is another – that the articles, groups, people, and brands that we interact with online only represent a tiny fraction of the information we are exposed to. We shouldn’t forget, after all, that traditional news media still has a much greater reach, both in terms of demographics and geographical area, than even the biggest social media platforms.
As a result, making predictions about consumer behavior from social media data is difficult even if we assume that all users are acting in good faith, and almost impossible if they are not. A growing percentage of social media users now use pseudonyms, or simply fake accounts, in order to preserve their online anonymity. Doing so undermines the ability of big data analysts to generate value from social media data.

  • How “Big” is “Big”?

Why, then, has social media data become so closely associated with the idea of big data?
Well, there is one obvious reason. Social media data is certainly “big.” Or at least it is by some metrics. Social media platforms tend to advertise how much data they have in terms of the number of petabytes currently sitting on their servers: by 2014 Facebook advertised that its data warehouse held more than 300 petabytes and grew at a rate of 4 petabytes per day.
This makes them extremely attractive for researchers and analysts looking for data that is truly “big.” For comparison, the New York Times’ entire output from 1945 to 2005 consisted of just 5.9 million articles totaling a scant 2.9 billion words, whereas a single month of the Twitter Decahose in 2012 contained 2.8 TB of data, including 112.7 GB of text containing over 14.3 billion words.
The problem with measurements like this is that they don’t capture how meaningful such datasets are, or even how much of this data is unique. In the NYT, an article appears only once, and might contribute 1,000 words to the total data available for analysis. In comparison, a single 10-word Tweet, shared 100,000 times, will be counted as a million words of data. Analyzing this dataset will certainly allow a big data analyst to claim they are working with a large dataset, but it’s likely that their conclusions will be of a fairly mundane type: a particular tweet was shared a lot.
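To make the arithmetic concrete, here is a minimal sketch in Python that compares the raw word count of a corpus with the word count after exact duplicates are collapsed. The function name `word_counts` and the toy corpus are assumptions made for illustration. Real retweets often carry prefixes or small edits, so exact-match deduplication is a simplification rather than a production approach.

```python
from collections import Counter


def word_counts(posts):
    """Return (raw_words, unique_words) for an iterable of post texts.

    raw_words counts every copy of every post.
    unique_words counts each distinct post text only once.
    Deduplication is by exact string match (an illustrative assumption).
    """
    copies = Counter(posts)
    raw_words = sum(len(text.split()) * n for text, n in copies.items())
    unique_words = sum(len(text.split()) for text in copies)
    return raw_words, unique_words


# Toy corpus: one 10-word tweet shared 100,000 times, plus one 1,000-word article.
tweet = "one two three four five six seven eight nine ten"
article = " ".join(["word"] * 1000)
corpus = [tweet] * 100_000 + [article]

raw, unique = word_counts(corpus)
print(f"raw words:    {raw:,}")
print(f"unique words: {unique:,}")
```

In this toy corpus the tweet dominates the raw total, while the article accounts for almost all of the unique text. That gap between raw and unique volume is the one that headline dataset sizes hide.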

  • Data vs. Information

In proposing a way to overcome this difficulty, it’s worth revisiting a distinction that you probably last heard as a freshman – between “data” and “information.” Without getting into the technical aspects of information entropy and related fields, we can say that “information” is generally unique, useful, and insightful, and data may not be.
This distinction is particularly appropriate when it comes to analyzing social media datasets, because many of them are largely made up of fairly meaningless data. You might be able to see, for instance, how many times a particular news article has been shared, and by whom, but in a world where the majority of links shared on social media are never even read by the person sharing them, merely forwarded blindly on the strength of the title alone, this doesn’t mean much.
Unfortunately, genuinely informative data is sometimes hard for big data analysts to access. In order to improve the ability of our big data systems to make accurate predictions on consumer behavior, we desperately need to widen our scope beyond social media data. Data on healthcare interactions, for instance, or textual processing techniques that can actually extract meaning from social media posts, would help.

  • The Future

In fact, it’s tempting to conclude that big data won’t really come of age until it can start comparing social media datasets with those generated from other sources. Our dependence on these datasets is explicable – after all, social media platforms make their money from making them accessible – but is problematic nonetheless. In other words, if data can indeed improve creativity, we need to start getting creative with our sources as well as the way we process big data.