
How Bad Data Is Undermining Big Data Analytics


A 2013 article in the New York Times called 2012 the breakout year for Big Data. The promise was enticing: take digital data from the web that was too large for traditional processing, apply new software for mining it, and find solutions to an unlimited range of problems. Since 2012, however, Big Data has had a checkered record of success. By 2017, Gartner Research reported that the actual failure rate of Big Data projects was close to 85%. Another Gartner study from 2018 found that of 196 companies interviewed about Big Data and analytics, 91% had not reached transformational business-intelligence levels. One of the biggest challenges is that much of the data is unstructured, of low quality, and inaccurate. Finding data that actually represents what it is supposed to, amid the explosion of options, is difficult. Success may come instead from new sources of relevant, quality data that is clean, privacy compliant, accurate, and drawn directly from consumer responses.

While many scrambled to get in on the Big Data revolution, the old adage GIGO, garbage in, garbage out, was ignored. With all the new technology tools to process, mine, and analyze ever-growing amounts of digital data, it seemed as if quality mattered less than quantity.

The quantity of data today is staggering: it is estimated that 90% of the data in the world was generated in the last two years. This exponential growth is driving the Data Science and Machine Learning disciplines in global business analytics. Mathematics is the foundation of every discipline that feeds into Data Science, while statistics and probability, computer science, machine learning, and deep learning are the applied sciences built on it. Each of these "hard sciences" may find digital data of higher value in Big Data applications that do not involve humans.

For instance, the Big Data AI applications that have enjoyed success tend to rely on data from devices to predict processes governed by established norms: email spam filtering, auto-complete, auto-correct, facial recognition, chatbots, optical character recognition, assistance with medical diagnosis, and robotics. Each of these applications analyzes inanimate objects or images.

Human behaviors, by contrast, may be better understood and analyzed by directly questioning people through an anonymous survey instrument. Provided the data is representative, it is more factual in nature and less prone to unverifiable assumptions than observed or surveilled clicks, because the data comes "straight from the horse's mouth."

We are ending one of the most challenging years in history, a year in which every market was disrupted by Covid-19 and the consumer's response to it. Numerous predictive models built by analyzing and forecasting historical data are being disrupted as well.

However, when it comes to Big Data AI for human beings and their behaviors, attitudes, and intentions, many of which are driven by subconscious impulses that random digital clicks cannot measure, success has lagged. Yet this is exactly where most companies are attempting to find Big Data value: by mining and analyzing customer transactions and customer data files.

Transactional data, for example, doesn't reveal why a customer bought something, or whether it was a gift for someone else. Customer data files have a reputation for being incomplete, riddled with duplicate or misspelled entries, and full of stale data from people who have moved, died, married, divorced, or changed email addresses.
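The kinds of file defects described above are easy to make concrete. The tiny "customer file" below is purely hypothetical (every name, field, and record is invented for illustration), but it shows how duplicates and missing values inflate a raw row count well past the number of usable customers:

```python
# Hypothetical customer file exhibiting the defects discussed above:
# a near-duplicate, a missing email, and a churned (inactive) record.
records = [
    {"name": "Jane Smith",  "email": "jane@example.com", "active": True},
    {"name": "Jane  Smith", "email": "jane@example.com", "active": True},   # duplicate with stray whitespace
    {"name": "John Doe",    "email": "",                 "active": True},   # missing email
    {"name": "Ann Lee",     "email": "ann@example.com",  "active": False},  # moved / no longer a customer
]

def normalize(rec):
    # Collapse whitespace and lowercase so near-duplicates compare equal.
    name = " ".join(rec["name"].split()).lower()
    return (name, rec["email"].lower())

unique_keys = {normalize(r) for r in records}
usable = [r for r in records if r["email"] and r["active"]]

print(len(records))      # 4 raw rows
print(len(unique_keys))  # 3 distinct customers
print(len(usable))       # 2 rows actually fit for analysis
```

Real customer files are of course messier than this sketch, and real deduplication usually needs fuzzy matching rather than exact keys, but even this simple pass shows why raw row counts overstate what a model can learn from.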

Two industries make it apparent that Big Data doesn't equate to real-world consumer behavior: marketing/advertising and investment/hedge funds.

For marketers and advertisers, digital ad spending is now the largest sector of advertising in the US. Despite this rush to target digitally, the accuracy and efficacy of digital targeting models have been the subject of several high-profile lawsuits by advertisers. Numerous academic research papers have also identified the shortcomings of Big Data ad-targeting models. A 2018 research paper published in Marketing Science Frontiers detailed how targeting models built from online browsing and sold by data brokers and DMPs are not only unreliable but inaccurate, and their reliability worsens as additional parameters are added. A recent Forbes article by Dr. Augustine Fou pointed out that these models fail because they are all derived from data collected behind the scenes, without the knowledge or consent of users. Many of these data sets are also full of bot data. Putting more garbage into a model does not make it better; it just means more garbage in.

The investment community, in particular hedge funds, has also been eager to adopt Big Data under the name Alternative Data. Alternative Data is unlike the data traditionally used by Wall Street, such as a company's financial statements or SEC filings; it can come from credit card transactions, social media, satellite images, web browsing, and more.

The first and most successful firm to employ Alternative Data is the hedge fund Renaissance Technologies, founded by Jim Simons. The book The Man Who Solved the Market: How Jim Simons Launched a Quant Revolution describes how, over the past 30 years, Renaissance has delivered a 39% return to investors.

Privacy regulations like GDPR and MiFID II have impacted Alternative Data sources, and hedge fund users now face many of the same challenges as marketers when sourcing data. The risks to hedge funds include:

  • Data Provenance Risks. Is the data procured in accordance with the applicable terms and conditions from its originator? Scraping e-commerce websites for data may violate their terms of use.
  • Accuracy/Validity Risks. Validating the accuracy of a data set is difficult, and failing to do so leads to further unverifiable assumptions being made about the data.
  • Privacy Risks. Users need to pay attention to how the data is generated. Is it from specific online transactions or user browsing behaviors that may violate privacy regulations?

As if that were not enough, Bloomberg recently reported that through October 2020, Renaissance Technologies' three key funds have seen declines of 20% to 27%. Some experts claim that quants are relying on models that do not reflect today's environment, because the models are trained on historical data. Once again: bad data in, bad data out.

What to do? Data Scientists need to first consider the accuracy, validity, representativeness, and privacy compliance of any data set used as an input; no amount of analysis makes bad data good. They must also recognize that not all data is created equal. Human beings are more than a click, and consumer data sources that go beyond surveillance techniques are needed to close the gap between digital data and real-world consumer behavior. If data accuracy and validity are job one, outcomes should improve. By paying more attention to the accuracy, quality, and validity of data at the start of ML projects, Data Scientists may move beyond the 85% failure rate for Big Data projects.
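One lightweight way to put "accuracy and validity are job one" into practice is a quality gate that runs before any modeling begins. The sketch below is a minimal, illustrative version of such a check; the field names, sample rows, and the 5% missing-data threshold are assumptions invented for this example, not a standard, and it treats any falsy value (None, empty string, zero) as missing for simplicity:

```python
# Minimal pre-modeling quality gate: reject a data set whose required
# fields exceed an allowed rate of missing values.
def quality_report(rows, required_fields, max_missing_rate=0.05):
    """Return per-field missing rates and whether the data set passes."""
    n = len(rows)
    missing = {
        field: sum(1 for r in rows if not r.get(field)) / n
        for field in required_fields
    }
    passed = all(rate <= max_missing_rate for rate in missing.values())
    return {"rows": n, "missing_rates": missing, "passed": passed}

# Illustrative sample: one of three records is missing its "age" value.
sample = [
    {"age": 34, "intent": "buy"},
    {"age": 51, "intent": "browse"},
    {"age": None, "intent": "buy"},
]
report = quality_report(sample, ["age", "intent"])
print(report["passed"])  # False: a third of "age" values are missing
```

A real gate would also check for duplicates, stale records, value ranges, and representativeness against a known population, but even a simple pass like this forces the accuracy conversation to happen before modeling rather than after.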

---

Note: Although I am not a Data Scientist, I am a data entrepreneur who has created hundreds of consumer data sets that have helped power predictive analytics for the past 35 years. These data inputs have helped generate over a billion dollars of revenue for some of the world's largest and most recognizable brands. My goal is to encourage a focus on quality data inputs that are accurate and representative of the real world when beginning a Big Data project, especially one pertaining to human behavior.

To see how a quality, accurate data set can be applied to target-marketing models and time-series forecasting, Prosper has partnered with AWS to make its data available via the AWS Data Exchange. Included are a series of US signals, leading indicators, predictive analytics, and advanced privacy-compliant marketing models for the US and China.

