9 Problems with Big Data

10 April 2014 Internet, IT & e-Discovery Blog Blog
Authors: Peter Vogel

There may be too many headlines about “combining the power of modern computing with the plentiful data of the digital era, it promises to solve virtually any problem — crime, public health, the evolution of grammar, the perils of dating — just by crunching the numbers” as opined in the New York Times.  This opinion was written by New York University professor of psychology Gary Marcus who is an editor of the forthcoming book “The Future of the Brain” and professor of computer science Ernest Davis, and they offer this advice:

Big data is here to stay, as it should be. But let’s be realistic: It’s an important resource for anyone analyzing data, not a silver bullet.

Professors Marcus and Davis suggest the following problems with Big Data:

The first thing to note is that although big data is very good at detecting correlations, especially subtle correlations that an analysis of smaller data sets might miss, it never tells us which correlations are meaningful. A big data analysis might reveal, for instance, that from 2006 to 2011 the United States murder rate was well correlated with the market share of Internet Explorer: Both went down sharply. But it’s hard to imagine there is any causal relationship between the two. Likewise, from 1998 to 2007 the number of new cases of autism diagnosed was extremely well correlated with sales of organic food (both went up sharply), but identifying the correlation won’t by itself tell us whether diet has anything to do with autism.

Second, big data can work well as an adjunct to scientific inquiry but rarely succeeds as a wholesale replacement. Molecular biologists, for example, would very much like to be able to infer the three-dimensional structure of proteins from their underlying DNA sequence, and scientists working on the problem use big data as one tool among many. But no scientist thinks you can solve this problem by crunching data alone, no matter how powerful the statistical analysis; you will always need to start with an analysis that relies on an understanding of physics and biochemistry.

Third, many tools that are based on big data can be easily gamed. For example, big data programs for grading student essays often rely on measures like sentence length and word sophistication, which are found to correlate well with the scores given by human graders. But once students figure out how such a program works, they start writing long sentences and using obscure words, rather than learning how to actually formulate and write clear, coherent text. Even Google’s celebrated search engine, rightly seen as a big data success story, is not immune to “Google bombing” and “spamdexing,” wily techniques for artificially elevating website search placement.

Fourth, even when the results of a big data analysis aren’t intentionally gamed, they often turn out to be less robust than they initially seem. Consider Google Flu Trends, once the poster child for big data. In 2009, Google reported — to considerable fanfare — that by analyzing flu-related search queries, it had been able to detect the spread of the flu as accurately and more quickly than the Centers for Disease Control and Prevention. A few years later, though, Google Flu Trends began to falter; for the last two years it has made more bad predictions than good ones.

A fifth concern might be called the echo-chamber effect, which also stems from the fact that much of big data comes from the web. Whenever the source of information for a big data analysis is itself a product of big data, opportunities for vicious cycles abound. Consider translation programs like Google Translate, which draw on many pairs of parallel texts from different languages — for example, the same Wikipedia entry in two different languages — to discern the patterns of translation between those languages. This is a perfectly reasonable strategy, except for the fact that with some of the less common languages, many of the Wikipedia articles themselves may have been written using Google Translate. In those cases, any initial errors in Google Translate infect Wikipedia, which is fed back into Google Translate, reinforcing the error.

A sixth worry is the risk of too many correlations. If you look 100 times for correlations between two variables, you risk finding, purely by chance, about five bogus correlations that appear statistically significant — even though there is no actual meaningful connection between the variables. Absent careful supervision, the magnitudes of big data can greatly amplify such errors.

Seventh, big data is prone to giving scientific-sounding solutions to hopelessly imprecise questions. In the past few months, for instance, there have been two separate attempts to rank people in terms of their “historical importance” or “cultural contributions,” based on data drawn from Wikipedia. One is the book “Who’s Bigger? Where Historical Figures Really Rank,” by the computer scientist Steven Skiena and the engineer Charles Ward. The other is an M.I.T. Media Lab project called Pantheon.

FINALLY, big data is at its best when analyzing things that are extremely common, but often falls short when analyzing things that are less common. For instance, programs that use big data to deal with text, such as search engines and translation programs, often rely heavily on something called trigrams: sequences of three words in a row (like “in a row”). Reliable statistical information can be compiled about common trigrams, precisely because they appear frequently. But no existing body of data will ever be large enough to include all the trigrams that people might use, because of the continuing inventiveness of language.

Wait, we almost forgot one last problem: the hype. Champions of big data promote it as a revolutionary advance. But even the examples that people give of the successes of big data, like Google Flu Trends, though useful, are small potatoes in the larger scheme of things. They are far less important than the great innovations of the 19th and 20th centuries, like antibiotics, automobiles and the airplane.

The more we know about Big Data the better we are able to value it.

This blog is made available by Foley & Lardner LLP (“Foley” or “the Firm”) for informational purposes only. It is not meant to convey the Firm’s legal position on behalf of any client, nor is it intended to convey specific legal advice. Any opinions expressed in this article do not necessarily reflect the views of Foley & Lardner LLP, its partners, or its clients. Accordingly, do not act upon this information without seeking counsel from a licensed attorney. This blog is not intended to create, and receipt of it does not constitute, an attorney-client relationship. Communicating with Foley through this website by email, blog post, or otherwise, does not create an attorney-client relationship for any legal matter. Therefore, any communication or material you transmit to Foley through this blog, whether by email, blog post or any other manner, will not be treated as confidential or proprietary. The information on this blog is published “AS IS” and is not guaranteed to be complete, accurate, and or up-to-date. Foley makes no representations or warranties of any kind, express or implied, as to the operation or content of the site. Foley expressly disclaims all other guarantees, warranties, conditions and representations of any kind, either express or implied, whether arising under any statute, law, commercial use or otherwise, including implied warranties of merchantability, fitness for a particular purpose, title and non-infringement. In no event shall Foley or any of its partners, officers, employees, agents or affiliates be liable, directly or indirectly, under any theory of law (contract, tort, negligence or otherwise), to you or anyone else, for any claims, losses or damages, direct, indirect special, incidental, punitive or consequential, resulting from or occasioned by the creation, use of or reliance on this site (including information and other content) or any third party websites or the information, resources or material accessed through any such websites. In some jurisdictions, the contents of this blog may be considered Attorney Advertising. If applicable, please note that prior results do not guarantee a similar outcome. Photographs are for dramatization purposes only and may include models. Likenesses do not necessarily imply current client, partnership or employee status.

Authors

Related Services

Insights

A Review of Recent Whistleblower Developments
19 July 2019
Legal News: Whistleblower Developments
Cloud security inadequate for Cyber threats, are you surprised?
19 July 2019
Internet, IT & e-Discovery Blog
Blockchain: A Tool With a Future in Healthcare
18 July 2019
Health Care Law Today
Do You Know What IMMEX Stands For?
16 July 2019
Dashboard Insights
Review of 2020 Medicare Changes for Telehealth
11 December 2019
Member Call
2019 NDI Executive Exchange
14-15 November 2019
Chicago, IL
MAGI’s Clinical Research Conference
29 October 2019
Las Vegas, NV
Association for Corporate Counsel Annual Meeting 2019
27-30 October 2019
Phoenix, AZ