Big Data and IoT Data Analysis – Who Has Privacy?

10 June 2014 Internet, IT & e-Discovery Blog
Authors: Peter Vogel

Given the size of Big Data and the IoT (Internet of Things), it is clear that we really have no privacy. Just consider Justice Sonia Sotomayor’s opinion that “GPS monitoring generates a precise, comprehensive record of a person’s public movements that reflects a wealth of detail about her familial, political, professional, religious, and sexual associations.”  Justice Sotomayor’s concurring opinion was in the 2012 US v. Jones case, in which the US Supreme Court ruled 9-0 that collecting 4 weeks of location data about an alleged drug dealer from a GPS device attached to his car without a valid warrant violated the defendant’s Fourth Amendment guarantee of privacy.

A recent issue of the New York University Journal of Law and Liberty included an article about privacy under the Fourth Amendment entitled “When Enough Is Enough: Location Tracking, Mosaic Theory, and Machine Learning.”  The NYU article was written by Steven M. Bellovin (Professor, Columbia University, Department of Computer Science), Renée M. Hutchins (Associate Professor, University of Maryland Francis King Carey School of Law), Tony Jebara (Associate Professor, Columbia University, Department of Computer Science), and Sebastian Zimmeck (Ph.D. candidate, Columbia University, Department of Computer Science).

The NYU article included a description of “Unsupervised Machine Learning” which “automatically finds dependencies, correlations, and clusters in the data without requiring any significant human intervention. More specifically, it could perform the following operations”:

  • Clustering: In clustering, a system automatically finds groups of users in the dataset that appear statistically similar. For instance, certain individuals may show a pattern of visiting churches on Sundays while others stay home during that time. After application of a clustering algorithm, it becomes relatively easy for a human investigator to observe prototypes from each cluster and figure out which group it represents (for instance, followers of a particular faith, e.g., Christians). The number of groups to be extracted can be fixed (i.e., find the 5 most important groups) or can be automatically estimated. The groupings could be disjoint, overlapping, hierarchical, or nested in various ways. For instance, sub-groups of religious activity (Baptists, Roman-Catholics, Lutherans, etc.) could emerge under a larger umbrella group (Christians).
  • Detection: Given data about individuals as an unbiased sample of the population, a detection system recovers a probability distribution, which says how an individual likely behaves under this sample. This permits an investigator to flag anomalous users in the training data (and in future data) as individuals with a score that is lower than some reasonable threshold. Alternatively, it is possible to identify the handful of users who had the lowest scores as outliers, for example, in a location dataset those who do not exhibit regular location movement. One natural example of an outlier is the mail carrier who spends the workday going door-to-door delivering mail. This is an unusual commute pattern relative to the rest of the population.
  • Visualization and Summarization: Another application of machine learning is visualizing trends in “big data” and highlighting important aspects in it. While each person’s record may contain thousands or millions of bytes of information, a human investigator can only visualize projections of the data in two or three dimensions. Machine learning, however, finds low-dimensional embeddings, which summarize the original data with minimal distortion. For example, the similarities or distances between pairs of visualized low-dimensional embedding-points could be almost equal to the similarities or distances that were measured between pairs of original data points. Alternatively, only the key measurements in the original data points are preserved. For example, from the thousands of latitude and longitude coordinates a user visited, it is possible to extract one or two important locations such as the user’s home or place of work.
  • Inference: One of the most powerful unsupervised machine learning techniques is arguably probabilistic inference. In particular, machine learning is able to find dependencies in parts of a collection of data gathered about users. For instance, if we have observed two types of information for many users, say, their location history and web-browsing history, a machine learning system can learn the dependence and correlations between locations and browsing. This allows the system, for example, to fill-in likely browsing patterns for a new user even though only location history for this user was available. Put another way, we can predict a user will probably visit the website frequently if that user has frequently attended sports events at stadiums.
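
As a rough illustration of the clustering and detection operations quoted above, here is a minimal Python sketch. It is not the method used in the NYU article: it uses plain k-means for the grouping and treats distance to the nearest cluster center as a stand-in for the article's probability score (so here a large distance, rather than a low score, marks an outlier). The coordinates, the threshold, and the names `visits`, `new_user` and `THRESHOLD` are all made up for the example.

```python
import math

def nearest(point, centroids):
    """Return (index, distance) of the centroid closest to point."""
    distances = [math.dist(point, c) for c in centroids]
    index = min(range(len(centroids)), key=distances.__getitem__)
    return index, distances[index]

def kmeans(points, k, iterations=10):
    """Plain k-means with a deterministic farthest-point initialization."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Seed the next centroid with the point farthest from the current ones.
        centroids.append(max(points, key=lambda p: nearest(p, centroids)[1]))
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its closest centroid.
        labels = [nearest(p, centroids)[0] for p in points]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids

# Hypothetical toy data: one typical (x, y) location per user.
visits = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
labels, centroids = kmeans(visits, k=2)

# Detection, simplified: a user far from every cluster center is flagged.
THRESHOLD = 1.0          # assumed cut-off, chosen by hand for this toy data
new_user = (2.5, 9.0)    # hypothetical user with an unusual movement pattern
_, score = nearest(new_user, centroids)
is_outlier = score > THRESHOLD

print(labels)
print(is_outlier)
```

The sketch only groups users; as the article notes, a human investigator would still have to look at representative members of each group to decide what the group means.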

Of course the article includes a discussion of Supervised Machine Learning which is “more laborious to create since it requires human annotation effort while unsupervised learning is more of a pure data collection exercise. With supervised learning, we can perform the following operations with varying degrees of accuracy”:

  • Classification: One of the most basic supervised machine learning operations is classification, that is, the identification of a category for a new observation. In addition to collecting data about an individual, classification also requires that we annotate individuals with a discrete label. Collecting such a categorical variable about an individual often requires some effort, expense, or a need for the subject to volunteer information about themselves. For example, in addition to collecting location data, one may survey a small portion of the population and ask them to report their occupation (student, construction worker, taxi driver, etc.). Then, having obtained such labels from the survey, it is possible for a machine learning system to automatically label other individuals using only their location data.
  • Regression: While classification involves obtaining a discrete label for an individual, regression assumes that the label is a scalar. For instance, instead of a category (such as occupation), we may collect the income that the individual received last year as a numerical value. Machine learning then learns a good prediction function from training examples to accurately estimate the salary of other individuals directly from their location data. For instance, by getting location data from someone who lives in an expensive neighborhood and works in the financial district, it would be possible to estimate a high income level.
  • Prediction: In prediction, the output is either discrete (as in classification) or continuous (as in regression), but is also specifically a quantity that is only available in the future after the input raw data is observed from a user. For example, the output may be the location (latitude and longitude) that the user will visit tomorrow for lunch. Alternatively, it may be the party (Republican or Democrat) that a person will vote for in the next election. By observing a population of users for some time, it may be possible to predict that a user will likely go for pizza at the mall in his or her next lunch break. Prediction may help an advertising company determine what ad to target on a mobile device by delivering a relevant message (for instance, to lure the user to a new pizza establishment in the vicinity of his or her next lunch location).
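
To make the difference between classification and regression concrete, here is a minimal Python sketch under stated assumptions. The article does not say which algorithms it has in mind; this example uses a nearest-neighbor vote for classification and a one-variable least-squares line for regression. The features (share of the day spent moving, stops per day, a neighborhood cost index), the survey answers, and the income figures are all invented for illustration.

```python
import math
from collections import Counter

def knn_classify(labeled, query, k=3):
    """Label query by majority vote among its k nearest labeled examples."""
    neighbors = sorted(labeled, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x

# Classification: hypothetical survey of four people.
# Features: (share of the day spent moving, distinct stops per day).
surveyed = [
    ((0.9, 40), "taxi driver"),
    ((0.8, 35), "taxi driver"),
    ((0.1, 2), "student"),
    ((0.2, 3), "student"),
]
occupation = knn_classify(surveyed, (0.85, 38))

# Regression: hypothetical neighborhood cost index vs. income (in thousands).
cost_index = [1, 2, 3, 4]
income = [30, 50, 70, 90]
slope, intercept = fit_line(cost_index, income)
estimated_income = intercept + slope * 5  # person living in an index-5 area

print(occupation)
print(estimated_income)
```

Prediction in the article's sense would reuse the same machinery; the only change is that the label being learned (tomorrow's lunch location, say) becomes available after the input data was collected.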

Even with the FTC’s recent call to Congress to control Big Data, it seems likely that we have less privacy given the size and scope of data analysis of Big Data and IoT.
