Tuesday, May 6, 2014

The Dangers of Sharing Log Data

A lot of my research relies on analyzing behavioral log data, including query logs (example: personal navigation), web browser logs (example: web revisitation patterns), social media logs (example: #TwitterSearch), IM logs (example: impact of availability state), and GPS traces (example: trajectory-aware search). Behavioral logs provide a picture of human behavior as seen through the lens of the system that captures and records user activity.

However, behavioral logs can also provide a picture of a specific individual, and as such raise privacy concerns. Would you be willing to share your query history with me? I search for fairly mundane things, but even so, there’s no way I’d share an unfiltered version of my queries with you. As a result, despite good intentions, several companies that have tried to make behavioral logs available to the research community have ended up in hot water. The two best-known examples are AOL and Netflix.


The disastrous – but intentional – release of some AOL query logs paints a picture of just how dangerous sharing private behavioral log data can be for companies. On August 4, 2006, shortly before the SIGIR 2006 conference, AOL released some search logs to the academic community. The logs consisted of 20 million queries from 650,000 users that had been issued over the course of three months. Each line represented a single search action:
<User ID; Query text; Query time; Rank of clicked item; URL of clicked item>.
The dataset was really exciting for the research community because, unlike previous datasets, the logs contained anonymized User IDs that would allow researchers to track queries across sessions, making significant research into personalization possible.
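To make the record format above concrete, here is a minimal Python sketch that parses AOL-style log lines and groups queries by user, which is exactly what tracking queries across sessions requires. It assumes tab-separated fields in the order listed above, with the rank and URL fields empty when no result was clicked; the function and variable names are my own.

```python
from collections import defaultdict

def parse_logs(lines):
    """Parse tab-separated query-log lines and group actions by user ID.

    Assumed field order: user ID, query text, query time,
    rank of clicked item, URL of clicked item (last two may be empty).
    """
    sessions = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        user_id, query, query_time = fields[0], fields[1], fields[2]
        rank = fields[3] if len(fields) > 3 and fields[3] else None
        url = fields[4] if len(fields) > 4 and fields[4] else None
        sessions[user_id].append((query_time, query, rank, url))
    return sessions

# Illustrative records, not actual dataset rows.
logs = [
    "4417749\tlandscapers in lilburn ga\t2006-03-01 10:04:11\t\t",
    "4417749\tarnold\t2006-03-02 14:23:08\t3\thttp://example.com",
]
by_user = parse_logs(logs)
```

Once queries are keyed by a persistent user ID like this, longitudinal analyses (and, as it turned out, re-identification) become straightforward.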

Just a few days later, on August 7, 2006, AOL pulled the logs due to privacy concerns. But it was too late. The logs were already mirrored on a number of sites, and remain available today.

By August 9, 2006, the New York Times had identified a user in the dataset. In a story titled “A Face Is Exposed for AOL Searcher No. 4417749,” AOL User 4417749 was shown to be Thelma Arnold, a 62-year-old woman from Lilburn, GA. To find her, the reporter first found a user with a series of queries for businesses and services in a small town in Georgia, population 11,000. That same user had also issued multiple queries for people with the last name Arnold. There were only 14 people named Arnold in Lilburn, GA, and the reporter called them all. When he contacted Thelma, she acknowledged the queries were hers.

Of course, Thelma is not the only famous user in the AOL query logs. Nor are her queries particularly revealing. On the other hand, User 927 shows just how dark a person’s query history can be, issuing a series of very graphic queries. (These queries eventually inspired a play by Katharine Clark Gray.) And User 711391’s unhappy marriage and affair are captured in a short online movie titled “i love alaska.”

On August 21, 2006, less than three weeks after the data was released, two AOL employees were fired because of the data release and the CTO resigned. By September 2006, a class action lawsuit was filed against AOL asking for $5,000 per user in the dataset, or over three billion dollars.

As this example clearly demonstrates, removing a user’s identifier does not make the logs anonymous. The AOL logs contained a lot of directly identifiable information, including names (e.g., “Arnold”), phone numbers, credit card numbers, and social security numbers. They also contained information that was not identifiable on its own but that could, in combination, be used to identify someone. For example, birthdate, gender, and zip code alone are enough to uniquely identify 87% of all Americans.
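The risk from such quasi-identifiers is easy to quantify: count how many records share each (birthdate, gender, zip) combination, and any combination that occurs exactly once pins down an individual. A toy sketch, with illustrative data and field names of my own choosing:

```python
from collections import Counter

def unique_fraction(records, keys):
    """Fraction of records whose quasi-identifier combination is unique."""
    combos = Counter(tuple(r[k] for k in keys) for r in records)
    unique = sum(1 for r in records
                 if combos[tuple(r[k] for k in keys)] == 1)
    return unique / len(records)

# Toy population: the first person's combination is unique,
# so she is re-identifiable from these three fields alone.
people = [
    {"birthdate": "1944-07-07", "gender": "F", "zip": "30047"},
    {"birthdate": "1980-01-01", "gender": "M", "zip": "30047"},
    {"birthdate": "1980-01-01", "gender": "M", "zip": "30047"},
]
frac = unique_fraction(people, ("birthdate", "gender", "zip"))
```

Run over U.S. census data, this kind of count is where the 87% figure comes from.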


On Oct 2, 2006, following in AOL's footsteps, Netflix announced the Netflix Prize, an open competition for developing new collaborative filtering algorithms for predicting a user’s ratings for films, along with a $1 million prize for the first team to beat the current Netflix algorithm by 10%. The data consisted of 100 million ratings from 480,000 users of 17,000 movies. Specifically, it contained:
<Movie ID; User ID; Rating; Date of rating>
<Movie ID; Title; Year>.
In the wake of the AOL debacle, significant care was taken in preparing the data released to the public, including the introduction of random noise. The following answer is posted on the Netflix Prize FAQ in response to the question, “Is there any customer information in the dataset that should be kept private?”
“No, all customer identifying information has been removed; all that remains are ratings and dates. This follows our privacy policy... Even if, for example, you knew all your own ratings and their dates you probably couldn’t identify them reliably in the data because only a small sample was included (less than one tenth of our complete dataset) and that data was subject to perturbation. Of course, since you know all your own ratings that really isn’t a privacy problem is it?”
Nonetheless, the data was quickly de-anonymized in a paper published by Narayanan and Shmatikov. While it is true that there was no directly identifying information in the Netflix dataset, infrequently rated movies and rating dates made it possible to identify users by linking the private dataset to other, identified datasets like IMDb. This same approach had previously been used to de-anonymize hospital records using public voter registration records.
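The linking idea can be sketched as a scoring problem: for each candidate in a public, identified dataset, score the overlap with the anonymous record, weighting rarely rated movies far more heavily because they are far more identifying. This is a heavily simplified sketch in the spirit of the Narayanan–Shmatikov attack; the names, data, and weighting are illustrative, and the real algorithm also matches rating values and dates with tolerances.

```python
import math

def best_match(anon_movies, public_profiles, movie_counts, total_users):
    """Return the public identity whose rated movies best overlap the
    anonymous record, with rarer movies contributing higher scores."""
    def weight(movie):
        # A movie rated by few users carries more identifying information.
        return math.log(total_users / movie_counts.get(movie, 1))

    best, best_score = None, float("-inf")
    for name, movies in public_profiles.items():
        score = sum(weight(m) for m in anon_movies & movies)
        if score > best_score:
            best, best_score = name, score
    return best, best_score

# Toy auxiliary data standing in for an identified source like IMDb.
movie_counts = {"Blockbuster": 900, "Obscure Doc": 3, "Indie Film": 10}
public = {
    "alice": {"Blockbuster", "Obscure Doc", "Indie Film"},
    "bob": {"Blockbuster"},
}
who, _ = best_match({"Obscure Doc", "Indie Film"}, public,
                    movie_counts, total_users=1000)
```

Even two obscure titles are enough to single out one candidate here, which is why the added noise and sampling did not protect users who had rated unusual movies.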

The million dollar Netflix Prize was awarded on Sept 21, 2009 to a team led by AT&T researchers called BellKor’s Pragmatic Chaos, and another competition was announced. But a few months later Netflix was sued for invasion of privacy by a lesbian woman who was concerned that her private movie choices (e.g., liking Brokeback Mountain) revealed her sexual orientation in a way that negatively impacted her life. By Mar 12, 2010 the second competition was cancelled.

Netflix exposed itself to considerable risk even when it thought it was doing everything necessary to protect its users’ privacy. Thus it is not surprising that companies are now unwilling to share private behavioral data with researchers. Unfortunately this means that a lot of interesting research is limited by a lack of access to data. I hope that as a community we are able to figure out a way to share data that doesn’t compromise privacy, but I don’t really know what that would look like.

Teevan. Using Large Scale Log Analysis to Understand Human Behavior. Keynote at JITP 2011.
Dumais, Jeffries, Russell, Tang & Teevan. "Understanding User Behavior through Log Data and Analysis." In Ways of Knowing in HCI.
