Large-scale behavioral log analysis allows us to do many things, including build better search engines, predict health epidemics, and support communities through crises. But behavioral logs can be hard to come by. This post covers some of the different types of behavioral logs that are publicly available for study.
Logs of Public Behavior
The content that people post publicly on social media websites provides a rich, up-to-the-minute picture of what they are talking about, what they are reading, and how they are connecting. Example datasets that researchers have studied to understand people's public behavior include:
- Twitter (e.g., tweets, re-tweets, replies, favorites, location, profiles)
- Facebook (e.g., posts, comments, likes, friends)
- Wikipedia (e.g., article content, edits, profiles)
- Microsoft Academic (e.g., papers, authors, citations)
Community Generated Logs
The research community has also tried to create its own logs of computer-mediated behavior that is not public, so that researchers have something to study. For example, the purpose of the Lemur Community Query Log Project was to create a query log that could be used by the information retrieval research community. Participants in the project were asked to install a toolbar and consent to having their queries and clicks collected. The plan was to release the queries collected from all participants to researchers in a controlled manner. Unfortunately, despite significant community interest in public search datasets, almost nobody installed and used the toolbar. The project eventually ended because, after a year of data collection, it had gathered only as much data as Google collects in 6 seconds.
Publicly Released Private Logs
Finally, there are a handful of publicly released private logs available for study. These include:
purchased by Andrew McCallum for $10,000 after the company went bankrupt. Many of these datasets have since been redacted due to privacy concerns – but that is a discussion for a future blog post.
Edited to add additional datasets people have pointed me to:
- NYC taxicab trips for 2013
- Mobile network data in Senegal