Wednesday, October 22, 2014

Data Banks

Each of us individually creates a huge amount of data online. Some of this data we create explicitly, such as when we make webpages or public-facing profiles, write emails, or author documents. But we also create a lot of data implicitly as a byproduct of our interactions with digital information. This implicit data includes the search queries we issue, the webpages we visit, and our online social networks.

The data we create is valuable. We can use it to understand more about ourselves, and services can use it to personalize our experiences and understand people’s information behavior in general. But despite the fact that we are the ones who create the data, much of it is not actually in our possession. Instead, it resides with companies that provide us with online services in exchange for it. A handful of powerful companies have a monopoly on our data.
Definition of monopoly: the exclusive possession or control of the supply or trade in a commodity or service
Definition of data monopoly: the exclusive possession or control of the supply or trade in an individual’s personal data
When a company that makes use of data to provide a service has a data monopoly, competitors cannot provide the same quality of service because they just don't have the same amount of information. Further, data monopolies are self-reinforcing; the fact that a company can provide the best service enables that company to collect the most additional data, which in turn can be used to further improve their service.

Google provides a good example of a data monopoly. Search engines incorporate usage data into their ranking algorithms. The more people who search using a search engine, the better the ranking. As a result, it is almost impossible for a new, unused search engine to rank documents well. Bing was only able to enter the scene because it had a powerhouse like Microsoft behind it to help drive through a period of significant data sparsity. Facebook is another good example. When your social network is owned by Facebook, the company is able to provide significant value that you cannot bring with you to another site. New companies without data don’t stand a chance in the face of data monopolies.

Data monopolies create other problems as well. Because our data is valuable to the companies that collect it, it often ends up fragmented across services. For example, Bing knows some of the search queries I issue, and Google knows a different set. They can each use the fraction of my queries that they have to improve their search results and personalize my experience, but neither can use all of my queries to create the best personalized experience possible. And I, meanwhile, cannot get at any of my queries without relying on both of the companies to return them to me. Additionally, academic research is inhibited by researchers’ lack of access to proprietary company data. As a result, innovation is stifled.

However, even if companies wanted to share the data they collect, it would be very difficult. Our personal data is not something individuals would like to share with other people, emerging companies, or even academic researchers. We tend to be willing to share some information with a service provider if it enables better experiences, but even as we do, we worry about breaches in our trust and the collecting entity’s ability to secure our data. We also worry about sharing information now that might be used against us at some unforeseen point in the future. If we could find a solution to data monopolies that enabled us to maintain control over our data, our trust in the system could enable us to share more information and receive better services.

One potential solution is to build data banks that serve as a trusted third party to collect and aggregate personal usage data. The data bank could then allow us access to our data as needed, share it with the companies we choose to enable personalized experiences, and sell anonymous data aggregated across users to companies that want large-scale usage data. Through the bank, the use of data could be audited to ensure any one individual’s data was not being used inappropriately. Further, individuals could receive a portion of the proceeds from the sale of aggregate data in the form of monetary interest.
Definition of bank: a business that keeps money for individual people or companies, exchanges currencies, makes loans, and offers other financial services
Definition of data bank: a business that keeps data for individual people or companies, aggregates data, makes information loans, and offers other information services
Data banks would allow us to have a complete picture of all of our data. For example, I would love to view all of the emails I ever received from my mother at the same time. Right now I can only see the emails I received at one account. Likewise, it might be nice to see all of the searches I have ever run, not just the queries I issued to a particular search engine. If our data were collected in a single place, we would be able to easily access and use it. We could also selectively share it, retroactively deleting information we don’t want others to have (example: embarrassing pictures from college) as desired.
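To make the idea a little more concrete, here is a minimal sketch of what the individual-facing side of a data bank might look like. Everything in it is an assumption made for illustration: the `DataBank` class, its method names, the record layout, and the "alice" and "new_search" names are hypothetical, not a description of any real system. It shows one combined view across services, access granted per service, an audit trail of reads, and retroactive deletion.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class DataBank:
    """Hypothetical in-memory data bank; all names here are illustrative."""
    # records[owner] is a list of (source, kind, value) tuples
    records: dict = field(default_factory=lambda: defaultdict(list))
    # grants[owner] is the set of services the owner has allowed to read
    grants: dict = field(default_factory=lambda: defaultdict(set))
    # audit_log holds one (service, owner, kind) entry per successful read
    audit_log: list = field(default_factory=list)

    def deposit(self, owner, source, kind, value):
        """Store one record on behalf of its owner."""
        self.records[owner].append((source, kind, value))

    def view(self, owner, kind):
        """Return the owner's records of one kind, across all sources."""
        return [r for r in self.records[owner] if r[1] == kind]

    def grant(self, owner, service):
        """Let a service read this owner's data."""
        self.grants[owner].add(service)

    def read(self, service, owner, kind):
        """Read on behalf of a service; refused without a grant, logged otherwise."""
        if service not in self.grants[owner]:
            raise PermissionError(f"{service} has no grant from {owner}")
        self.audit_log.append((service, owner, kind))
        return self.view(owner, kind)

    def delete(self, owner, predicate):
        """Retroactively remove every record for which predicate(record) is true."""
        self.records[owner] = [r for r in self.records[owner] if not predicate(r)]


bank = DataBank()
bank.deposit("alice", "Bing", "query", "great wolf lodge")
bank.deposit("alice", "Google", "query", "walking shoes")
bank.deposit("alice", "Facebook", "photo", "college_party.jpg")

# One view of every query, regardless of which search engine logged it.
all_queries = bank.view("alice", "query")

# A new service can only read after it has been granted access.
bank.grant("alice", "new_search")
shared = bank.read("new_search", "alice", "query")

# Remove the embarrassing pictures from college.
bank.delete("alice", lambda record: record[1] == "photo")
```

A real data bank would of course need authentication, durable storage, and a tamper-resistant audit log; the sketch only illustrates the shape of the interface.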

One reason personal behavioral data has value to companies is that they can use it to create a personalized experience in a way that other companies without access to the same data cannot. A challenge with data monopolies is that companies can use them to their advantage to lock users in and keep competitors out. People get locked in to companies because the companies own their data. If we were able to claim ownership of our own data using data banks, we could avoid getting stuck with companies because they own our data. Instead, when we join a new service we could grant access to our collection of relevant personal usage data.

Data banks would also allow us to monetize our data. Currently, we give our personal behavioral information to companies in exchange for the services we receive. However, this transaction is implicit. Data banks would let us get explicit value from our data instead. Companies that want to use aggregate usage data could purchase it from the bank, enabling new companies to start providing high quality services from the get go. The money made from the purchase could then be shared back with us, as the data owners.
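As a rough illustration of how the sale of aggregate data and the sharing of proceeds might work, here is a small sketch. The function names, the three-user minimum, the 20 percent bank fee, and the rule of paying contributors in proportion to the number of records they contributed are all assumptions of mine, chosen only to make the example concrete. Note that suppressing rare items is a very simple safeguard and does not by itself guarantee anonymity.

```python
from collections import Counter

MIN_USERS = 3  # assumed threshold: hide anything seen from fewer than 3 users


def aggregate_queries(queries_by_user, min_users=MIN_USERS):
    """Count distinct users per query, dropping queries too rare to release."""
    counts = Counter()
    for queries in queries_by_user.values():
        counts.update(set(queries))  # each user counts at most once per query
    return {query: n for query, n in counts.items() if n >= min_users}


def split_proceeds(price_cents, queries_by_user, bank_fee_percent=20):
    """Divide a sale among contributors in proportion to records contributed.

    Amounts are whole cents; any remainder from rounding down stays with the bank.
    """
    pool = price_cents * (100 - bank_fee_percent) // 100
    total = sum(len(queries) for queries in queries_by_user.values())
    if total == 0:
        return {user: 0 for user in queries_by_user}
    return {
        user: pool * len(queries) // total
        for user, queries in queries_by_user.items()
    }


# Hypothetical usage data held by the bank for three people.
usage = {
    "alice": ["weather", "great wolf lodge"],
    "bob": ["weather"],
    "carol": ["weather"],
}

for_sale = aggregate_queries(usage)    # only "weather" is common enough to sell
payouts = split_proceeds(1000, usage)  # a $10.00 sale, expressed in cents
```

Paying by number of records is only one possible rule. An equal split per contributor, or a payout weighted by how useful the data turned out to be, would fit the same interface.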

Of course, there are also a lot of challenges that make data banks unlikely to happen anytime in the near future. Companies that currently own data are unlikely to want to give up their monopolies. Additionally, a successful data bank would require a lot of trust and must provide security and transparency, making it possible for others to audit how data is used. And while our behavioral data is valuable, logged data can be hard to understand outside of the context in which it was collected. Different systems log different content, and a lot of the data gathered is very fine grained and system dependent. The state of the system matters for understanding the data, and it may take time to identify good. But it is fun to imagine our data transactions made explicit in a way that breaks existing data monopolies to enable new opportunities for end users, companies, and researchers.

Wednesday, October 8, 2014

Help! I'm Sexist!

The research studies I posted last Friday about the role gender plays in the STEM workplace paint a consistent picture: women face significant discrimination. Women are paid (and hired, and tenured) less than men with the same qualifications, and these gender differences are particularly large for parents. While women are often encouraged to address the existing disparities by advocating for themselves (e.g., by being assertive, negotiating, or encouraging diversity), research shows this type of behavior typically incurs a further penalty.

Instead, gender disparities in the STEM workplace are a problem that the entire community must address. Hiring managers need to hire more women. Managers need to promote more women. And peers need to accept diverse communication styles without the lens of gender.

Importantly, however, this does not just mean that MEN need to hire (and promote, and accept) more. Because the other consistent picture that arose from the studies I posted on Friday is that both men AND WOMEN discriminate against women. We all have deep seated biases that contribute to the problem.

Friday, October 3, 2014

Research about Gender in the STEM Workplace

Science Faculty’s Subtle Gender Biases Favor Male Students by Corinne A. Moss-Racusin et al.
In a study with 127 science faculty at research-intensive universities, candidates with identical resumes were more likely to be offered a job and paid more if their name was "John" instead of "Jennifer." The gender of the faculty participating did not impact the outcome.

How Stereotypes Impair Women’s Careers in Science by Ernesto Reuben et al.
Men are much more likely than women to be hired for a math task, even when equally qualified. This happens regardless of the gender of the hiring manager.

Measuring the Glass Ceiling Effect: An Assessment of Discrimination in Academia by Katherine Weisshaar
In computer science, men are significantly more likely to earn tenure than women with the same research productivity. [From a summary]

Wednesday, August 13, 2014

Evidence from Behavior


Doug Oard at the Information School at the University of Maryland is teaching an open online course on information retrieval this fall (INST 734). Above is the brief cameo lecture I recorded using Office Mix for the segment on Evidence from Behavior.

Tuesday, July 29, 2014

The #GreatWalk Recap

Cale and I completed our 100-mile #GreatWalk from Bellevue, WA to Great Wolf Lodge. We live-blogged on Twitter as we walked, and I have recorded our tweets on this blog in chronological order to make them easy to read. Thanks for sharing our journey with us!
  • Day 1: We depart!
  • Day 2: A long walk to the airport
  • Day 3: Getting tired and frustrated
  • Day 4: A candy discovery
  • Day 5: Skirting the military base
  • Day 6: A wet and rainy day
  • Day 7: Into the wilderness
  • Day 8: We arrive at Great Wolf!
  • Day 9: A day of rest
  • Day 10: The trip home

#GreatWalk: Day 10

[This post includes my tweets (@jteevan) from the tenth day (July 27, 2014) of Cale's and my 100-mile walk to Great Wolf!]

#GreatWalk: Day 9

[This post includes my tweets (@jteevan) from the ninth day (July 26, 2014) of Cale's and my 100-mile walk to Great Wolf!]

#GreatWalk: Day 8

[This post includes my tweets (@jteevan) from the eighth day (July 25, 2014) of Cale's and my 100-mile walk to Great Wolf!]

#GreatWalk: Day 7

[This post includes my tweets (@jteevan) from the seventh day (July 24, 2014) of Cale's and my 100-mile walk to Great Wolf!]

#GreatWalk: Day 6

[This post includes my tweets (@jteevan) from the sixth day (July 23, 2014) of Cale's and my 100-mile walk to Great Wolf!]