Friday, May 23, 2014

Related Work: Find It If You Can

Find It If You Can: A Game for Modeling Different Types of Web Search Success Using Interaction Data
Mikhail Ageev, Qi Guo, Dmitry Lagun, Eugene Agichtein
SIGIR 2011

A lot of recent search research makes use of large-scale query log analysis. This poses a challenge for researchers who are not affiliated with a major search engine, because there are very few public logs available to the research community. Find It If You Can, by Ageev et al., provides a nice example of how information retrieval researchers might gather their own search logs. And in doing so, it also addresses a persistent challenge with using logs: the fact that while we know what users are doing, we don't actually know what they are thinking.

Ostensibly, the Find It If You Can paper builds a general model for search success, instantiates it for different query subsets, and uses it to predict search success. In doing so, the authors learn a number of interesting things, including:
  • Both expert and non-expert searchers tend to start off a search session with a good query.
  • But experts are more likely to identify the relevant results returned by the good query.
  • In contrast, non-experts have a harder time recognizing good results, or even finding what they are looking for within a good result.
  • For hard search topics, people are more likely to start with a bad query.
  • With hard searches, it is also hard to recognize a correct result or find the answer in a correct result.
But my favorite part of the paper is how it gathered the data that was used for this analysis. The authors created a search "game" that gave players a set of 10 fact-finding questions (e.g., "Where and Who exactly did buy louisiana from Napoleon?" [sic]). A player's interactions with the search engine were logged, as were the URL and answer that they found.

[Note: The system is available for use by other researchers. For example, we used it in a recent CIKM paper that studied how people interact with dynamic search results.]

Players were asked to answer difficult but concrete questions selected from Yahoo! Answers. Because the authors used pre-selected questions, the logs collected were inherently somewhat artificial. People weren't doing their own, self-motivated tasks. However, the artificial setup provided a number of advantages, including:
  • The end goal was known a priori.
  • Many different people performed the same search with the same end goal.
  • It is easy to evaluate whether the search was successful and a correct answer found.
  • All of the data collected represents difficult fact-finding searches, a tail behavior that only occurs in a small fraction of more naturalistic search logs.
Because the tasks were provided, motivation was clearly a challenge. Two hundred Mechanical Turk workers were paid $1 to play, and given another $1 bonus if they played the game particularly well. The authors found this bonus was necessary to keep players working on challenging questions, but I would guess the observed behavior is somewhat different from self-motivated search behavior regardless.

As is typical of all log analysis, data cleaning was important even for this actively constructed dataset. However, at times the cleaning needed to be done differently (e.g., to drop unmotivated workers).
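To make the cleaning step concrete, here is a minimal sketch of what dropping unmotivated workers from a game log might look like. The record format and the thresholds are my own assumptions for illustration, not the paper's actual procedure:

```python
def clean_sessions(sessions, min_queries=1, min_duration_sec=10):
    """Drop sessions that look unmotivated: no queries issued,
    or abandoned almost immediately. Thresholds are illustrative."""
    return [
        s for s in sessions
        if s["num_queries"] >= min_queries
        and s["duration_sec"] >= min_duration_sec
    ]

raw = [
    {"worker": "w1", "num_queries": 3, "duration_sec": 95},
    {"worker": "w2", "num_queries": 0, "duration_sec": 4},  # likely unmotivated
]
cleaned = clean_sessions(raw)  # keeps only w1's session
```

The point is simply that a constructed dataset still needs filters, just different ones than a naturalistic log (where you might instead remove bot traffic or misfired queries).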

It is interesting to think about whether there was additional information the researchers might have collected to facilitate their analyses. For example, the paper identifies expert searchers by their behavior, which is consistent with previous log-based studies of expertise. Players who answered eight or more questions correctly (out of 10) were considered high performers, while those who answered fewer than five correctly were low performers. But another way to make this split would have been to directly ask people how often they search. Additional feedback about the search process could have been collected as well (e.g., relevance assessments of individual pages) and mapped to the observed behavior.
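The performance split described above is simple enough to sketch. The two thresholds (eight or more correct for high performers, fewer than five for low) come from the paper as summarized here; how the middle band is handled is my assumption:

```python
def performance_group(correct_answers):
    """Classify a player by correct answers out of 10, using the
    thresholds reported in the paper. Treatment of the 5-7 band
    is an assumption; such players may simply have been excluded
    from the expert/non-expert comparison."""
    if correct_answers >= 8:
        return "high"
    if correct_answers < 5:
        return "low"
    return "middle"

scores = {"p1": 9, "p2": 4, "p3": 6}
groups = {player: performance_group(n) for player, n in scores.items()}
# groups: p1 -> "high", p2 -> "low", p3 -> "middle"
```

A self-reported measure (e.g., "how often do you search?") could then be compared against this behavioral grouping to see how well the two align.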

Of course, the more the researcher intervenes in the search process, the more artificial the behavior becomes. For this reason, I really appreciate the fact that the paper provides some limited validation of its dataset against naturalistic logs collected from the Emory library. It also does some detailed comparison with related work (e.g., a paper by White and Morris on search expertise). Beyond direct comparisons of the logged data, it may also have been worthwhile to capture demographic information about the players to get an idea of how representative they were of searchers in general.

I look forward to seeing more studies like this!
