Summary of Discussion Session at HCIR'07
What follows is not a transcript but a very brief summary of the viewpoints expressed during the closing discussion session at HCIR'07. If something you said is not accurately represented, please let us know.
Daniel Tunkelang opened the session by introducing a topic of interest: HCIR evaluation. How can we do it? What directions should we look in for solutions?
Bill Woods: Recall and precision curves are good for comparing search engines, but not for measuring "something vs. nothing" success - whether they return at least one relevant result. He prefers the "success rate" - the frequency with which the user finds an answer in the top 10 hits (a measure related to, though not the same as, precision at 10). Bill reminded us that recall is a subjective measurement, and cautioned against reducing success to a single number. More useful is measuring a lot of things and determining "what does this one find that that one didn't, and what results are in common."
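To make the measures in Bill's comment concrete, here is a minimal Python sketch. It is an illustration only, not something presented at the session: the function names, the document IDs, and the representation of results as ranked lists of IDs are all assumptions. It shows that "success at 10" (is there at least one relevant hit in the top 10?) and precision at 10 (what fraction of the top 10 is relevant?) can differ for the same result list, and how the results of two systems might be compared as sets.

```python
from typing import Dict, Sequence, Set, Tuple


def success_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """Return 1.0 if at least one relevant document is in the top k, else 0.0."""
    return 1.0 if any(doc in relevant for doc in ranked[:k]) else 0.0


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """Return the fraction of the top k positions holding relevant documents."""
    return sum(doc in relevant for doc in ranked[:k]) / k


def success_rate(runs: Sequence[Tuple[Sequence[str], Set[str]]], k: int = 10) -> float:
    """Average success_at_k over a collection of (ranked list, relevant set) pairs."""
    return sum(success_at_k(ranked, rel, k) for ranked, rel in runs) / len(runs)


def compare_found(found_a: Set[str], found_b: Set[str]) -> Dict[str, Set[str]]:
    """Split two systems' relevant finds into A-only, B-only, and shared."""
    return {
        "only_a": found_a - found_b,
        "only_b": found_b - found_a,
        "common": found_a & found_b,
    }


# Hypothetical query: ten results, only one of which (d3) is relevant.
ranked = [f"d{i}" for i in range(1, 11)]
relevant = {"d3"}

s10 = success_at_k(ranked, relevant)    # the user can find an answer
p10 = precision_at_k(ranked, relevant)  # but only 1 of the 10 hits is relevant

# Hypothetical relevant documents found by two systems.
overlap = compare_found({"d1", "d3", "d7"}, {"d3", "d8"})
```

In this made-up case the query counts as a success even though precision at 10 is low. The set comparison reports what each system found that the other did not, which is the kind of multi-faceted view the comment favors over a single number.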
David Karger: Evaluation is driven by the task. Many non-traditional tasks still have concrete evaluation methods. More interesting is the class of tasks where you're not even clear on what success is; success needs to be defined before appropriate evaluation methods can be determined.
Bill Kules: We should broaden our idea of evaluation beyond the traditional outcome-oriented measures from IR and HCI. We can also use behavioral measures such as changing patterns of interactions. For example, are we seeing patterns that emerge when people use systems over time? Do the systems support certain tactics? Such evaluation has more traction for exploratory search where you don't have a ground truth.
Mark Maybury: Keep in mind that defining a joint error measure and optimizing for it is different from optimizing for each measure individually in a pipeline.
Shiry Ginosar: It's important to be able to directly compare interfaces, rather than just reporting "happiness," "success," etc. In IR, there is a standard set of documents that different IR systems can be compared on. It's important to have such a "box" that you can hook up to different interfaces in order to compare them. In this case, the box should include users; we can evaluate the system and the user together [see abstract "Human Computation for HCIR Evaluation" for more detail].
Yifen Huang: We need to have a feedback loop including the user that can be translated back into updating the function that the computer is optimizing for [see abstract "Reasoning and Learning in Mixed-Initiative Tasks" for more detail].
Mike Stonebraker: Most queries come from dynamically generated web pages. He would love to see work done on improving access to dynamically generated content. A lot of work is being done on navigation, but relevance ranking is still very important.
David Karger: A lot of what we're looking at is navigation, but navigation isn't an end - it's a tool.
Daniel Tunkelang: So is search.
Bill Kules: There is work being done in Europe on context and information seeking needs, focusing on the larger organizational goal. Each level has opportunities for evaluation.
Max Van Kleek: Interfaces should be sensitive to users' individual needs - what mood we're in, what task we're doing, etc., in order to tune how much load to give to the user under what conditions.
Mark Maybury: I was struck by how simple the usage models were in the papers presented today - what about building causal models of what users are doing? To me, the stressful part is how to make sense of all of the information coming in. This is very different from "what web page should I go to next." I think there is a spectrum of models capturing different levels of complexity.
Michael Bernstein: Ryen White identified two important use cases. Once you can identify what people are doing, then you can start to improve their lives in that way. "Is this interface better?" is always going to be a question because different interfaces are better for different things. The important question is under what circumstances those interfaces are useful.
At this point, we ran out of time...
Interesting related work from 1996 (!) is in the Mira Workshop.