Research Abstracts - 2007
Nuggeteer: Complex Information Retrieval Evaluation

Gregory Marton & Alexey Radul

The Problem

Evaluation is a critical component of any development effort, and automatic evaluation lets developers experiment with their systems freely, to find out which components contribute most. Nuggeteer is an open-source automatic evaluation tool for systems that address a new type of complex information retrieval need. Nuggeteer uses existing human judgements to approximate official evaluations, and solicits additional judgements from developers for increased accuracy.

Motivation

While many information retrieval systems focus on returning relevant documents, and question answering has focused on finding exact named-entity answers to simple factoid questions, many interesting questions are too complex to be evaluated with either document lists or answer patterns. A complex question has a multi-part answer, and each part may describe a fact about the world that can be stated in a wide variety of ways. For example, for the TREC definition question "Who is Jar Jar Binks?", one might want the answer to contain four critical facts. These facts may be stated or implied in any number of ways, for example: "One grating hero in this first Star Wars prequel is the CG creation Jar Jar Binks." The TREC definition, "other", and relationship questions [4] have used expensive human assessors to judge whether each system response (like the sentence above) contains the facts that comprise a complex answer. Nuggeteer uses these judgements to approximate human evaluations automatically, allowing developers to track their systems' successes and failures, and to compare different approaches or parameter settings.
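To make the task concrete, the sketch below shows, in Python, the kind of data a nugget-based evaluator consumes: nugget descriptions for a question, and human judgements recording which nuggets each system response contains. The sample nuggets paraphrase facts stated in the example sentence above; they are illustrative, not the official TREC nugget list, and all names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Nugget:
    """One atomic fact a complete answer should convey."""
    nugget_id: str
    description: str
    vital: bool  # TREC distinguishes vital nuggets from merely acceptable ones

@dataclass
class Judgement:
    """A human assessor's record of which nuggets a system response contains."""
    response: str                                    # one system response, e.g. a sentence
    nugget_ids: list[str] = field(default_factory=list)

# Illustrative nuggets for "Who is Jar Jar Binks?"
nuggets = [
    Nugget("1", "character in the first Star Wars prequel", vital=True),
    Nugget("2", "computer-generated (CG) creation", vital=False),
]

# One judged response: the assessor found both nuggets in it.
judgements = [
    Judgement(
        response="One grating hero in this first Star Wars prequel "
                 "is the CG creation Jar Jar Binks.",
        nugget_ids=["1", "2"],
    ),
]
```

A pool of such judgements, accumulated from past official evaluations, is exactly what an automatic evaluator can reuse to score new system outputs.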
Previous Work

The Qaviar system [1] is similar in spirit and implementation to Nuggeteer, but it has some limitations due to the different task it was created to handle, and it is not generally available. Pourpre [2] was the first freely available automatic evaluation system for the same set of target tasks, and it uses the same idea of weighted keyword-overlap matching as Qaviar and Nuggeteer do, but Pourpre's scores do not facilitate comparisons between evaluated systems. Nuggeteer offers three important improvements over these systems.

Approach

Like previous evaluation systems, Nuggeteer [3] relies on keyword overlap with known-correct answers to identify probably-correct responses. We optimize our matching over all combinations of the parameter settings that influence the matching process, including stemming, n-gram length, term weighting, stopword removal, and acceptance thresholds. We measure the ranking agreement between Nuggeteer's and official scores, as our predecessors did, as well as absolute agreement and confidence intervals.
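The sketch below illustrates this style of weighted keyword-overlap matching, assuming simple per-term weights and a tunable acceptance threshold. It shows the general technique only; it is not Nuggeteer's actual implementation, and the function names, the toy stopword list, and the threshold value are all assumptions.

```python
import re

STOPWORDS = {"the", "a", "an", "of", "is", "in", "and", "this"}  # toy list

def tokens(text, remove_stopwords=True):
    """Lowercase word tokens, optionally dropping stopwords."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if not (remove_stopwords and w in STOPWORDS)]

def contains_nugget(response, judged_texts, weights=None, threshold=0.5):
    """Guess whether `response` contains a nugget, by weighted keyword
    overlap with texts a human already judged to contain that nugget.
    Returns (decision, best_overlap_score)."""
    weights = weights or {}
    resp = set(tokens(response))
    best = 0.0
    for judged in judged_texts:
        jtok = set(tokens(judged))
        if not jtok:
            continue
        total = sum(weights.get(w, 1.0) for w in jtok)
        hit = sum(weights.get(w, 1.0) for w in jtok if w in resp)
        best = max(best, hit / total)
    return best >= threshold, best

# Usage: score a new response against the judged sentence from above.
judged = ["One grating hero in this first Star Wars prequel "
          "is the CG creation Jar Jar Binks."]
decision, score = contains_nugget(
    "Jar Jar Binks is a CG character in the first Star Wars prequel.",
    judged)
print(decision, round(score, 2))  # True 0.64
```

In a Nuggeteer-style setup, parameters such as the threshold, stemming, and n-gram length would be tuned by exhaustive search against existing official judgements, and agreement with official scores would be checked both by rank correlation and in absolute terms.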
Research Support

This work is supported in part by the Disruptive Technology Office as part of the AQUAINT Phase 3 research program.

References

[4] Ellen Voorhees. Overview of the TREC 2005 Question Answering Track. NIST publication, 2005.