BERDS: A Benchmark for Retrieval Diversity for Subjective Questions

New York University
Figure 1: The BERDS benchmark consists of subjective questions and the diverse perspectives associated with them. Retrieval systems are evaluated on whether their retrieved documents cover these perspectives.

Abstract

We study retrieving a set of documents that covers various perspectives on a complex and contentious question (e.g., would ChatGPT do more harm than good?).

First, we curate a Benchmark for Retrieval Diversity for Subjective questions (BERDS), where each example consists of a question and the diverse perspectives associated with it, sourced from survey questions and debate websites. This task diverges from most retrieval tasks, where document relevance can be evaluated with simple string matching against reference answers. To evaluate retrievers on this task, we build an automatic evaluator that decides whether each retrieved document contains a perspective.
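To make this concrete, below is a minimal sketch of how such a perspective evaluator could be implemented with an LLM judge. The prompt wording, model name, and the `perspective_in_document` helper are illustrative assumptions, not the evaluator used in the paper.

```python
# Toy LLM-based judge: does this document express the given perspective?
# Assumes the `openai` package and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

def perspective_in_document(question: str, perspective: str, document: str) -> bool:
    """Return True if the model judges that `document` contains `perspective`."""
    prompt = (
        f"Question: {question}\n"
        f"Perspective: {perspective}\n"
        f"Document: {document}\n\n"
        "Does the document express or support the perspective above? "
        "Answer with a single word: yes or no."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower().startswith("yes")
```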

Our experiments show that existing retrievers struggle to surface diverse perspectives. Re-ranking and query expansion approaches encourage retrieval diversity and yield substantial gains over the base retrievers. Yet retrieving diverse documents from a large, web-scale corpus remains challenging: existing retrievers cover all perspectives within the top five documents only about 30% of the time. Our work provides benchmark datasets and an evaluation framework, laying the foundation for future studies of retrieval diversity for complex queries.

Baseline Performances

We evaluate existing retrievers on the BERDS benchmark, using BM25, DPR, and Contriever as baselines. MRecall@k measures the percentage of questions for which all perspectives are covered by the top k retrieved documents. Precision@k measures the percentage of retrieved documents that contain a perspective. (k = 5 in the table below.)
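Both metrics can be computed directly from per-document coverage judgments. The sketch below is an illustrative Python implementation; the `evaluate_retrieval` function and its input format are our own assumptions, not the benchmark's official scoring code.

```python
def evaluate_retrieval(examples, k=5):
    """Compute MRecall@k and Precision@k over a list of questions.

    Each example is expected to provide:
      - "perspectives": the set of reference perspectives for the question
      - "doc_coverage": a rank-ordered list with one set per retrieved
        document, holding the reference perspectives that document contains
        (as judged by the automatic evaluator)
    """
    all_covered = 0           # questions whose perspectives are all covered in the top k
    docs_with_perspective = 0
    total_docs = 0

    for ex in examples:
        top_k = ex["doc_coverage"][:k]
        covered = set().union(*top_k) if top_k else set()
        if ex["perspectives"] <= covered:
            all_covered += 1
        docs_with_perspective += sum(1 for doc in top_k if doc)
        total_docs += len(top_k)

    return {
        f"MRecall@{k}": all_covered / len(examples),
        f"Precision@{k}": docs_with_perspective / total_docs,
    }
```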

[Baseline results: MRecall@5 and Precision@5 for BM25, DPR, and Contriever]

If you would like to evaluate your own model, follow the instructions in the GitHub repository.
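As a rough illustration only (not the repository's actual interface), evaluating your own retriever could look like the following, combining the evaluator and metric sketches above. The file name and field names here are hypothetical.

```python
import json

# Hypothetical retrieval output: one JSON object per line with a question,
# its reference perspectives, and the top retrieved documents.
examples = []
with open("my_retriever_output.jsonl") as f:
    for line in f:
        record = json.loads(line)
        doc_coverage = [
            {p for p in record["perspectives"]
             if perspective_in_document(record["question"], p, doc)}
            for doc in record["retrieved_documents"]
        ]
        examples.append({
            "perspectives": set(record["perspectives"]),
            "doc_coverage": doc_coverage,
        })

print(evaluate_retrieval(examples, k=5))
```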