BERDS: A Benchmark for Retrieval Diversity for Subjective Questions

New York University
Figure 1: The BERDS benchmark consists of subjective questions and the diverse perspectives associated with them. Retrieval systems are evaluated on whether their retrieved documents cover these perspectives.

Abstract

We study retrieving a set of documents that covers various perspectives on a complex and contentious question (e.g., would ChatGPT do more harm than good?).

First, we curate a Benchmark for Retrieval Diversity for Subjective questions (BERDS), where each example consists of a question and the diverse perspectives associated with it, sourced from survey questions and debate websites. This task diverges from most retrieval tasks, where document relevance can be evaluated with simple string matching against reference answers. To evaluate retrievers on this task, we build an automatic evaluator that decides whether each retrieved document contains a perspective.
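To make this concrete, below is a minimal sketch of how such a perspective evaluator could be implemented with an LLM judge. The prompt wording, model name, and the `perspective_in_document` helper are illustrative assumptions, not the evaluator used in the paper.

```python
# Toy LLM-based judge: does this document express the given perspective?
# Assumes the `openai` package and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

def perspective_in_document(question: str, perspective: str, document: str) -> bool:
    """Return True if the model judges that `document` contains `perspective`."""
    prompt = (
        f"Question: {question}\n"
        f"Perspective: {perspective}\n"
        f"Document: {document}\n\n"
        "Does the document express or support the perspective above? "
        "Answer with a single word: yes or no."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower().startswith("yes")
```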

Our experiments show that existing retrievers struggle to surface diverse perspectives. Re-ranking and query expansion approaches encourage retrieval diversity and yield substantial gains over the base retrievers. Yet retrieving diverse documents from a large, web-scale corpus remains challenging: existing retrievers cover all perspectives within the top five documents only about 30% of the time. Our work provides benchmark datasets and an evaluation framework, laying the foundation for future studies of retrieval diversity for complex queries.

Baseline Performances

We evaluate existing retrievers on the BERDS benchmark, using BM25, DPR, and Contriever as baselines. MRecall@k measures the percentage of questions for which all perspectives are covered by the top k retrieved documents. Precision@k measures the percentage of retrieved documents that contain a perspective. (k = 5 in the table below.)
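Both metrics can be computed directly from per-document coverage judgments. The sketch below is an illustrative Python implementation; the `evaluate_retrieval` function and its input format are our own assumptions, not the benchmark's official scoring code.

```python
def evaluate_retrieval(examples, k=5):
    """Compute MRecall@k and Precision@k over a list of questions.

    Each example is expected to provide:
      - "perspectives": the set of reference perspectives for the question
      - "doc_coverage": a rank-ordered list with one set per retrieved
        document, holding the reference perspectives that document contains
        (as judged by the automatic evaluator)
    """
    all_covered = 0           # questions whose perspectives are all covered in the top k
    docs_with_perspective = 0
    total_docs = 0

    for ex in examples:
        top_k = ex["doc_coverage"][:k]
        covered = set().union(*top_k) if top_k else set()
        if ex["perspectives"] <= covered:
            all_covered += 1
        docs_with_perspective += sum(1 for doc in top_k if doc)
        total_docs += len(top_k)

    return {
        f"MRecall@{k}": all_covered / len(examples),
        f"Precision@{k}": docs_with_perspective / total_docs,
    }
```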

[Baseline results: MRecall@5 and Precision@5 for BM25, DPR, and Contriever]

If you would like to evaluate your own model, follow the instructions in the GitHub repository.
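As a rough illustration only (not the repository's actual interface), evaluating your own retriever could look like the following, combining the evaluator and metric sketches above. The file name and field names here are hypothetical.

```python
import json

# Hypothetical retrieval output: one JSON object per line with a question,
# its reference perspectives, and the top retrieved documents.
examples = []
with open("my_retriever_output.jsonl") as f:
    for line in f:
        record = json.loads(line)
        doc_coverage = [
            {p for p in record["perspectives"]
             if perspective_in_document(record["question"], p, doc)}
            for doc in record["retrieved_documents"]
        ]
        examples.append({
            "perspectives": set(record["perspectives"]),
            "doc_coverage": doc_coverage,
        })

print(evaluate_retrieval(examples, k=5))
```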