We study retrieving a set of documents that covers various perspectives on a complex and contentious question
(e.g., "would ChatGPT do more harm than good?").
First, we curate a Benchmark for Retrieval
Diversity for Subjective questions (BERDS), where each example consists of a question and
diverse perspectives associated with the question, sourced from survey questions and debate websites. This task
diverges from most retrieval tasks, where document relevance can be evaluated with simple string matching against
reference answers. To evaluate the performance of retrievers on this task, we build an automatic evaluator that
decides whether each retrieved document contains a given perspective.
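For illustration, such an evaluator can be framed as a binary judgment over a (question, perspective, document) triple. The sketch below is one minimal way to realize it with an LLM judge; the `generate(prompt)` callable and the prompt wording are hypothetical stand-ins, not our exact implementation.

```python
# Illustrative perspective-containment evaluator. `generate` is a hypothetical
# callable wrapping any instruction-following LLM; prompt wording is a sketch.
PROMPT = (
    "Question: {question}\n"
    "Perspective: {perspective}\n"
    "Document: {document}\n\n"
    "Does the document contain or support the perspective above? Answer yes or no."
)

def contains_perspective(generate, question, perspective, document):
    """Return True if the model judges the document to express the perspective."""
    answer = generate(PROMPT.format(question=question,
                                    perspective=perspective,
                                    document=document))
    return answer.strip().lower().startswith("yes")
```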
Our experiments show that existing retrievers
struggle to surface diverse perspectives. Re-ranking and query expansion approaches encourage retrieval diversity and yield substantial gains over base retriever performance.
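One family of diversity-promoting re-rankers follows maximal marginal relevance (MMR): greedily pick documents that are relevant to the query yet dissimilar to those already selected. The sketch below shows this strategy under the assumption of L2-normalized query and document embeddings; it is one plausible instantiation, not necessarily the exact re-rankers evaluated here.

```python
import numpy as np

def mmr_rerank(query_emb, doc_embs, lam=0.7, k=5):
    """Greedy MMR re-ranking: trade off query relevance against redundancy.

    query_emb: (d,) query embedding; doc_embs: (n, d) candidate embeddings,
    both assumed L2-normalized so dot products are cosine similarities.
    Returns the indices of the re-ranked top-k documents.
    """
    relevance = doc_embs @ query_emb  # cosine similarity to the query
    selected, remaining = [], list(range(len(doc_embs)))
    while remaining and len(selected) < k:
        def score(i):
            # Penalize similarity to documents already selected.
            redundancy = max((doc_embs[i] @ doc_embs[j] for j in selected),
                             default=0.0)
            return lam * relevance[i] - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```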
Yet, retrieving diverse documents from a large, web-scale corpus remains challenging: existing retrievers cover all perspectives within their top five documents only 30% of the time. Our work presents benchmark datasets and an evaluation framework, laying the foundation for future studies of retrieval diversity for complex queries.
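As a concrete reading of the coverage statistic above, the sketch below computes the fraction of questions whose top-k retrieved documents jointly contain every reference perspective. The `contains(doc, perspective)` judge is a hypothetical interface; in practice it would close over the question and an evaluator like the one sketched earlier.

```python
def perspective_coverage_at_k(docs, perspectives, contains, k=5):
    """True iff every reference perspective appears in at least one top-k doc."""
    top_k = docs[:k]
    return all(any(contains(doc, p) for doc in top_k) for p in perspectives)

def mean_coverage_at_k(examples, contains, k=5):
    """Fraction of questions whose top-k results cover all perspectives."""
    covered = sum(
        perspective_coverage_at_k(ex["docs"], ex["perspectives"], contains, k)
        for ex in examples
    )
    return covered / len(examples)
```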