Taja Kuzman, Tanja Pavleska, Urban Rupnik and Primož Cigoj
Abstract
Retrieval-augmented generation (RAG) is a recent method for enriching the text generation abilities of large language models with external knowledge through document retrieval. Owing to its usefulness across a wide range of applications, it already powers multiple products. However, despite this widespread adoption, there is a notable lack of evaluation benchmarks for RAG systems, particularly for less-resourced languages. This paper introduces
PandaChat-RAG, the first Slovenian RAG benchmark, established on a newly developed test dataset. The test dataset is based on the semi-automatic extraction of authentic questions and answers from a genre-annotated web corpus. The methodology for constructing the test dataset can be efficiently applied to comparable corpora available in numerous European languages.
test dataset is used to assess the RAG system’s performance in retrieving relevant sources essential for providing accurate answers
to the given questions. The evaluation involves comparing the
performance of eight open- and closed-source embedding models,
and investigating how the retrieval performance is influenced
by factors such as the document chunk size and the number of
retrieved sources. These findings contribute to establishing the
guidelines for optimal RAG system configurations not only for
Slovenian, but also for other languages.