How LEPISZCZE was Designed and Created: A Comprehensive Benchmark for Natural Language Processing Tasks in Polish
The increasing availability of computational resources and training data for large language models raises the demand for robust evaluation environments to accurately assess progress in language modelling. In recent years, significant progress has been made in standardising evaluation environments for the English language, with benchmarks such as GLUE, SuperGLUE, and KILT becoming standard tools for assessing language models. When creating benchmarks for other languages, many researchers have focused on replicating the GLUE environment, as exemplified by the Polish benchmark KLEJ.
The seminar will discuss the work on the LEPISZCZE benchmark. The authors provide an overview of efforts to create evaluation environments for low-resource languages, highlighting that many languages still lack a comprehensive set of test data for assessing language models. They identify the current gaps in evaluation environments and compare the tasks available within them, using English and Chinese, languages with abundant training and testing resources, as reference points.
The main result of the work is LEPISZCZE – a new evaluation environment for Polish language technology based on language modelling, featuring a diverse set of test tasks. The proposed environment is designed with the flexibility to add tasks, introduce new language models, submit results, and version data and models. Alongside the environment, the authors also present evaluations of several language models on both improved datasets from the existing literature and new test sets for novel tasks. The environment includes five existing datasets and eight new datasets that have not previously been used to evaluate language models. The article also shares experiences and insights gained from developing the LEPISZCZE evaluation environment, which can serve as guidance for designers of similar environments for other languages with limited linguistic resources.
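To make the benchmark's workflow more concrete, the sketch below shows how one of its Polish datasets might be loaded and prepared for evaluation with the Hugging Face datasets and transformers libraries. This is a minimal illustration, not the authors' submission pipeline: the dataset identifier "clarin-pl/polemo2-official", the column name "text", and the checkpoint "allegro/herbert-base-cased" are assumptions used for illustration, so the exact identifiers should be checked against the official LEPISZCZE documentation.

```python
# Minimal sketch: loading a Polish dataset and tokenizing it for evaluation.
# Dataset ID, column name, and model checkpoint are assumed for illustration.
from datasets import load_dataset
from transformers import AutoTokenizer

# Download a sentiment-analysis dataset hosted on the Hugging Face Hub (assumed ID).
dataset = load_dataset("clarin-pl/polemo2-official")
print(dataset)  # shows the available splits (e.g. train/validation/test) and their sizes

# Tokenize the texts with a Polish language model (assumed checkpoint name).
tokenizer = AutoTokenizer.from_pretrained("allegro/herbert-base-cased")

def tokenize(batch):
    # Truncate long reviews to the model's maximum input length.
    return tokenizer(batch["text"], truncation=True)

encoded = dataset.map(tokenize, batched=True)
print(encoded["train"][0].keys())  # original fields plus input_ids and attention_mask
```

From this point, the encoded splits can be passed to any standard fine-tuning or evaluation loop, which reflects the benchmark's design goal of letting new models and tasks be plugged in with little additional code.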