Large Language Models as Span Annotators

1Charles University, Prague, Czechia
2ETH Zurich, Switzerland
3Edinburgh Napier University, Scotland, United Kingdom
4trivago N.V., Düsseldorf, Germany
5TU Darmstadt, Germany
Figure: Direct assessment vs. span annotation.

Figure: Examples of span annotation tasks that we automate with LLMs.

Abstract

Span annotation is the task of localizing and classifying text spans according to custom guidelines. Annotated spans can be used to analyze and evaluate high-quality texts for which single-score metrics fail to provide actionable feedback. Until recently, span annotation was limited to human annotators or fine-tuned models. In this study, we show that large language models (LLMs) can serve as flexible and cost-effective span annotation backbones. To demonstrate their utility, we compare LLMs to skilled human annotators on three diverse span annotation tasks: evaluating data-to-text generation, identifying translation errors, and detecting propaganda techniques. We demonstrate that LLMs achieve inter-annotator agreement (IAA) comparable to human annotators at a fraction of the cost per output annotation. We also manually analyze model outputs, finding that LLMs make errors at a similar rate to human annotators. We release a dataset of more than 40k model and human annotations for further research.
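For illustration only, below is a minimal sketch of how an LLM could be prompted to act as a span annotator, returning character-offset spans with categories as JSON. This is not the implementation from the paper: the model name, the error taxonomy, the prompt wording, and the use of the openai Python client are all assumptions made for the example.

# Minimal sketch of LLM-based span annotation (illustrative; not the authors' code).
# Assumes the `openai` Python client and a hypothetical three-category error taxonomy.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

GUIDELINES = """You are a span annotator. Given a text, mark problematic spans.
Categories: INCORRECT (contradicts the data), NOT_CHECKABLE (cannot be verified),
MISLEADING (technically true but deceptive).
Return a JSON list of objects: {"start": int, "end": int, "category": str, "reason": str}.
Character offsets refer to the input text. Return [] if there are no errors."""

def annotate_spans(text: str, model: str = "gpt-4o") -> list[dict]:
    """Ask the LLM to localize and classify spans according to the guidelines."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": GUIDELINES},
            {"role": "user", "content": text},
        ],
    )
    # The model is instructed to reply with JSON only; parse it defensively.
    content = response.choices[0].message.content or "[]"
    try:
        return json.loads(content)
    except json.JSONDecodeError:
        return []

if __name__ == "__main__":
    spans = annotate_spans("The stadium opened in 1995 and seats 20 people.")
    for span in spans:
        print(span["category"], span["start"], span["end"], span["reason"])

In practice, the returned spans can then be compared against human annotations, e.g. to compute inter-annotator agreement or per-category error rates.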

How to cite us

@misc{kasner2025large,
    title={Large Language Models as Span Annotators}, 
    author={Zdeněk Kasner and Vilém Zouhar and Patrícia Schmidtová and Ivan Kartáč and Kristýna Onderková and Ondřej Plátek and Dimitra Gkatzia and Saad Mahamood and Ondřej Dušek and Simone Balloccu},
    year={2025},
    eprint={2504.08697},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2504.08697}, 
}