TU Berlin

Quality and Usability Lab

Reviewed Conference Papers

Best Practices for Crowd-based Evaluation of German Summarization: Comparing Crowd, Expert and Automatic Evaluation
Citation key iskender2020c
Author Iskender, Neslihan and Polzehl, Tim and Möller, Sebastian
Title of Book Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems
Pages 164–175
Year 2020
Location online
Address online
Month nov
Publisher Association for Computational Linguistics (ACL)
Series EMNLP | Eval4NLP
How Published Full paper
Abstract One of the main challenges in the development of summarization tools is summarization quality evaluation. On the one hand, the human assessment of summarization quality conducted by linguistic experts is slow, expensive, and still not a standardized procedure. On the other hand, the automatic assessment metrics are reported not to correlate highly enough with human quality ratings. As a solution, we propose crowdsourcing as a fast, scalable, and cost-effective alternative to expert evaluations to assess the intrinsic and extrinsic quality of summarization by comparing crowd ratings with expert ratings and automatic metrics such as ROUGE, BLEU, or BERTScore on a German summarization data set. Our results provide a basis for best practices for crowd-based summarization evaluation regarding major influential factors such as the best annotation aggregation method, the influence of readability and reading effort on summarization evaluation, and the optimal number of crowd workers to achieve comparable results to experts, especially when determining factors such as overall quality, grammaticality, referential clarity, focus, structure & coherence, summary usefulness, and summary informativeness.