General Claim dataset
The General Claim dataset is a diverse, harmonized dataset created for the task of check-worthy claim detection, addressing the limitations of narrow, specialized datasets currently used in the field. Constructed from five pre-existing datasets, it emphasizes variability across topics, languages, and writing styles, covering content in 10 languages. This dataset enables broader applicability and consistency in training and evaluating models for identifying claims that merit fact-checking.
In our study, we surveyed, for the first time, the existing datasets and models for the task of check-worthy claim detection. Because the state of the art focuses on narrow sub-problems with limited topical, linguistic, and formal variability, we constructed a general claim dataset that fits the task more broadly. It is composed of multiple existing datasets, with emphasis on the variability of topics, languages, and writing styles. Using this dataset, we provide a novel comparison of state-of-the-art large language models (LLMs) fine-tuned specifically for check-worthy claim detection.
We emphasized the variability of the included topics, languages, and writing styles. Based on these criteria, we selected five datasets for inclusion in our General Claim dataset. To ensure the consistency and comparability of the samples, we harmonized the datasets in terms of included languages and number of samples. As a first step, we filtered out all empty and duplicate samples and aggregated the samples from all 10 languages covered by the selected datasets.
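The initial filtering and aggregation step can be illustrated with a minimal Python sketch. It is not the pipeline used to build the dataset. The function name `merge_and_clean`, the field names `text`, `lang`, and `label`, and the definition of a duplicate (identical whitespace-stripped text within the same language) are assumptions made for illustration only.

```python
from typing import Dict, List, Optional


def merge_and_clean(datasets: Dict[str, List[dict]]) -> List[dict]:
    """Aggregate samples from several source datasets into one list,
    dropping empty samples and duplicates.

    Assumption: each sample is a dict with hypothetical keys
    'text', 'lang', and 'label'; the real schema may differ.
    """
    seen = set()   # (lang, text) pairs already kept
    merged = []
    for source_name, samples in datasets.items():
        for sample in samples:
            text = (sample.get("text") or "").strip()
            if not text:
                continue  # skip empty or whitespace-only samples
            lang: Optional[str] = sample.get("lang")
            key = (lang, text)
            if key in seen:
                continue  # skip duplicates (assumed: same text, same language)
            seen.add(key)
            merged.append({
                "source": source_name,
                "lang": lang,
                "text": text,
                "label": sample.get("label"),
            })
    return merged


# Toy input with two hypothetical source datasets.
toy = {
    "dataset_a": [
        {"text": "Unemployment fell by 5% last year.", "lang": "en", "label": 1},
        {"text": "   ", "lang": "en", "label": 0},
    ],
    "dataset_b": [
        {"text": "Unemployment fell by 5% last year.", "lang": "en", "label": 1},
        {"text": "Hola a todos.", "lang": "es", "label": 0},
    ],
}
cleaned = merge_and_clean(toy)
```

In this sketch the first occurrence of a duplicate is kept and its source dataset is recorded, so the per-language and per-source counts can be inspected afterwards when balancing the number of samples.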