GITA4CALAMITA - Evaluating the Physical Commonsense Understanding of Italian LLMs in a Multi-layered Approach: A CALAMITA Challenge
Dec 6, 2024
Giulia Pensa
Ekhi Azurmendi
Julen Etxaniz
Begoña Altuna
Itziar Gonzalez-Dios

Abstract
In the context of the CALAMITA Challenge, we investigate the physical commonsense reasoning capabilities of large language models (LLMs) and introduce a methodology to assess their understanding of the physical world. To this end, we use a test set designed to evaluate physical commonsense reasoning in LLMs for the Italian language. We present a tiered dataset, named the Graded Italian Annotated dataset (GITA), which is written and annotated by a professional linguist. This dataset enables us to focus on three distinct levels of commonsense understanding. Our benchmark aims to evaluate three specific tasks: identifying plausible and implausible stories within our dataset, identifying the conflict that generates an implausible story, and identifying the physical states that make a story implausible. We perform these tasks using LLAMA3, Gemma2 and Mistral. Our findings reveal that, although the models may excel at high-level classification tasks, their reasoning is inconsistent and unverifiable, as they fail to capture intermediate evidence.
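As a rough illustration of how the first task (classifying a story as plausible or implausible) could be run against an off-the-shelf instruction-tuned model, the sketch below prompts a Hugging Face model for a one-word judgment on a single Italian story. The model identifier, the prompt wording, and the two toy stories are assumptions made for illustration only; they are not the prompts, models, or data used in the GITA4CALAMITA benchmark.

```python
# Minimal sketch of a zero-shot plausibility check, assuming an
# instruction-tuned model from the Hugging Face Hub (model id is an
# assumption, not the paper's setup).
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # assumed model id
)

# Hypothetical toy examples: each Italian story is paired with a gold label.
examples = [
    {"story": "Luca versa l'acqua nel bicchiere e poi la beve.", "label": "plausible"},
    {"story": "Luca beve l'acqua dal bicchiere vuoto.", "label": "implausible"},
]

PROMPT = (
    "Read the following Italian story and answer with a single word, "
    "'plausible' or 'implausible', according to physical commonsense.\n\n"
    "Story: {story}\nAnswer:"
)

for example in examples:
    out = generator(
        PROMPT.format(story=example["story"]),
        max_new_tokens=5,
        return_full_text=False,  # keep only the model's continuation
    )
    prediction = out[0]["generated_text"].strip().lower()
    print(f"gold: {example['label']:12s} predicted: {prediction}")
```

The same prompting pattern would extend to the other two tasks by asking the model to point to the conflicting sentence pair or the violated physical state, with accuracy computed against the gold annotations.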
Type
Publication
CLiC-it 2024
Natural Language Processing
Large Language Models
Deep Learning
Evaluation
Commonsense Reasoning
Italian