GITA4CALAMITA - Evaluating the Physical Commonsense Understanding of Italian LLMs in a Multi-layered Approach: A CALAMITA Challenge

Laburpena

In the context of the CALAMITA Challenge, we investigate the physical commonsense reasoning capabilities of large language models (LLMs) and introduce a methodology to assess their understanding of the physical world. To this end, we use a test set designed to evaluate physical commonsense reasoning in LLMs for the Italian language. We present a tiered dataset, named the Graded Italian Annotated dataset (GITA), which is written and annotated by a professional linguist. This dataset enables us to focus on three distinct levels of commonsense understanding. Our benchmark aims to evaluate three specific tasks: identifying plausible and implausible stories within our dataset, identifying the conflict that generates an implausible story, and identifying the physical states that make a story implausible. We perform these tasks using LLAMA3, Gemma2 and Mistral. Our findings reveal that, although the models may excel at high-level classification tasks, their reasoning is inconsistent and unverifiable, as they fail to capture intermediate evidence.

Argitalpena
CLiC-it 2024
Julen Etxaniz
Julen Etxaniz
Hizkuntzaren Azterketa eta Prozesamendua Doktoregoko ikaslea

Hizkuntzaren Azterketa eta Prozesamendua Doktoregoko ikaslea HiTZ Zentroko IXA taldean (UPV/EHU). Hizkuntza ereduak baliabidea urriko hizkuntzetako hobetzeko lanean. Informatika Ingeniaritzan graduatua Software Ingeniaritza espezialitatearekin. Hizkuntzaren Azterketa eta Prozesamendua Masterra.