Evaluation

Natural Language Processing Large Language Models Deep Learning Evaluation Commonsense Reasoning Italian

Challenging the Abilities of Large Language Models in Italian: a Community Initiative

The rapid progress of Large Language Models (LLMs) has transformed natural language processing and broadened its impact across research and society. Yet, systematic evaluation of …

arXiv

•

Malvina Nissim

Danilo Croce

Viviana Patti

Pierpaolo Basile

Giuseppe Attanasio

Elio Musacchio

Matteo Rinaldi

Federico Borazio

Maria Francis

Jacopo Gili

others

• Dec 4, 2025 • 1 min read

arXiv PDF Code

Natural Language Processing Large Language Models Deep Learning Evaluation Multilinguality Basque Linguistic Diversity

BERnaT: Basque Encoders for Representing Natural Textual Diversity

Language models depend on massive text corpora that are often filtered for quality, a process that can unintentionally exclude non-standard linguistic varieties, reduce model …

arXiv

•

Ekhi Azurmendi

Joseba Fernandez de Landa

Jaione Bengoetxea

Maite Heredia

Julen Etxaniz

Mikel Zubillaga

Ander Soraluze

Aitor Soroa

• Dec 3, 2025 • 1 min read

arXiv PDF Code Dataset Model

Natural Language Processing Large Language Models Deep Learning Evaluation Multilinguality Basque Multimodal

Multimodal Large Language Models for Low-Resource Languages: A Case Study for Basque

Current Multimodal Large Language Models exhibit very strong performance for several demanding tasks. While commercial MLLMs deliver acceptable performance in low-resource …

arXiv

•

Lukas Arana

Julen Etxaniz

Ander Salaberria

Gorka Azkune

• Nov 12, 2025 • 1 min read

arXiv PDF Dataset

Natural Language Processing Large Language Models Deep Learning Multilinguality Truthfulness Evaluation

Truth Knows No Language: Evaluating Truthfulness Beyond English

We introduce a professionally translated extension of the TruthfulQA benchmark designed to evaluate truthfulness in Basque, Catalan, Galician, and Spanish. Truthfulness evaluations …

ACL 2025

•

Blanca Calvo Figueras

Eneko Sagarzazu

Julen Etxaniz

Jeremy Barnes

Pablo Gamallo

Iria de-Dios-Flores

Rodrigo Agerri

• Jul 27, 2025 • 1 min read

URL PDF Code Dataset Model

Natural Language Processing Large Language Models Deep Learning Evaluation Commonsense Reasoning Italian

GITA4CALAMITA - Evaluating the Physical Commonsense Understanding of Italian LLMs in a Multi-layered Approach: A CALAMITA Challenge

In the context of the CALAMITA Challenge, we investigate the physical commonsense reasoning capabilities of large language models (LLMs) and introduce a methodology to assess their …

CLiC-it 2024

•

Giulia Pensa

Ekhi Azurmendi

Julen Etxaniz

Begoña Altuna

Itziar Gonzalez-Dios

• Dec 6, 2024 • 1 min read

PDF Code Dataset

Natural Language Processing Large Language Models Deep Learning Evaluation Multilinguality Culture Basque

BertaQA: How Much Do Language Models Know About Local Culture?

Large Language Models (LLMs) exhibit extensive knowledge about the world, but most evaluations have been limited to global or anglocentric subjects. This raises the question of how …

NeurIPS Datasets and Benchmarks 2024

•

Julen Etxaniz

Gorka Azkune

Aitor Soroa

Oier Lopez de Lacalle

Mikel Artetxe

• Jun 11, 2024 • 1 min read

PDF Code Dataset arXiv

Natural Language Processing Large Language Models Deep Learning Evaluation Reproducibility

Lessons from the Trenches on Reproducible Evaluation of Language Models

Effective evaluation of language models remains an open challenge in NLP. Researchers and engineers face methodological issues such as the sensitivity of models to evaluation …

ArXiv

•

Stella Biderman

Hailey Schoelkopf

Lintang Sutawika

Leo Gao

Jonathan Tow

Baber Abbasi

Alham Fikri Aji

Pawan Sasanka Ammanamanchi

Sidney Black

Jordan Clive

Anthony DiPofi

Julen Etxaniz

Benjamin Fattori

Jessica Zosa Forde

Charles Foster

Jeffrey Hsu

Mimansa Jaiswal

Wilson Y. Lee

Haonan Li

Charles Lovering

Niklas Muennighoff

Ellie Pavlick

Jason Phang

Aviya Skowron

Samson Tan

Xiangru Tang

Kevin A. Wang

Genta Indra Winata

François Yvon

Andy Zou

• May 23, 2024 • 1 min read

PDF Code arXiv

Natural Language Processing Large Language Models Evaluation Data Contamination Deep Learning

NLP Evaluation in trouble: On the Need to Measure LLM Data Contamination for each Benchmark

In this position paper, we argue that the classical evaluation on Natural Language Processing (NLP) tasks using annotated benchmarks is in trouble. The worst kind of data …

EMNLP 2023 Findings

•

Oscar Sainz

Jon Ander Campos

Iker García-Ferrero

Julen Etxaniz

Oier Lopez de Lacalle

Eneko Agirre

• Oct 27, 2023 • 1 min read

PDF arXiv

No results found

Evaluation