Biography

PhD Student in Language Analysis and Processing at the Hitz Center, IXA Group, UPV/EHU. Working on improving language models for low-resource languages. Graduate in Informatics Engineering with a specialty in Software Engineering. Master in Language Analysis and Processing.

On this website you will find information about Skills, Certificates, Projects, Tags and Contact.

Interests
  •  Programming
  •  Web Development
  •  Software Engineering
  •  Machine Learning
  •  Deep Learning
  •  Natural Language Processing
Education
  • Degree in Computer Engineering, 2017-2021

    University of the Basque Country (UPV/EHU)

  • Master in Language Analysis and Processing, 2021-2022

    University of the Basque Country (UPV/EHU)

  • PhD in Language Analysis and Processing, 2023-Present

    University of the Basque Country (UPV/EHU)

Experience

UPV/EHU
PhD Student in Language Analysis and Processing
January 2023 – Present Donostia

Education

UPV/EHU
Degree in Computer Engineering
September 2017 – September 2021 Donostia

UPV/EHU
Master in Language Analysis and Processing
October 2021 – 2022 Donostia

UPV/EHU
PhD in Language Analysis and Processing
January 2023 – Present Donostia

Languages

Euskara
Español
English

Programming Languages

Python
R
Java
JavaScript
PHP
SQL

Web Development

HTML5
CSS3
Bootstrap
Hugo
Django
.NET

Software Engineering

Requirements
Design
Development
Testing
Methodologies
Source Control

Machine Learning

Classification
Regression
Neural Networks
Jupyter Notebook
scikit-learn
TensorFlow

Tools

Git
GitHub
Xamarin
Eclipse
Visual Studio Code
Visual Studio

IKER-GAITU: research on language technology for Basque and other low-resource languages

The general objective of the IKER-GAITU project is to conduct research on language technology to increase the presence of Basque in the digital environment. It will be carried out between 2023 and 2025 thanks to a grant from the Department of Culture and Language Policy of the Basque Government. Current techniques require enormous amounts of textual and oral data per language, and the data available for Basque and other low-resource languages might not be enough to attain the same quality as larger languages with current technology. For this reason, research on language technology is essential so that low-resource languages are supported with the same quality as other languages. IKER-GAITU pursues the following research objectives: 1. a system that automatically assesses the level of Basque proficiency, written and oral; 2. personalized voice technology for people with disabilities; 3. transcription of spontaneous speech, both when Basque and Spanish are mixed and when there are several speakers; 4. textual conversational systems in Basque that match the quality of the most powerful large language models. This project summary presents the results for the first year. More information at https://hitz.eus/iker-gaitu.

Latxa: An Open Language Model and Evaluation Suite for Basque

We introduce Latxa, a family of large language models for Basque ranging from 7 to 70 billion parameters. Latxa is based on Llama 2, which we continue pretraining on a new Basque corpus comprising 4.3M documents and 4.2B tokens. Addressing the scarcity of high-quality benchmarks for Basque, we further introduce 4 multiple choice evaluation datasets: EusProficiency, comprising 5,169 questions from official language proficiency exams; EusReading, comprising 352 reading comprehension questions; EusTrivia, comprising 1,715 trivia questions from 5 knowledge areas; and EusExams, comprising 16,774 questions from public examinations. In our extensive evaluation, Latxa outperforms all previous open models we compare to by a large margin. In addition, it is competitive with GPT-4 Turbo in language proficiency and understanding, despite lagging behind in reading comprehension and knowledge-intensive tasks. Both the Latxa family of models, as well as our new pretraining corpora and evaluation datasets, are publicly available under open licenses at https://github.com/hitz-zentroa/latxa. Our suite enables reproducible research on methods to build LLMs for low-resource languages.

Projects

Image Caption Generation

Automatic image caption generation model that uses a CNN to condition an LSTM-based language model.

Shape Classification

The goal of the project is to compare different classification algorithms on plane and car shape datasets.

100Iragarki

Your digital showcase. Network Services and Applications, 2019-2020.

Academic Website

Academic personal website that includes a short description, social links, biography, interests, education, skills, experience, accomplishments, projects and contact info.

Antxieta Arkeologi Taldea Website

Website of Antxieta Arkeologi Taldea, a non-profit cultural group that conducts archaeological research in Gipuzkoa.

BattleshipFeatureIDE

Java Battleship FeatureIDE Software Product Line.

Community Detection

Detecting communities of NIPS conference authors using metaheuristics.

Comparing Writing Systems

Comparing Writing Systems with Multilingual Grapheme-to-Phoneme and Phoneme-to-Grapheme Conversion.

Computational Syntax

Computational Syntax slides and exercises.

Corpus Linguistics

Corpus Linguistics slides, labs, assignments and data.

Deep Learning for Natural Language Processing

Deep Learning for Natural Language Processing slides, labs and assignments.

Dialbot

End-to-end dialogue system based on deep learning.

Egunean Behin Visual Question Answering Dataset

A Visual Question Answering dataset based on questions from Egunean Behin, a popular Basque quiz game that consists of answering 10 multiple-choice questions each day.

GitHub Website

GitHub personal website that includes a photo, short description, social links and GitHub repositories and topics.

Grounding Language Models for Spatial Reasoning

HackerRank Challenge Solutions

Solutions for programming challenges in multiple languages.

Hyperpartisan News Analysis With Scattertext

Machine Learning and Neural Networks labs

Machine Learning and Neural Networks labs.

Machine Learning and Neural Networks lectures

Machine Learning and Neural Networks lectures.

Machine Learning exercises with R

Machine Learning exercises with R.

Improving the Security of My Website

I will analyze my website with tools such as Hardenize and Security Headers to identify security aspects that can be improved.

MFDS

Formal Methods for Software Development (Métodos Formales de Desarrollo de Software).

NLP Applications I - Text Classification, Sequence Labelling, Opinion Mining and Question Answering

NLP Applications I - Text Classification, Sequence Labelling, Opinion Mining and Question Answering slides, labs and project.

NLP Applications II - Information Extraction, Question Answering, Recommender Systems and Conversational Systems

NLP Applications II - Information Extraction, Question Answering, Recommender Systems and Conversational Systems slides, labs and project.

ProMeta

A metamodel-based system for defining and implementing software development processes.

ProMeta IO-System

IO-System component of the ProMeta project.

ProMeta ModelEditor

ModelEditor component of the ProMeta project.

Quiz

Quiz game. Web Systems, 2019-2020.

Spiking Neural Network

Simulating the Izhikevich spiking neuron model using the Brian2 software.

Twitter Sentiment and Emotion Analysis

Twitter Sentiment and Emotion Analysis.

Zero-shot and Translation Experiments on XQuAD, MLQA and TyDiQA

Contact