MSc Thesis - UvA - Assessing LLMs for Long Dutch Document Classification - Femke Bakker

Master Thesis by Femke Bakker

The Woo (Wet open overheid, the Dutch Open Government Act) requires Dutch government organizations to publish information about government activities. However, the published documents are not organized. Automated document classification can help organize them, making documents easier to find and more accessible.

This research examines the capability of LLMs for Dutch document classification. Given the limited research on LLMs’ capabilities in Dutch NLP tasks, our work provides insights into their performance in this specific area.

We compare GEITje, trained on a Dutch corpus, to Llama-2 and Mistral, which are not specifically trained on Dutch data. First, we determine the optimal truncation threshold to shorten the documents. Then, we compare the performance of zero-shot and few-shot prompts using in-context learning. This is followed by a comparison of fine-tuning versus in-context learning, and finally, a comparison of the LLMs to Linear SVM and Naïve Bayes.
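
As an illustration of the first two steps, the following is a minimal sketch of how a document might be truncated and placed in a zero-shot or few-shot prompt. It is not the thesis's actual implementation: the function names, the prompt wording, the word-based truncation (a real setup may count model tokens instead) and the default threshold of 200 are all assumptions made for this example.

```python
def truncate_words(text: str, max_words: int) -> str:
    """Keep only the first max_words whitespace-separated words.

    Assumption: truncation is counted in words, not model tokens.
    """
    return " ".join(text.split()[:max_words])


def build_prompt(document, labels, examples=(), max_words=200):
    """Build a classification prompt (hypothetical wording).

    With no examples this is a zero-shot prompt; with labelled
    (text, label) pairs in `examples` it becomes a few-shot prompt.
    """
    lines = [
        "Classify the document into one of these categories: "
        + ", ".join(labels) + ".",
        "",
    ]
    # Each in-context example is truncated the same way as the target.
    for example_text, example_label in examples:
        lines.append("Document: " + truncate_words(example_text, max_words))
        lines.append("Category: " + example_label)
        lines.append("")
    # The document to classify comes last, with the category left open.
    lines.append("Document: " + truncate_words(document, max_words))
    lines.append("Category:")
    return "\n".join(lines)
```

Calling `build_prompt(text, labels)` gives the zero-shot variant, and passing a few labelled documents through `examples` gives the few-shot variant. Varying `max_words` is one simple way to explore different truncation thresholds.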

The findings show that, in an in-context learning setting, the few-shot prompt yields the best performance. However, the fine-tuning approach is more accurate and faster than the in-context learning approach. While GEITje achieves the best performance in the in-context learning setting, Mistral outperforms it when fine-tuning is applied. Since GEITje is Mistral-based, this suggests that the prior Dutch training that GEITje received was an advantage in the in-context learning setting, but that this advantage disappeared after fine-tuning. Linear SVM performed competitively with the fine-tuned LLMs while also being significantly faster.

This research was conducted by Femke Bakker in collaboration with AI Lab, Innovation Department, City of Amsterdam.

Involved civil servants: Iva Gornishka

Supervisors: Ruben van Heudsen & Iva Gornishka 

 
