MSc Thesis - UvA - No Labels? No Problem! On using Active Learning for Imbalanced Low-resource Multi-class Document Classification

21 July 2022

Iva Gornishka

Master Thesis by Emiel Steegh

Many governmental organizations need to regain control of their document archives. Various laws (such as the Archiefwet 2021) determine what information must be managed, how it must be managed, and which documents must become publicly available. Together with the Municipality of Amsterdam, we explored their archival problem and how Natural Language Processing can assist in solving it.

The biggest challenges in training a model for labeling assistance are the lack of high-quality labeled data and the large set of imbalanced classes. The texts are all in Dutch, generally written formally, and tend to include domain jargon. This means at least part of the data exists in a low-resource domain: it is neither English nor part of the standard training data of typical language models. On top of that, the classes are imbalanced, meaning they are very unequally represented.

In this work we explore the use of Active Learning (a method aiming to reduce the labeling burden by selecting more informative samples for labeling) for the task of multi-class classification of documents in an imbalanced low-resource setting.
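
To make the idea of selecting informative samples concrete, the following is a minimal sketch of a pool-based active learning loop. It is an illustration only, not the method used in the thesis: the query strategy (least-confidence sampling), the toy nearest-centroid classifier, and all function names (`train_centroids`, `predict_proba`, `least_confident`, `active_learning_loop`) are assumptions chosen to keep the example self-contained. Documents are assumed to be already encoded as numeric feature vectors, and `oracle` stands in for the human annotator.

```python
import math


def train_centroids(X, y):
    """Fit a toy nearest-centroid classifier: one mean vector per class."""
    sums, counts = {}, {}
    for features, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {c: [v / counts[c] for v in sums[c]] for c in sums}


def predict_proba(centroids, features):
    """Softmax over negative Euclidean distances to each class centroid."""
    scores = {c: -math.dist(features, mu) for c, mu in centroids.items()}
    top = max(scores.values())
    exps = {c: math.exp(s - top) for c, s in scores.items()}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}


def least_confident(centroids, pool):
    """Index of the pool item whose highest class probability is lowest."""
    return min(
        range(len(pool)),
        key=lambda i: max(predict_proba(centroids, pool[i]).values()),
    )


def active_learning_loop(labeled_X, labeled_y, pool_X, oracle, budget):
    """Repeatedly retrain, query the least confident sample, and label it.

    `oracle` is a hypothetical callable that returns the true label of a
    sample (in practice, a human annotator). `budget` caps the number of
    labels requested.
    """
    labeled_X, labeled_y, pool_X = list(labeled_X), list(labeled_y), list(pool_X)
    for _ in range(min(budget, len(pool_X))):
        centroids = train_centroids(labeled_X, labeled_y)
        idx = least_confident(centroids, pool_X)
        sample = pool_X.pop(idx)          # remove the queried sample from the pool
        labeled_X.append(sample)
        labeled_y.append(oracle(sample))  # ask the annotator for its label
    return train_centroids(labeled_X, labeled_y), labeled_X, labeled_y
```

With two labeled points at (0, 0) and (4, 0) and a pool containing (0.1, 0), (2, 0), and (3.9, 0), the first query goes to (2, 0): it lies equally far from both centroids, so the model is least certain about it and its label is the most informative to request.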

Furthermore, to help future research in this field and ease result comparison, we present a dataset for the task of low-resource classification of Dutch legal documents at two levels of imbalance.

This research was conducted by Emiel Steegh (LinkedIn) in collaboration with the AI Team, Urban Innovation and R&D, City of Amsterdam.

Involved civil servants: Ymkje Galama & Iva Gornishka.

Supervisors: Ymkje Galama & dr. Giovanni Sileno
