Article

MSc Thesis - UvA - Multimodal Classification of Citizen Reports with Geo-Spatial Context

MSc Thesis by Reitze Jansen

Reports made by citizens about issues in public space, henceforth referred to as citizen reports, contribute to the residential comfort of cities by increasing the efficiency with which problems are identified and resolved. Some reporting systems, such as the Dutch Signalen system (https://signalen.org/), have an automated classification component that performs initial categorization. As a result, users do not need to navigate subcategories, and category-specific provisional information can be requested at the time of report submission.

In this work we investigate different aspects - feature use, multimodality, spatial context and data availability - which might be considered for the classification of citizen reports. We find that approaches which use pre-trained features in a more direct manner appear to perform better than those in which more intermediate feature transformations are learned leading up to classification (e.g. through additional neural network layers). We compare multiple image representation groups, all combined with the same text representation in different early fusion schemes. We find that our most general image representation, the CLIP image embedding without further additional representations, works better than the other representations as well as groups that include additional representations; we expect this to be linked to a tendency towards overfitting. Our spatial-context models perform on par with our models that do not use this extra information. However, we find differences in precision and recall across report density. This leads us to suggest that spatial context information might still be beneficial, but should be obtained in a manner independent of report density, and we suggest some representations for this from the literature. The Meldingen production baseline, logistic regression based on a TF-IDF text encoding, achieves the highest macro precision, and does so disproportionately compared to all other approaches. With respect to accuracy and macro recall, however, we find that a multimodal CLIP representation using text and images yields the best results, with equal performance between models with and without spatial context. Both the baseline and our pre-trained representation approaches outperform the accuracy score reported in previous work.
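The two main classifier families mentioned above can be sketched briefly. The snippet below is a minimal illustration, not the thesis implementation: the toy report texts and labels are invented, and random vectors stand in for the CLIP text and image embeddings (which in practice would come from a pre-trained CLIP encoder). It shows the TF-IDF plus logistic regression baseline, and early fusion by concatenating per-modality feature vectors before a single classifier.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy citizen-report texts with two hypothetical categories
# (0 = street lighting, 1 = waste); real data is far larger.
texts = [
    "broken street light on the corner",
    "garbage bag left on the sidewalk",
    "street lamp not working at night",
    "household trash not collected",
]
labels = [0, 1, 0, 1]

# Baseline: TF-IDF text encoding fed to logistic regression.
tfidf = TfidfVectorizer()
X_text = tfidf.fit_transform(texts)
baseline = LogisticRegression().fit(X_text, labels)

# Early fusion: concatenate a text embedding and an image embedding
# into one feature vector before classification. Random vectors
# stand in for CLIP encoder outputs here (assumed 512-dimensional).
rng = np.random.default_rng(0)
clip_text = rng.normal(size=(len(texts), 512))   # placeholder text features
clip_image = rng.normal(size=(len(texts), 512))  # placeholder image features
X_fused = np.concatenate([clip_text, clip_image], axis=1)
fused_clf = LogisticRegression().fit(X_fused, labels)
```

Early fusion keeps the pre-trained features intact and lets a simple linear classifier combine them, which matches the finding above that more direct use of pre-trained features tends to outperform learning extra transformation layers on limited report data.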

This research was conducted by Reitze Jansen in collaboration with AI Team, Urban Innovation and R&D, City of Amsterdam and the Vereniging van Nederlandse Gemeenten (VNG).

Involved civil servants: Thijs Coenen & Iva Gornishka

Supervisors: Thijs Coenen & Stevan Rudinac
