Accessibility of urban spaces is a key issue that directly impacts the quality of life of an estimated 16% of the global population who live with significant disabilities. Streets and sidewalks offer opportunities for transportation, economic activity, and social interaction, and accessibility is explicitly mandated by the European Accessibility Act (EAA). This matters because pervasive inaccessibility creates urban barriers that perpetuate social exclusion and inequality [1]. To tackle these issues and advocate for inclusivity, the City of Amsterdam launched the Amsterdam for All initiative in partnership with Project Sidewalk [2]. Project Sidewalk uses AI to evaluate and enhance city accessibility, thereby informing the planning of accessible routes.
Existing approaches employ computer vision techniques to automatically assess sidewalk conditions from streetscape imagery [3,4,5], reducing manual labour. However, these techniques require large-scale, high-quality, and diverse training datasets, which are labour-intensive and costly to develop. To mitigate this issue, this project explores the potential of self-supervised techniques in a computer vision pipeline to assess sidewalk conditions, specifically to localise obstacles: misplaced objects or structural barriers that obstruct the sidewalks of Amsterdam.
Self-Supervised Learning
To understand how self-supervised learning can help us rely less on large and costly labelled datasets, Figure 1 illustrates the painter metaphor. In contrast to conventional supervised learning, where the student learns with the help of a teacher, here the student is given the task of completing a portion of a painting from the rest of it, without explicit guidance. Through this process, it captures the inherent structures and patterns within the data, almost like 'understanding' what a sky or a house looks like.
By the same logic, a self-supervised model can ‘understand’ what an object looks like and help us identify obstacles on the sidewalk.
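To make the painter metaphor concrete, the sketch below implements a toy masked-reconstruction pretext task in PyTorch: random patches of an image are hidden and a small network learns to paint them back in. This is only an illustration of the principle; the network, patch size, and loss are arbitrary choices, not the actual training objective of the models used later in this post.

```python
import torch
import torch.nn as nn

# Toy masked-reconstruction pretext task: hide random patches of an image
# and train a small network to paint them back in (illustrative only).
PATCH = 16  # patch size in pixels (arbitrary choice)

model = nn.Sequential(                        # tiny encoder-decoder stand-in
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 3, 3, padding=1),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def random_patch_mask(b, h, w, drop=0.5):
    """Boolean mask of shape (b, 1, h, w): True where a patch is hidden."""
    grid = torch.rand(b, 1, h // PATCH, w // PATCH) < drop
    return grid.repeat_interleave(PATCH, dim=2).repeat_interleave(PATCH, dim=3)

images = torch.rand(8, 3, 128, 128)           # stand-in for unlabelled street images
mask = random_patch_mask(8, 128, 128)
masked_input = images.masked_fill(mask, 0.0)  # hide the selected patches

reconstruction = model(masked_input)
# The 'self-supervision' signal: reconstruct only the hidden pixels.
loss = ((reconstruction - images)[mask.expand_as(images)] ** 2).mean()
loss.backward()
optimizer.step()
```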
Project Sidewalk
Project Sidewalk is a web-based crowdsourcing tool developed by the University of Washington that turns the process of collecting sidewalk accessibility data into an interactive, gamified platform (Figure 2). Anybody can virtually stroll through a city and spot and tag accessibility issues, made possible by a combination of Google Street View images and user interaction. Gamification elements like missions and leaderboards encourage users to produce quality data.
In our work, we use a dataset of panoramic street-level images from Project Sidewalk’s API. Each image is associated with point labels (the circular tags in Figure 2) for different accessibility feature categories, such as curb ramps, surface problems, or obstacles. All labels are crowdsourced via Project Sidewalk’s platform. As this research focuses on obstacles, the final input is the subset of images with at least one obstacle point label.
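As an illustration, selecting that subset could look like the sketch below, assuming the crowdsourced labels have been exported to a CSV file. The column names (`pano_id`, `label_type`) are hypothetical and not the actual Project Sidewalk schema.

```python
import pandas as pd

# Hypothetical export of crowdsourced point labels; the column names
# (pano_id, label_type) are illustrative, not the real Project Sidewalk schema.
labels = pd.read_csv("project_sidewalk_labels.csv")

# Keep only obstacle labels, then the set of panoramas containing at least one.
obstacles = labels[labels["label_type"] == "Obstacle"]
panos_with_obstacles = set(obstacles["pano_id"])

print(f"{len(panos_with_obstacles)} panoramas contain at least one obstacle label")
```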
Computer Vision Pipeline
In a computer vision pipeline, it is common practice to proceed through modules step by step. In the first module, the input data (in our case, the Project Sidewalk data) is typically pre-processed; for example, a set of images is rescaled to meet the next module's requirements. Each operation is implemented as a separate module. In our work, we design and implement a modular and flexible pipeline made of four parts, all visualised in Figure 3.
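A minimal sketch of such a modular setup is shown below. The module names mirror the four parts described in the rest of this post, while the implementations are placeholders.

```python
from typing import Any, Callable

# Minimal sketch of a modular pipeline: each module is a callable that
# transforms the output of the previous one. The bodies are placeholders.
def preprocess(panorama): ...
def discover_objects(cube_faces): ...
def filter_with_sidewalk_context(masks): ...
def evaluate(filtered_masks): ...

PIPELINE: list[Callable[[Any], Any]] = [
    preprocess,                    # 1. panorama -> cube faces
    discover_objects,              # 2. cube faces -> candidate object masks
    filter_with_sidewalk_context,  # 3. masks -> masks restricted to sidewalks
    evaluate,                      # 4. masks -> quantitative / qualitative scores
]

def run(panorama):
    data = panorama
    for module in PIPELINE:
        data = module(data)  # swap or reorder modules without touching the rest
    return data
```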
1. Pre-processing Module
A limitation of panoramic images is that they are spherical images mapped onto a 2D plane (an equirectangular projection), which causes distortions towards the poles, as can be seen in Figure 4 (left). To address this, in this module we change the mapping to a cube map, projecting the panorama onto the six faces of a cube. This process reduces distortion and retains more detail, as demonstrated in Figure 4 (right).
The final output of this module consists of multiple cube images for each panorama, which are the input of the next module.
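As a sketch, the conversion could be implemented with the third-party py360convert package (our choice here for illustration, not necessarily what the pipeline uses); the face resolution is an arbitrary parameter.

```python
import numpy as np
import py360convert  # third-party package for 360° image projections
from PIL import Image

# Convert an equirectangular panorama into six cube faces to reduce
# the pole distortion described above. Face resolution is arbitrary.
pano = np.array(Image.open("panorama.jpg"))

faces = py360convert.e2c(pano, face_w=1024, mode="bilinear", cube_format="dict")

# cube_format="dict" returns one face per key: F(ront), R(ight), B(ack), L(eft), U(p), D(own).
for name, face in faces.items():
    Image.fromarray(face.astype(np.uint8)).save(f"face_{name}.png")
```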
2. Unsupervised Object Discovery (UOD) Module
After pre-processing, the images then move on to the next stage of our pipeline - the Unsupervised Object Discovery (UOD) Module. The goal of this module is to identify objects, represented in an image as groups of pixels, and distinguish them from the rest of the image (the background). In a conventional setting, this module is implemented with an object detector that has been trained to recognize a finite number of object classes through supervised learning. In our case, we employ self-supervised techniques, resulting in a model that can ideally recognize any object in any image.
To illustrate this, Figure 5 shows examples of correct outputs from MOVE [6], the model we chose for our experiments. In all examples, the model outputs object masks: distinct groups of pixels corresponding to objects in the image. In these cases, the masks line up with the crowdsourced point labels, which is what a correct output looks like. However, our setting presents several challenges. For example, a panorama can contain many objects, which can also be far from the camera. Also, depending on the task, an object may or may not be considered part of the background. These limitations hinder the performance of our model, as shown in Figure 6. In these examples, the obstacles indicated by the point labels are not found by the model, or the model mistakes parts of the background for obstacles.
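A simple way to check whether a predicted mask "corresponds to" a point label is to test whether the label's pixel coordinates fall inside the mask, as in the sketch below (the coordinates are assumed to already be in the cube-face image frame):

```python
import numpy as np

def mask_hits_label(mask: np.ndarray, label_xy: tuple[int, int]) -> bool:
    """True if the point label (x, y) falls inside the binary object mask."""
    x, y = label_xy
    return bool(mask[y, x])  # masks are indexed row-first (y, x)

def matched_labels(masks: list[np.ndarray], labels: list[tuple[int, int]]) -> list[bool]:
    """For each point label, report whether any predicted mask covers it."""
    return [any(mask_hits_label(m, xy) for m in masks) for xy in labels]
```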
3. Semantic Segmentation Module
To improve the object detector's performance, we propose to use this module to extract urban contextual information. Specifically, we employ Grounded-SAM [7] to automatically extract sidewalk masks from the images. This lets us guide the object detector towards objects that are located on sidewalks. We identify two ways to achieve this. In the first, we crop the input image, removing the non-relevant pixels, before feeding it to the object detector. In the second, we use the urban information to filter the object masks obtained from the previous module.
Figure 7 visualises the first approach. Assuming that each image has a relevant and a non-relevant part, we treat the regions containing sidewalks as relevant and filter out all other pixels. The main goal is to remove buildings and other background while keeping the ground region intact. By feeding this image to MOVE, the model can focus its attention on fewer elements of the urban scene and, more importantly, on the relevant ones.
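A simplified reading of this prefiltering step is sketched below: everything well above the detected sidewalk region (typically façades and sky) is blanked out before the image is passed to MOVE. The margin is a free parameter, and the exact cropping rule in our pipeline may differ.

```python
import numpy as np

def prefilter_image(image: np.ndarray, sidewalk_masks: list[np.ndarray],
                    margin: int = 50) -> np.ndarray:
    """Blank out pixels far above the detected sidewalks (e.g. building façades),
    keeping the ground region intact. Illustrative; the margin is a free parameter."""
    sidewalk = np.zeros(image.shape[:2], dtype=bool)
    for m in sidewalk_masks:
        sidewalk |= m                      # union of all sidewalk masks
    rows = np.where(sidewalk.any(axis=1))[0]
    if len(rows) == 0:
        return image                       # no sidewalk found: leave the image untouched
    top = max(rows.min() - margin, 0)      # highest sidewalk row, with a safety margin
    filtered = image.copy()
    filtered[:top] = 0                     # remove buildings / sky above the ground region
    return filtered                        # this filtered image is what MOVE sees
```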
Alternatively, Figures 8 and 9 show two ways to implement the second approach. The Semantic Segmentation Module outputs a bounding box and a mask for each detected sidewalk. In the first variant, the algorithm discards all object masks that do not overlap with a sidewalk bounding box. In the second, all object masks whose pixel distance to the sidewalk masks exceeds a threshold are discarded. The sidewalk-mask variant is a stricter version of the bounding-box variant. In both cases, the filtering helps remove non-relevant objects (and noise) from the model output.
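The two filtering variants could be sketched as follows, where `masks` are the object masks from the UOD module and the distance threshold is a free parameter; this illustrates the idea rather than the exact implementation.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def overlaps_bbox(mask: np.ndarray, bbox: tuple[int, int, int, int]) -> bool:
    """Variant 1: keep an object mask only if it overlaps the sidewalk bounding box."""
    x0, y0, x1, y1 = bbox
    return bool(mask[y0:y1, x0:x1].any())

def within_distance(mask: np.ndarray, sidewalk_mask: np.ndarray,
                    max_dist: float = 30.0) -> bool:
    """Variant 2: keep an object mask only if some of its pixels lie within
    max_dist pixels of the sidewalk mask (a stricter test than the bbox one)."""
    # Distance (in pixels) from every pixel to the nearest sidewalk pixel.
    dist_to_sidewalk = distance_transform_edt(~sidewalk_mask)
    return bool((dist_to_sidewalk[mask] <= max_dist).any())

def postfilter(masks, sidewalk_mask, bbox, use_bbox=True):
    """Discard object masks that fail the chosen sidewalk test."""
    if use_bbox:
        return [m for m in masks if overlaps_bbox(m, bbox)]
    return [m for m in masks if within_distance(m, sidewalk_mask)]
```

Keeping the test per mask also makes it easy to log which candidate objects were discarded and why.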
4. Evaluation Module
After applying the Semantic Segmentation and UOD modules, the results are passed on to our Evaluation Module. This module is designed to compare the effectiveness of the different algorithms and determine the most successful one for our specific use case. It combines conventional quantitative metrics with custom-crafted qualitative metrics.
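As an example of the quantitative side, two simple statistics one could compute are the fraction of obstacle point labels covered by a predicted mask and the fraction of predicted masks that touch a sidewalk. These are plausible stand-ins for the conventional metrics mentioned above, not necessarily the exact ones used.

```python
import numpy as np

def point_label_recall(masks: list[np.ndarray],
                       obstacle_points: list[tuple[int, int]]) -> float:
    """Fraction of obstacle point labels covered by at least one predicted mask."""
    hits = sum(any(m[y, x] for m in masks) for x, y in obstacle_points)
    return hits / max(len(obstacle_points), 1)

def sidewalk_mask_rate(masks: list[np.ndarray], sidewalk_mask: np.ndarray) -> float:
    """Fraction of predicted masks that touch the sidewalk (a relevance proxy)."""
    on_sidewalk = sum(bool((m & sidewalk_mask).any()) for m in masks)
    return on_sidewalk / max(len(masks), 1)
```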
Results
In all our experiments, incorporating urban context information into the pipeline improves the quality of the output masks. In Figure 10, a comparison between the baseline algorithm and the one that uses urban context reveals that, with prefiltering, the model outputs fewer non-relevant object masks along with less noise. The overall quality of the correct obstacle masks also increases. For example, in the first image, the object detector no longer identifies parts of the left building as objects.
To gain a deeper understanding of the effectiveness of our filtering techniques, we conducted a user study involving domain experts in the Amsterdam Intelligence team who have experience in facilitating urban accessibility. The experts evaluate the output of three models (baseline, prefiltering, postfiltering with bounding boxes) by answering three questions:
- Is the object mask located on a sidewalk?
- Does the object mask represent an obstacle on the sidewalk?
- Rate the object mask quality.
The user study evaluates the relevance (Question 1), precision (Question 2), and utility (Question 3) of the predicted masks.
The results of the user study (Figure 11) confirm that the filtering algorithms effectively decrease the number of object masks that are either not located on sidewalks or do not represent obstacles, and that would therefore not be used in production. They also show a slight decrease in the number of correct masks, suggesting that higher precision comes at the price of some reduction in recall. Comparing the two filtering methods, postfiltering achieves higher precision than prefiltering in Task 1 and Task 2, yet in Task 3 the prefiltering method outputs the most high-quality masks. This could be a reason to choose prefiltering over postfiltering as the final method.
While these algorithms improve the performance of the pipeline, the results also reveal that even the best-performing algorithms still predict masks of which 40% are not located on a sidewalk. In the mask rating task, the majority of the masks fail the evaluation with respect to their usefulness for actually identifying accessibility issues.
The primary reason for these shortcomings is the complexity of our task. Urban images contain multiple, interrelated objects, whereas object detectors trained in a lab setting use object-centric datasets in which a single object is the focus of the image. This mismatch manifests in the sub-optimal performance of such models when confronted with our complex scenario; MOVE, specifically, has been trained on images from object-centric datasets. Coupled with this is the failure of Grounded-SAM to segment sidewalks, as it often confuses them with roads or fails to localise them entirely. Conventional models trained to segment urban scenery in North American panoramas struggle when faced with European ones, such as those from Amsterdam.
Addressing these problems is a crucial step toward achieving a usable framework.
Conclusion and Future Work
Given the modular nature of the pipeline, each module offers opportunities for improvements to boost the overall performance of the framework. For example, implementing multi-cropping (Figure 12) for each face of the cube map projection can assist the object detector by focusing its attention on the objects. As a byproduct, objects distant from the camera would appear more prominent, simplifying the detection process.
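A minimal sketch of such multi-cropping, with crop size and overlap as free parameters:

```python
import numpy as np

def multi_crop(face: np.ndarray, crop: int = 512, overlap: int = 128):
    """Tile a cube face into overlapping square crops (crop size and overlap
    are free parameters). Each crop is processed by the object detector
    separately, so far-away objects occupy a larger share of the input."""
    h, w = face.shape[:2]
    step = crop - overlap
    crops = []
    for y in range(0, max(h - crop, 0) + 1, step):
        for x in range(0, max(w - crop, 0) + 1, step):
            crops.append(((x, y), face[y:y + crop, x:x + crop]))
    return crops  # keep the (x, y) offsets so masks can be mapped back to the face
```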
Moreover, training MOVE on a scene-centric dataset could greatly enhance its ability to locate objects in urban environments. Grounded-SAM could also potentially be used as an object detector for a finite set of classes. Certain accessibility features, such as curb ramps, streetlights, and a limited set of obstacles (e.g., bikes and poles), could be identified this way.
Furthermore, there are opportunities for improvement in the Semantic Segmentation module. Techniques like few-shot learning or domain adaptation could be used to fine-tune the semantic segmentation model, leading to better segmentation of urban scenes.
Despite these proposed improvements, it's important to remember that the task at hand is challenging. Extensive experimentation may still reveal that the task's complexity and high-performance requirements outpace the capabilities of unsupervised techniques. A potential middle ground between standard supervised learning and the unsupervised learning frameworks we have explored could be weakly supervised learning [8].
References
[1] Winnie Hu. For the disabled, New York's sidewalks are an obstacle course. The New York Times, Oct 2017.
[2] Manaswi Saha, Michael Saugstad, Hanuma Maddali, Aileen Zeng, Ryan Holland, Steven Bower, Aditya Dash, Sage Chen, Anthony Li, Kotaro Hara, and Jon Froehlich. Project sidewalk: A web-based crowdsourcing tool for collecting sidewalk accessibility data at scale. 2019.
[3] Kotaro Hara, Jin Sun, Robert Moore, David Jacobs, and Jon Froehlich. Tohme: Detecting curb ramps in google street view using crowdsourcing, computer vision, and machine learning. In Proceedings of the 27th Annual ACM Symposium on User Interface Software and Technology, UIST ’14, page 189–204, New York, NY, USA, 2014. Association for Computing Machinery.
[4] Maryam Hosseini, Mikey Saugstad, Fabio Miranda, Andres Sevtsuk, Claudio T. Silva, and Jon E. Froehlich. Towards global-scale crowd+AI techniques to map and assess sidewalks for people with disabilities. In CVPR 2022 Workshop: Accessibility, Vision, and Autonomy (AVA), 2022.
[5] Galen Weld, Esther Jang, Anthony Li, Aileen Zeng, Kurtis Heimerl, and Jon E. Froehlich. Deep learning for automatically detecting sidewalk accessibility problems using streetscape imagery. In Proceedings of the 21st International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS), 2019.
[6] Adam Bielski and Paolo Favaro. Move: Unsupervised movable object segmentation and detection, 2022.
[7] IDEA-Research. Grounded-segment-anything. https://github.com/IDEA-Research/Grounded-Segment-Anything, 2023.
[8] Zhi-Hua Zhou. A brief introduction to weakly supervised learning. National Science Review, 5(1):44–53, 2017.
* Cover image: Example of a panorama with sidewalks obstructed by bikes and bollards ("Amsterdammertje").