Reference: Selma Muhammad. The Fairness Handbook. Gemeente Amsterdam, 2022.
The Fairness Handbook originates from the website of Amsterdam Intelligence.
Decisions and outcomes generated by algorithms can lead to discrimination against individuals and demographic groups when the algorithms are not built and deployed properly. However, monitoring and assessing the fairness of models is a daunting task, requiring a multidisciplinary collaboration between various stakeholders, including data scientists, domain experts and end users.
To provide a practical set of instruments for assessing the fairness of a model and minimizing algorithmic harms that affect citizens, Amsterdam Intelligence recently released the Fairness Handbook. This book provides an introduction to algorithmic fairness and bias for everyone whose work involves data and/or algorithms. It explains how biases and other problems in the model development cycle can cause several forms of harm that consequently impact individuals or disadvantaged groups in society. With the Fairness Pipeline, we then offer a step-by-step plan to evaluate a model for biases and to mitigate these problems.
The city of Amsterdam is committed to leveraging the benefits of data, but because of the responsibilities of the organization, municipal data products often involve sensitive data and can have a serious impact on citizens. It’s therefore of the utmost importance that data is used in a way that is ethical and fair towards all citizens.
This is especially important when using machine learning. Machine learning makes use of data by learning through the generalization of examples, defining rules that apply for past cases, but also predicting future unseen cases. This technique can be very useful to make data-driven decisions, uncovering relevant factors that humans might overlook, but it doesn’t ensure fair decisions.
Over the last few years, many problematic examples of machine learning made headlines, showing that it does not at all guarantee fair decisions.
At the municipality, one application of machine learning is a model that helps counteract illegal holiday rentals, freeing up valuable living space for Amsterdam citizens. In this context, we need to make sure that different groups of people are treated fairly when a suspicion of illegal holiday renting at an address arises.
In this blogpost, we’ll share with you how we went about analyzing the model for biases[¹], with the objective of preventing harmful, unjustified, and unacceptable disparities. This entails estimating the magnitude and direction of biases present in the model, understanding them, and mitigating them where necessary.
The methods described in this blogpost represent how we translated the abundance of theory on bias in machine learning to practical implementation. There’s unfortunately not (yet) one perfect way to handle bias in machine learning, but hopefully, this example will inspire you and make it just a little bit easier to carry out a bias analysis yourself.
This is part 1 of three blogposts, in which we’ll start by explaining some important concepts regarding bias in machine learning. If you already feel confident in that area, stay tuned for part 2 and part 3, which cover the practical details of our methodology and the results.
As we showed recently, bias in the outcomes of a model may arise in different ways. The main one is through bias in the data used to train a model. This may represent bias in the actual underlying process, or it may get introduced when data is collected, while the underlying process itself is completely unbiased - or both.
To understand this a little better, it’s useful to look at some types of bias (by no means an exhaustive list):
It can also be useful to consider bias at the modeling stage, which is commonly called inductive bias. These are the assumptions one must make in order to train a generalizable machine learning model. They may vary depending on the algorithm that is used.
Two principal doctrines exist in (US) discrimination law: disparate treatment and disparate impact. Although they are concepts from US law, they provide a useful way of thinking about bias.
Disparate treatment happens when an individual is treated less favorably than others for directly discriminatory reasons. It may occur in a model if the discriminatory category is used as a feature. This could be either directly, or indirectly but intentionally through correlations with other, seemingly neutral features (we will dive into more details in the next sections).
Disparate impact happens when seemingly neutral systems or practices are in place that, unintentionally, disproportionately hurt individuals who belong to a legally protected class. It could be introduced into a model by using features that seem non-discriminating, but still, disadvantage some protected groups.
What clearly separates disparate impact from disparate treatment is the absence of intention: selecting new colleagues based on their ability to comprehend the intricacies of Dutch municipal bureaucracy could be classified as disparate treatment if we do it intentionally to make it harder for non-Dutch persons to join the municipality, or as disparate impact if we overlooked the fact that this test might be easier for a native Dutch speaker[²].
To check for disparate impact, the "four-fifths" rule is often used: in historical cases, a difference of more than 20% between the outcomes for two groups has been enough to conclude that there is discrimination.
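As an illustration of how the four-fifths rule can be applied, here is a minimal sketch with made-up selection counts (not data from the municipality); the two-sided 0.80-1.25 range used later in this series is checked as well:

```python
def selection_rate_ratio(selected_a, total_a, selected_b, total_b):
    """Ratio of the selection rates of group A (unprivileged) and group B (privileged)."""
    return (selected_a / total_a) / (selected_b / total_b)

# Hypothetical example: 30 of 100 applicants from group A selected vs. 50 of 100 from group B.
ratio = selection_rate_ratio(selected_a=30, total_a=100, selected_b=50, total_b=100)
print(f"Selection-rate ratio: {ratio:.2f}")  # 0.60

# Four-fifths rule: a ratio below 0.80 (or, two-sided, outside 0.80-1.25) suggests disparate impact.
if not 0.80 <= ratio <= 1.25:
    print("Possible disparate impact")
```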
Individual and group fairness are closely related to disparate treatment and disparate impact.
Individual fairness states that similar individuals should be treated similarly. This means striving for procedural fairness and equality of opportunity: people with the same relevant characteristics should get the same outcome.
Group fairness, on the other hand, states that outcomes should be equal on a group level. This means striving for distributive justice and minimal inequity: good and bad outcomes should be equally divided among protected groups, even if those groups may differ in relevant characteristics. The logic behind this is that sometimes, differences in relevant characteristics are caused by historical injustices, so outcomes should be distributed equally to prevent those historical injustices from having a lasting impact. Caution is required though, because it could have an adverse impact on a protected group if, for example, loans are given to people with lower creditworthiness, and they go bankrupt because of it.
Note how these two concepts are not necessarily compatible: to make the outcomes fair on a group level, we might need to treat individuals from a disadvantaged group more favourably than those from an advantaged group.
Roughly speaking, group fairness is the way to go if one believes that differences in characteristics and outcomes between groups are the result of historical injustices (like discrimination) that an individual cannot do anything about. Individual fairness is the way to go if one believes that the characteristics and outcomes of a person have nothing to do with their group but are the product solely of their own choices and actions.
Based on that, we can then choose to either:
Direct bias means that a model discriminates by using a sensitive attribute, such as membership of a protected group, as a feature. The model can thus directly see which group subjects belong to, and directly learn any correlation between the sensitive feature and the target. The solution to direct bias is to remove sensitive attributes from the model.
Indirect bias (also called bias by proxy) happens when a non-sensitive feature is correlated with a sensitive attribute. By using this proxy variable as a feature, the model may still learn to discriminate between groups. An example that holds true in many places is that postcode is often a proxy for nationality: postcode is not inherently sensitive, but in most cities, it does have a strong relation to the nationality of inhabitants.
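A simple first check for such proxy relations is to cross-tabulate a candidate proxy feature against the sensitive attribute. The sketch below uses toy data and hypothetical column names (`postcode_area`, `nationality_group`), not the municipality's data:

```python
import pandas as pd

# Toy data: one row per address, with a postcode area and an aggregated nationality group.
df = pd.DataFrame({
    "postcode_area":     ["1011", "1011", "1102", "1102", "1102", "1075"],
    "nationality_group": ["non-Dutch", "Dutch", "non-Dutch", "non-Dutch", "Dutch", "Dutch"],
})

# Share of each nationality group per postcode area; strongly skewed shares suggest a proxy relation.
proxy_table = pd.crosstab(df["postcode_area"], df["nationality_group"], normalize="index")
print(proxy_table)
```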
Indirect bias is also sometimes called "redlining", after the past practice of US credit providers to circle certain neighborhoods on a map with a red pencil to mark them as areas they would not serve. This way, they deliberately reduced the number of black people in their customer base.
Looking at the potential sources of bias, there are a few natural ways to counter it: by changing the underlying process, changing the way the dataset is collected, or adjusting the algorithm.
Furthermore, there are three stages in the modeling process where it is possible to intervene using mitigation techniques and reduce the bias: before training (pre-processing the data), during training (in-processing, adjusting the learning algorithm), and after training (post-processing the predictions).
Since mitigation techniques alone could easily fill a blogpost and we ended up not needing them for our model, we won’t go into more detail here.
However, in terms of practical implementation, it is good to know that many mitigation methods and fairness metrics are implemented (and described) in the AI Fairness 360 toolkit by IBM.
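As a rough sketch of what working with that toolkit can look like, here is a toy example with made-up data and column names (check the AIF360 documentation for the details of the API):

```python
import pandas as pd
from aif360.datasets import BinaryLabelDataset
from aif360.metrics import BinaryLabelDatasetMetric

# Toy data with a binary sensitive attribute 'sex' (1 = privileged) and a binary label.
df = pd.DataFrame({
    "sex":       [0, 0, 1, 1, 1, 0],
    "feature_x": [1.0, 2.0, 0.5, 3.0, 1.5, 2.5],
    "label":     [0, 1, 1, 1, 0, 0],
})
dataset = BinaryLabelDataset(df=df, label_names=["label"], protected_attribute_names=["sex"])

metric = BinaryLabelDatasetMetric(
    dataset,
    unprivileged_groups=[{"sex": 0}],
    privileged_groups=[{"sex": 1}],
)
print(metric.disparate_impact())              # ratio of favourable-outcome rates between the groups
print(metric.statistical_parity_difference()) # difference of favourable-outcome rates
```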
In this first part, we went over important concepts of bias in machine learning models, explaining the theory behind the decisions that must be made during a bias analysis. The purpose of this part is to provide a good starting point for further study and to cover the basic concepts needed to understand the second part.
In the second part, we are going to get practical and look at the actual methodology that we used in our model, in order to ensure that different groups of people are treated fairly when suspicion of illegal holiday renting at an address arises.
We hope that reading both parts will inspire and help you to carry out a bias analysis yourself.
Footnotes
[1] Bias is a term that can mean very different things to different people. In this blogpost, we’ll use the term loosely to describe any “harmful, unjustified, and unacceptable disparities” in the outcomes of a model.
[2] Disregarding for a moment the fact that navigating bureaucracy is indeed sometimes an indispensable skill here.
Authors: Sebastian Davrieux, Meeke Roet, Swaan Dekkers & Bart de Visser
This article originates from: Part 1: Concepts: Analyzing Bias in Machine Learning: a step-by-step approach (amsterdamintelligence.com)
This is the second part of three blog posts regarding the work we have been doing at the city of Amsterdam, focusing on bias in machine learning. If you have made it this far after having read part 1, you should now have a decent understanding of some important concepts regarding bias in machine learning, and why it’s important to the municipality. Or perhaps you jumped straight here, because you already knew all that, in which case: good for you and welcome!
Recall that in these blog posts, we are concerned with making sure that different groups of people are treated fairly by our machine learning model when a suspicion of illegal holiday renting at their address arises. In part 1 we explained why this is important to the city of Amsterdam and dove into the theory around bias in machine learning. In part 2, we will start getting practical and look at the actual methodology we used. There is unfortunately not (yet) one perfect way to handle bias in machine learning, but hopefully, this example will inspire you and make it just a little bit easier to carry out a bias analysis yourself.
Suppose that citizens are renting out their apartment to tourists. Since the city of Amsterdam suffers from a serious housing shortage, they are only allowed to rent out their apartment for a maximum of 30 nights per year, to at most 4 people at a time, and they must communicate it to the municipality. When a rental platform or a neighbor suspects that one of those requirements is not fulfilled, they can report the address to the municipality. The department of Surveillance & Enforcement can then open a case to investigate it.
The subject of this blogpost is the machine learning model that supports the department of Surveillance & Enforcement in prioritizing cases, so that the limited enforcement capacity can be used efficiently. To this end, the model estimates a probability of illegal holiday rental at an address. This information is added to the case and shown to the employee, along with the main factors driving the predicted probability. This helps to ensure that the human stays in the lead, because if the employee thinks the main drivers are nonsense, they will ignore the model’s advice.
At the same time, if they do start an investigation, it will help them to know what to pay attention to. A holistic look at the case (including the model’s output, but also the details of the case and requests from enforcers or other civil servants) then leads to the decision to pay the suspected citizen an investigative visit or not. After the investigation, supervisors and enforcers together judge whether there was indeed an illegal holiday rental or not. The purpose of the model is thus to support the department in prioritizing and selecting cases, while humans still do the research and make the decisions. It is good to note that in principle, all cases do eventually get investigated.
Our goal is to analyze bias in the outcomes of the trained model, specifically: the probability of illegal holiday rental assigned to a case by the model. As we saw earlier, two sources of such bias are the data and the algorithm itself. This naturally leads to two ways of doing the analysis, namely by analyzing the data and/or the algorithm. A third option is to analyze bias in the outcomes. This is less complicated, because it is well-scoped and the outcomes are readily available on the train and test set. Since we're looking to investigate specifically the potential discriminatory effects of using the model in practice, and not those of the entire working process, this is the option we decided to go with. This does mean that it is not possible to exactly pinpoint the causes of a bias if we find one, although you can often make an educated guess about the ‘why’ in hindsight.
Having sorted that out, we come to the actual analysis, which can be divided roughly into 8 steps: (1) determine the sensitive attributes we do not want the model to discriminate on, (2) construct hypotheses about features that could cause indirect bias, (3) select the fairness metric(s), (4) analyze direct bias, (5) analyze indirect bias using the available sensitive attributes, (6) check the remaining indirect-bias hypotheses on the features themselves, (7) discuss the results with the stakeholders, and (8) mitigate any biases that need to be mitigated.
To be able to do the analysis, we need to think about which attributes we do not want the model to discriminate on. For machine learning within the municipality, the following sensitive attributes can be considered relevant (to varying degrees, depending on the application):
We need information about these attributes to analyze them. However, not all of them are registered by the municipality or even by the government in general, in many cases for good reason. We mention the complete list here anyway to be aware of potential blind spots in the analysis. The attributes that were available within the municipality had never been used for the purpose of a bias analysis before. For that reason, we first defined the legal grounds on which this sensitive data could be processed.
As we know, some seemingly-innocent features may be correlated with a sensitive attribute, causing indirect bias. The second step in the analysis is to brainstorm about such relations. We systematically went through our list of features and were very liberal at this step: if we could come up with any logic why a feature could potentially cause indirect bias, we included it as a hypothesis. This resulted in a list of hypotheses about correlations between features and sensitive attributes.
We are not supposed to say that one step is more important than the others, but selecting the metric(s) carefully is quite crucial. Bias is not a one-dimensional concept, and multiple metrics are often needed to get a full overview.
We have already seen that individual fairness and group fairness are very different concepts, and within those two streams, there are still many more fine-grained metrics. At the same time, it is impossible to improve all metrics simultaneously. For that reason, we decided to look at a few metrics, and select one main metric to guide our decision-making in the analysis.
The “fairness tree” below was of great help in selecting the primary metric. Studying the metrics present in the tree, as well as the ones absent from it, helped us to better understand the implications of the choice. Other helpful references were the documentation of AIF360 and Aequitas.
Exploring the tree, we decided that:
These considerations led us to choose false-positive rate parity as the primary focus, implemented as the false-positive rate ratio. Many metrics can be looked at either as a difference or as a ratio, but they convey the same information. In hindsight, it’s probably easier to use the difference. To understand this specific metric, let’s look at the false-positive rate.
The false-positive rate (FPR) is calculated as the ratio between the number of negative events wrongly categorized as positive (false positives) and the total number of actual negative events (regardless of classification):
FPR = FP / N
So, what does false-positive rate parity mean for two citizens that get reported by their neighbor? If they’re both on the list of potential cases, but neither has an illegal holiday rental, then both should have the same probability of getting (wrongly) investigated.
The false-positive rate ratio (FPRR) is then defined as:
FPRR = FPR_A / FPR_B

where FPR_A and FPR_B are the false-positive rates of the two groups being compared.
Using the four-fifths rule (explained in the first blog post), it can be interpreted as follows: a FPRR between 0.80 and 1.25 is considered acceptable, while a value outside that range indicates a bias against one of the two groups.
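Putting these definitions together, here is a minimal sketch of how the FPR and the FPRR can be computed per group with NumPy (toy labels and group memberships, not the municipality's code; which group ends up in the numerator is a convention you have to pick):

```python
import numpy as np

def false_positive_rate(y_true, y_pred):
    """FPR = FP / N: share of actual negatives that were wrongly flagged as positive."""
    negatives = (y_true == 0)
    false_positives = negatives & (y_pred == 1)
    return false_positives.sum() / negatives.sum()

def fpr_ratio(y_true, y_pred, group):
    """FPRR between the unprivileged (group == 0) and privileged (group == 1) group."""
    fpr_unpriv = false_positive_rate(y_true[group == 0], y_pred[group == 0])
    fpr_priv = false_positive_rate(y_true[group == 1], y_pred[group == 1])
    return fpr_unpriv / fpr_priv

# Toy example with made-up labels, predictions, and group membership.
y_true = np.array([0, 0, 1, 0, 0, 1, 0, 0])
y_pred = np.array([1, 0, 1, 0, 1, 1, 0, 0])
group  = np.array([0, 0, 0, 0, 1, 1, 1, 1])
print(f"FPRR = {fpr_ratio(y_true, y_pred, group):.2f} (acceptable range: 0.80-1.25)")
```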
A useful way to summarize the FPRR over a complete model and dataset is the average FPRR distance, or in fact the distance for any fairness metric of your liking. We define this as the average distance of the fairness metric to the unbiased value of that metric, calculated in the following way:
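A minimal sketch of this calculation, assuming it is the average absolute distance to the unbiased value of 1 (consistent with the example below):

```python
def average_fprr_distance(fprr_values):
    """Average absolute distance of the FPRR to its unbiased value of 1, across all group splits."""
    return sum(abs(1 - fprr) for fprr in fprr_values) / len(fprr_values)

print(average_fprr_distance([0.95, 0.90, 0.88, 1.11]))  # 0.095
```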
For example, an average FPRR distance of 0.2 means that on average, across all sensitive groups, the FPRR is 0.2 below the unbiased value 1.
With the preparations behind us, it is time to do the actual analysis. We are combining three steps in this part since they follow the same principle. The metric we have selected, and pretty much every other common metric out there, works by comparing groups. To put it simply: we calculate some metrics on group A, do the same thing for group B, and see if there’s a difference. Based on the presence and size of that difference, we draw conclusions.
The attribute or feature values need to be split into two groups, which, from now on, we will call ‘privileged’ and ‘unprivileged’. Note that the naming ‘privileged’ and ‘unprivileged’ is slightly misleading because, in fact, bias against either group will be identified. However, we’ll follow the terminology of AIF360 here. The easiest way to understand how to perform a split is through an example using the sensitive attribute sex. The most common discrimination in society for the attribute sex is against women. So, if the feature sex were present in the dataset, we could split it as follows: the privileged group contains men and the unprivileged group contains women.
In the case of continuous variables, we can set a threshold, so that all subjects with a value below the threshold will fall in one group, and all subjects with a value above it in the other. If the desired groups are not contiguous, we could of course also create a boolean to indicate the groups and split on that. For features without obvious groups, we often found it useful to check the distribution of the feature values to see if any “natural” groups stood out.
In any case, splitting the groups is not straightforward; it’s a subjective decision that should be taken consciously. For example, definitions of western and non-western countries are debatable, and a split that's suitable for one model may not capture the right differences in another context.
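As a small sketch of what such splits can look like in code (toy data and hypothetical column names, not the actual feature set of the model):

```python
import pandas as pd

df = pd.DataFrame({
    "sex": ["M", "F", "F", "M"],
    "avg_age_occupants": [34.0, 61.5, 45.0, 72.0],
})

# Boolean split on a sensitive attribute: privileged = men, unprivileged = women.
df["privileged_sex"] = df["sex"] == "M"

# Threshold split on a continuous feature, here at the 75th percentile of the average age.
threshold = df["avg_age_occupants"].quantile(0.75)
# Which side counts as 'privileged' is a conscious, subjective choice.
df["privileged_age"] = df["avg_age_occupants"] < threshold
```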
Now that we know how to create groups, let's untangle again those three steps that we combined:
First, we want to analyze features used by the model that directly map a sensitive attribute. This will tell us if our model has a direct bias. Examples are any features involving sex or age, two sensitive attributes that are present in the dataset used to develop the model. From the beginning, we decided together with the business that we would use these features if and only if no bias was identified and their importance was high.
Second, we analyze groups based on the sensitive attributes that we have available, but that are not used by the model as features. This is the best way to measure indirect bias. An example is nationality: we know the nationalities of the occupants of an address, but it has never been considered as a feature. If, after training the model without this feature, we see that our metric doesn’t differ between groups based on nationality, then we can conclude that there’s no indirect bias on nationality. Based on this result, we don’t need to further consider any hypotheses we had involving an indirect bias on nationality.
Since the previous step has hopefully slimmed down the list of hypotheses about features leading to indirect bias, the third step is to check only the ones left. This can be done by focusing on the groups identified in the features that we suppose carry the indirect bias. Out of the three steps described, this is the most difficult to handle if you do find a bias gap between groups (which is the reason why we do it last). We must carefully consider how likely it is that the gap translates to an actual indirect bias, as per the hypothesis. Ideas are to consider the size of the gap and to look for evidence of how strong the hypothesized correlation is, for example in scientific literature.
Although this blog post focuses primarily on the technical side of analyzing bias, creating fair and ethical models involves much more. We see our job as data scientists in this process as providing the business stakeholders with the required knowledge, information, and advice to make an informed decision about “their” model. Part of that is doing a bias analysis.
By this point, we have obtained a bunch of results about the biases present (or hopefully: not present) in our model. Those results need to be discussed with the stakeholders so that they understand what’s going on and can decide if there are any biases that need to be mitigated. Even though that’s written here as step 7 out of 8, we in fact already involved our stakeholders in almost all the decisions we took to get here: the sensitive attributes, the list of features to check, how to group them, which fairness metric to look at, and so on. All these were verified by or decided together with them.
If a bias is found and needs to be mitigated, we must decide how to do so. There are three main methodologies of bias mitigation: pre-processing (transforming the data before training), in-processing (adjusting the learning algorithm itself), and post-processing (adjusting the model’s predictions).
During our research, we decided to focus on pre-processing algorithms, because they reduce bias already at training time, and the earlier the better. The chosen algorithm was the reweighing technique. We will not go into detail on how the algorithm works because the bias could be mitigated in other ways.
In this second part, we discussed our methodology and analysis. We explained how to decide which attributes needed to be investigated and how to construct hypotheses about which features could cause indirect bias. We selected suitable metrics for our project and explained how to perform the three analysis stages. In the third part, we will discuss the findings and conclusions. We hope that reading all parts will inspire and help you to carry out a bias analysis yourself!
Authors: Sebastian Davrieux, Meeke Roet, Swaan Dekkers & Bart de Visser
This article originates from: Part 2: Methodology: Analyzing Bias in Machine Learning: a step-by-step approach (amsterdamintelligence.com)
This is the third part of three blog posts regarding the work we have been doing at the city of Amsterdam, focusing on bias in machine learning. In part 3, we’ll discuss the results, conclusions, and future considerations. If you’ve made it this far after having read part 1 and part 2, you should now have a decent understanding of some important concepts regarding bias in machine learning and the related methodology. Recall that in these blog posts, we are concerned with ensuring that different groups of people are treated fairly by our machine learning model when a suspicion of illegal holiday renting at their address arises.
The data was split into two parts: a training set and a test set. There are multiple ways of performing this split. In this project, we used the following two: a temporal split, in which the most recent cases form the test set, and a shuffle split, in which old and new cases are randomly mixed before splitting.
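A rough sketch of the difference between the two splits (with a toy DataFrame and hypothetical column names):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: one row per case, with the date it was opened and the number of reports received.
cases = pd.DataFrame({
    "case_opened_at": pd.date_range("2018-01-01", periods=10, freq="90D"),
    "num_reports": [0, 1, 0, 2, 1, 1, 3, 1, 2, 1],
})

# Temporal split: the most recent 20% of cases form the test set.
cases_sorted = cases.sort_values("case_opened_at")
cutoff = int(len(cases_sorted) * 0.8)
train_temporal, test_temporal = cases_sorted.iloc[:cutoff], cases_sorted.iloc[cutoff:]

# Shuffle split: old and new cases are mixed before splitting.
train_shuffle, test_shuffle = train_test_split(cases, test_size=0.2, shuffle=True, random_state=42)
```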
When the bias analysis started, the model was trained using a temporal split, with 20% of the data left as the test set. This is the most natural way of splitting the data, since the model in production will predict recent cases. The results showed bias on almost all the investigated attributes, with an average FPRR distance of 0.40 from the unbiased value. Such a negative result was suspiciously bad, so we investigated it further.
We analyzed the distribution of the values of all the features used by the model, comparing the training and the test set. This revealed that the feature generated by counting how often an address was reported by citizens contained some zero values in the training set, but never in the test set. This means that all the cases in the test set were opened when a report was received, while in the past, cases were also opened without a report, using other processes.
We suspected that this was the reason why our results were so biased in every aspect. To confirm this hypothesis, we split the data once again, this time using the shuffle technique. The idea is that the shuffle technique, by mixing old and new cases, creates a new test set containing cases both with and without reports.
One could object that this could lead to artificially better performance, because the model would learn information from the future and use it to classify cases from the past. This concern was addressed by checking whether the model had the same performance using both splitting techniques, which was indeed the case.
The model trained on the data split using the shuffle technique performed noticeably better in the bias analysis, reducing the average false-positive rate ratio distance from 0.40 to 0.16. This result is a strong hint that our intuition was correct.
To confirm that it was not the shuffling technique itself that reduced the bias, but the presence of cases without any report in the dataset, we ran a test in which we removed from the shuffled data the cases where the number of reports was equal to zero. The result was that the model again contained a high bias, with an average FPRR distance of 0.31.
From this analysis, we can conclude that it’s important to create cases through different processes than just relying on reports from citizens. These extra cases are necessary to reduce the bias of the investigations. These other processes stopped temporarily during the coronavirus period, because of the influence the virus had on the hospitality industry and to protect investigators and the investigated against infection. They have since resumed and are generating non-report cases again. For the analysis of the features, the shuffle split test set will be used.
For direct bias, we had a few features that directly map to one of the two sensitive attributes present in the data: sex and age.
Gender is used to describe the characteristics of women and men that are socially constructed, while sex refers to those that are biologically determined. Most people are born as what we define as female or male, and learn to be girls and boys who grow into women and men. This learned behavior makes up gender identity and determines gender roles. We analyzed only female and male, and no other sexes, since these were the only sexes present in the data. However, if this information becomes available in the future, we will consider other sexes as well.
The analysis of sex was done by analyzing the features directly derived from it. The split was made by assigning an address to the privileged group if no women were living there, and to the unprivileged group if at least one woman was living there, mapping potential discrimination against women.
The analysis revealed discrimination against the privileged group. So, instead of discrimination against women, in this case men were discriminated against. Given our main error metric, this means that the error made by the model on cases with only male occupants was bigger than the error on cases with at least one female occupant.
The outcome of this analysis combined with the low feature importance led to the removal of the feature itself. After removing the feature from the model, the bias was checked again, and it was successfully mitigated.
The bias on age was checked using features derived from age, such as the average age of the occupants. Since this is not a boolean feature like sex and it also doesn’t have a logical threshold, the decision of where to split it was taken by analyzing the distribution of the feature.
The split was made at roughly the 75th percentile, with the purpose of dividing the younger 75% of the population from the older 25%. The false-positive rate ratio for the average age of occupants was 1.11. Since this result is between 0.80 and 1.25, it cannot be considered a bias against age. To be sure, we analyzed a few more split values, none of which indicated bias. Nevertheless, since age is a sensitive attribute and the feature importance was not very high, we decided to remove the age features.
To analyze the indirect bias by the underlying attributes, we defined the legal grounds to have access to the Basisregistratie Personen (BRP). This made it possible to retrieve the sensitive attributes of nationality, country of birth, and civil status for the purpose of the bias analysis (never for training).
The analyses of nationality and country of birth were both done the same way. To avoid describing the same steps and logic multiple times, here we'll explain our approach with nationality, but the same was done for the country of birth.
The first split we created was between addresses where all the occupants were Dutch and addresses where this wasn’t the case, to discover any discrimination against Dutch or non-Dutch families. However, discrimination is not only based on being Dutch / non-Dutch, there is also broader discrimination against people who are from specific countries. To investigate this kind of discrimination the country list was divided into two groups: western countries and non-western countries. We followed the definition of the Dutch national bureau of statistics (CBS) in this: in their statistics, countries in Africa, Latin America and Asia (excluding Indonesia and Japan), and Turkey are counted as non-western. Based on this, we created a second split, mapping at which addresses all the occupants had a western nationality.
An extra check was made by extending the idea of the previous two splits but using a ratio instead of a boolean value, for instance the ratio of occupants with a western nationality relative to the total number of occupants. The split between the privileged and the unprivileged group was made at a ratio of 0.5, in this case identifying the addresses where at least half the occupants had a western nationality. The results are as follows:
Feature | False positive rate ratio
All Dutch nationality vs. not all Dutch | 0.95
All western nationality vs. not all western | 0.90
All born in the Netherlands vs. not all born in the Netherlands | 0.90
All born in western country vs. not all born in western country | 0.88
Ratio of Dutch nationality occupants relative to total number of occupants | 0.97
Ratio of western nationality occupants relative to total number of occupants | 0.90
Ratio of occupants born in the Netherlands relative to total number of occupants | 0.95
Ratio of occupants born in western country relative to total number of occupants | 0.90
Since all the results are between 0.80 and 1.25, the outcomes of the model cannot be considered biased against nationality and country of birth attributes.
Civil status indicates whether someone is, for example, married, divorced, or single. It’s a categorical column with multiple possible values. We mapped each of these categories to a feature: the number of occupants at an address belonging to that category divided by the total number of occupants. The group split was made at the value 0.5.
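A sketch of how such ratio features and the corresponding split could be derived with pandas (toy person-level data and hypothetical column names, not the actual pipeline):

```python
import pandas as pd

persons = pd.DataFrame({
    "address_id":   [1, 1, 1, 2, 2],
    "civil_status": ["married", "married", "unmarried", "divorced", "unmarried"],
})

# Ratio of occupants per civil-status category, for each address.
ratios = (
    persons.groupby("address_id")["civil_status"]
    .value_counts(normalize=True)
    .unstack(fill_value=0)
)

# Split at 0.5: one group contains the addresses where at least half the occupants are married.
at_least_half_married = ratios["married"] >= 0.5
```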
The categories are the following: unmarried, married, registered partnership, divorced after marriage, divorced after registered partnership, widowed after marriage, widowed after registered partnership, and unknown.
This is the result of the analysis:
Feature | False positive rate ratio
Ratio of unmarried occupants relative to total number of occupants | 0.90
Ratio of married occupants relative to total number of occupants | 0.90
Ratio of occupants with a registered partnership relative to total number of occupants | -
Ratio of occupants divorced after marriage relative to total number of occupants | 1.13
Ratio of occupants divorced after registered partnership relative to total number of occupants | 1.34
Ratio of occupants widowed after marriage relative to total number of occupants | 0.86
Ratio of occupants widowed after registered partnership relative to total number of occupants | 1.72
Ratio of occupants with unknown civil status relative to total number of occupants | 0.95
Three groups deserve more investigation: occupants with a registered partnership, occupants divorced after a registered partnership, and occupants widowed after a registered partnership.
The first one was not scored at all, because there was no variation in this feature. The other two have only very few (1 or 2) observations with a value different from zero, leading to high but unreliable metrics. In hindsight, they should have been grouped together with married, divorced, and widowed respectively, since marriage and a registered partnership are virtually equal under Dutch law. Because of this similarity, and because no bias was found for the marriage-related civil statuses, we think it is safe to assume this also isn’t a problem for the categories related to a registered partnership. Of the other categories, none can be considered biased.
The bias on the following sensitive attributes was not measured since the data was not available:
The remaining indirect bias hypotheses were analyzed by splitting on the features involved. The resulting fairness metrics were always between 0.8 and 1.25, so in an acceptable range.
Altogether, direct and indirect bias were analyzed, yielding useful insights about the model and the data it uses. The first important outcome is a confirmation of the intuition that it’s important to create cases through different processes, not only by relying on reports from citizens. The analysis shows that these extra cases are necessary to reduce the bias of the investigations. Since the creation of cases through other methods has resumed, and these cases are merged with the cases opened after a report is received, the overall error made by the model cannot be considered biased. The second important outcome is that men were discriminated against when the model was trained using sex as a feature. The error made by the model on cases with only male occupants at the address was bigger than the error when there were also, or only, female occupants. This outcome, combined with the low feature importance, led to the removal of the feature sex.
The purpose of our bias analysis was to ensure that our holiday rental model is not biased. For that reason, we focused on detecting bias in the model outcomes and not in the underlying data. Since we did obtain some results suggesting bias in the underlying data, it would be interesting in the future to analyze the cases opened as a result of citizens’ reports to better understand the bias that those reports carry.
It would also be interesting to investigate what can be done to better analyze the bias on sensitive attributes that are not available within the municipality. We did take the missing attributes into account by including them when making hypotheses about correlations that lead to indirect bias, but this is obviously not ideal. For instance, we could look for information about the missing attributes in other places, such as at the CBS, or see if we can approximate them by measuring other, correlated variables. On the other hand, some sensitive attributes are not available to the municipality (or anywhere, for that matter) for very good reason, and an argument can be made that we should not try to obtain or approximate them even for the purpose of a bias analysis. This is a broader discussion that has not yet been concluded.
The analysis is limited by the choices that were made for the division of cases into privileged/unprivileged groups. For example, we cannot be sure that no bias would have been found if the groups had been created differently. We deem this risk small, since the groups were carefully crafted based on theory and the distributions seen in the data. Moreover, multiple variants were tested for the features that did not have clear splits. However, since the groups were all based on one feature, something we cannot rule out is that bias is present in a subgroup of cases that is not defined by just one feature, but by a combination of features. Another point to note is that the groups are influenced by the assumptions of the people making them. That’s why it’s important to carefully discuss them with a diverse group of people. Work is also being done on alternative methods that do not require groups to be specified upfront.
As for the fairness metric, selecting one main metric proved to be essential to make the analysis feasible, but the results should be interpreted accordingly; we cannot make definite statements about the fairness of the model according to other metrics. In our case, the type of error with the worst impact for a citizen was rather clear and the choice for a main metric followed from that easily. If that’s not the case, focusing so strongly on one metric becomes more problematic.
Although rooted in historical studies, the use of the four-fifths rule is in the end a subjective decision. For applications with a particularly high impact, the 20% margin it allows may be too wide. Likewise, uncertainty about the true values of our metrics due to small sample sizes could prompt tighter bounds, since a metric that falls just within the bounds may in reality be much larger. In this project, we concluded that the four-fifths rule was suitable. An alternative could be to estimate the standard deviation of the fairness metric by bootstrapping, and then to check whether the unbiased value of the fairness metric lies within a certain confidence interval.
All in all, the methods described here are not perfect and are still under active development. Furthermore, they should be embedded in a broader framework of (legal, ethical, etc.) fairness interventions throughout the project. This is therefore not a clear-cut solution to all bias problems, but rather an example of what a bias analysis can look like in practice, from a technical standpoint. There’s an abundance of theory on bias in machine learning, but practical examples are hard to find. That’s a shame, because the easier it is to deal with bias, the more data scientists will do it. We hope that this piece has contributed to that.
Authors: Sebastian Davrieux, Meeke Roet, Swaan Dekkers & Bart de Visser
This article originates from: Part 3: Analysis and conclusion: Analyzing Bias in Machine Learning: a step-by-step approach (amsterdamintelligence.com)
Amsterdam is the most populous city in the Netherlands, housing nearly one million citizens, and we increasingly use AI technologies to improve the speed and quality of the complex tasks that affect this large population. However, to ensure that these improvements are felt equally by every citizen and population group, we pay a lot of attention to investigating the fairness of algorithms. In this blogpost, we dive into Hierarchical Bias-Aware Clustering, a new method to detect biases that cause an algorithm to treat groups of people differently. This method can be added to the City’s growing toolkit of instruments used to assess the fairness and inclusiveness of algorithms and, ultimately, to ensure that all algorithms used by the City are ethical.
Algorithmic fairness relates to the absence of any prejudice or favouritism toward an individual or a group based on their inherent or acquired characteristics, such as ethnicity, age and gender [1]. When an algorithm produces skewed outcomes based on these sensitive attributes, it is said to contain undesired bias. As Figure 1 shows, this bias can be found at multiple stages of the algorithm development cycle.
In the past few years, several projects were set up to research the fairness of Amsterdam’s algorithms by finding, mitigating and preventing undesired bias. For instance, Rik Helwegen looked into using causality for training fair models and enforcing counterfactual fairness [2], Tim Smit researched the construction of the needed causal graphs, and Joosje Goedhart investigated the cost of group fairness when compared to individual fairness.
However, the studied methods required much domain expertise from civil servants and developers: for many fairness instruments, the sensitive attributes in the dataset often needed to be pre-specified. Besides the pre-specification of sensitive attributes, some fairness instruments require civil servants to predefine the vulnerable groups in the dataset on which an algorithm is more likely to produce skewed outcomes. Moreover, since the City uses a wide range of different algorithms, a fairness method should be able to cater to these diverse models. Therefore, to reduce the need for domain expertise and to offer a more generalizable fairness instrument, we propose the method Hierarchical Bias-Aware Clustering (HBAC), which is inspired by a study conducted by Misztal-Radecka and Indurkhya [3], who developed a bias-aware hierarchical clustering model to automatically detect discriminated groups of users in recommendation algorithms.
The main rationale behind the Hierarchical Bias-Aware Clustering method is to use the errors produced by an algorithm as a light in the dark to find bias. As no algorithm is entirely error-proof, error patterns give insight into how an algorithm can be skewed towards certain groups. For classification algorithms, these errors can be categorised into False Positives and False Negatives, as shown in the confusion matrix below. With these errors, we automatically detect groups for which an algorithm produces substantially more errors, indicating a discrimination bias, or substantially fewer errors, which highlights a group favoured by the algorithm. The formation of groups based on these error discrepancies is handled by a clustering algorithm.
The clustering algorithm is applied to the results of the algorithm whose fairness we want to investigate. The main task of the clustering model is to find a natural grouping among data points that, ideally, leads to meaningful or useful groups. In our case, we want to find clusters of persons sharing similar characteristics, such as a similar age or the same gender, for which the cluster has a high discrimination bias, potentially indicating an unbalanced dataset or a bias in the algorithm’s objective. A high discrimination bias implies that the algorithm produced substantially more errors for this group of persons than for all the other groups. We can also use the bias to find a favouring bias, that is, a group of persons for which the algorithm produces considerably fewer errors. To find these discrimination or favouring biases, we added the errors as a new attribute to the dataset. However, since we wanted to avoid overusing the errors and ending up with clusters containing persons who share no similarities except for the presence or absence of errors, we experimented with scaling methods for the error feature. This error-scaling trade-off is illustrated in the Figure below.
Now that we have studied the key components of HBAC, it is time to delve into the methodology of this fairness instrument. Essentially, the methodology comprises three consecutive steps: clustering the data, calculating the bias of each cluster, and describing the clusters with the highest discrimination bias.
After scaling and preprocessing the dataset, all instances are placed in the same cluster. Then, we split this cluster hierarchically into smaller clusters with any standard clustering algorithm. During this study, we experimented with K-Means, DBSCAN and MeanShift. Then, we calculate the performance (or: error rate) of the classifier on each cluster using a performance metric, such as Accuracy, Precision or Recall. This evaluation metric of the cluster is then used to calculate the bias, which is formalized as follows:
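Our reading of this definition (an assumption, consistent with the explanation that follows) is the difference in performance between a cluster and the rest of the data:

bias(G) = M_G - M_\G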
where M_G indicates the performance of the algorithm on a group G and M_\G represents its performance on the remaining clusters.
Based on this bias definition, we state that a cluster has a discrimination bias when the bias is smaller than zero, as the classifier produced substantially more errors for this cluster than for the other groups and therefore has a lower performance on it. Conversely, a favouring bias occurs when the bias is higher than zero.
After calculating the discrimination bias for each of the clusters, we compare these biases with each other to determine which of the clusters will be split into new clusters during the next iteration. Besides the discrimination bias, we also use criteria such as a minimal splittable cluster size to select a cluster that is sufficiently large, since we want to find meaningful groups of persons whom the classifier potentially discriminates against.
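To make this procedure a bit more concrete, here is a heavily simplified, illustrative sketch of one HBAC-style iteration with K-Means (our own toy implementation under the assumptions above, using accuracy as the performance metric; it is not the City's code):

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_bias(correct, cluster_mask):
    """Bias = accuracy inside the cluster minus accuracy on the rest; negative means discrimination."""
    if cluster_mask.all() or not cluster_mask.any():
        return 0.0  # no counterpart to compare against
    return correct[cluster_mask].mean() - correct[~cluster_mask].mean()

def hbac_step(X, correct, labels, min_size=20, random_state=0):
    """Split the most discriminated, sufficiently large cluster into two using K-Means."""
    candidates = [c for c in np.unique(labels) if (labels == c).sum() >= min_size]
    if not candidates:
        return labels  # nothing left that is large enough to split
    worst = min(candidates, key=lambda c: cluster_bias(correct, labels == c))
    mask = labels == worst
    sub = KMeans(n_clusters=2, n_init=10, random_state=random_state).fit_predict(X[mask])
    new_labels = labels.copy()
    new_labels[mask] = np.where(sub == 0, worst, labels.max() + 1)
    return new_labels

# Usage sketch: X is a NumPy array holding the (scaled) features plus the scaled error feature,
# and correct = (y_pred == y_true) marks which predictions of the audited classifier were right.
# labels = np.zeros(len(X), dtype=int)
# for _ in range(5):
#     labels = hbac_step(X, correct, labels)
```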
Figure 5 shows an example of how HBAC identified clusters with a high discrimination bias on a synthetic dataset, in which we manually inserted clusters with higher error densities. As depicted in Figure 5, HBAC-DBSCAN managed to find these dense regions of errors. Besides our experiments on two synthetic datasets, we also applied HBAC to COMPAS and another real-world dataset. COMPAS is a dataset used in US courtrooms to predict the likelihood that defendants will commit another crime. We discovered that DBSCAN succeeded in finding the highest discrimination bias on the synthetic dataset, whereas K-Means was most effective in identifying the highest biases on the real-world datasets.
After identifying the clusters with the highest discrimination bias, we are interested in describing the persons within these clusters. Using those descriptions and analyses of the discriminated groups, we can understand which citizens might be affected by the misclassifications, and we can then use this information to retrace why the algorithm underperformed on these persons. This helps with developing effective counteractive mechanisms that mitigate the algorithm’s discriminating or favouring behaviour. Multiple visualisations can be used to highlight and compare the persons in the discriminated cluster and in the remaining groups. In our current research, we used parallel coordinate plots (Fig 6) and density distribution plots (Fig 7) to describe the persons in the cluster with the highest discrimination bias. With a parallel coordinate plot, we compare the average values for each feature between the discriminated cluster and the remaining clusters. The parallel coordinate plot in Figure 6, for example, shows that the discriminated cluster contained more persons of African-American ethnicity than the other clusters. On the other hand, density distribution plots, such as the one displayed in Figure 7, are used to compare the distributions of the discriminated and the remaining clusters for each feature separately. These visualisations are of key importance for civil servants who want to become aware of which groups an algorithm is potentially discriminating against.
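A minimal sketch of such visualisations with pandas and matplotlib (toy data and a hypothetical column marking the discriminated cluster; not the code behind Figures 6 and 7):

```python
import matplotlib.pyplot as plt
import pandas as pd
from pandas.plotting import parallel_coordinates

# Toy data: scaled features plus a column marking whether a person is in the discriminated cluster.
df = pd.DataFrame({
    "age":          [0.2, 0.8, 0.4, 0.9, 0.3, 0.7],
    "priors_count": [0.1, 0.7, 0.2, 0.9, 0.3, 0.8],
    "cluster":      ["rest", "discriminated", "rest", "discriminated", "rest", "discriminated"],
})

# Parallel coordinate plot of the average feature values per cluster.
means = df.groupby("cluster").mean().reset_index()
parallel_coordinates(means, class_column="cluster")
plt.title("Average feature values: discriminated cluster vs. rest")
plt.show()

# Density plot of a single feature, per cluster.
df.groupby("cluster")["age"].plot(kind="density", legend=True)
plt.title("Distribution of 'age' per cluster")
plt.show()
```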
To further refine the Hierarchical Bias-Aware Clustering method, we experimented with different error scaling factors, clustering algorithms and visualisation techniques on two real-world and two synthetic datasets. Based on our observations, we found that K-Means was the most suitable clustering algorithm for HBAC, since it managed to find the highest discrimination bias in three of the four datasets. Moreover, HBAC-KMeans also scored highest in terms of scalability and understandability, which were the desired properties of a bias discovery method for the City. Nevertheless, our experiments with DBSCAN and MeanShift provided us with new insights about what kind of datasets are suitable for which kind of clustering technique, since these density-based clustering algorithms performed better on the synthetic datasets than K-Means did. This result could be attributed to the nature of the synthetic datasets: they contained clusters of “errors” within a larger cluster, which DBSCAN and MeanShift were able to pick up on due to their ability to form clusters based on shifts in data densities.
Although HBAC showed promising results in terms of automatically detecting groups with a high discrimination bias, more work is required to increase the scalability and generalizability of this fairness discovery instrument. This can be done by adding more error metrics besides accuracy that can be used to calculate the bias. Additionally, to make HBAC applicable to the results of regression algorithms instead of only classification algorithms, we could use the residuals as an error metric, allowing a wider range of algorithms to be examined for bias. Besides implementing more error metrics, more attention should be paid to developing intuitive visualisations that support civil servants in investigating the fairness of their algorithms. After all, it is precisely these visualisations that highlight the groups who are treated differently by an algorithm.
The findings on this bias discovery method, although preliminary, suggest that HBAC can support developers, civil servants and other stakeholders in further investigating the presence of more complex biases in classification algorithms, for which it is difficult to pre-specify the sensitive attributes or vulnerable groups. Using HBAC, the City of Amsterdam can become more open and transparent towards citizens about the presence of undesired bias in AI technologies. For example, the visualisations could be used by the City to report publicly on how fairness is evaluated in Amsterdam’s Algorithm Register. Ultimately, we want to use HBAC to discover bias, after which we can use other fairness instruments to mitigate and prevent this bias from potentially harming citizens.
For more information about fairness in general and the HBAC pipeline in particular, see the Github page and a recent presentation.
[1] Mehrabi, N., Morstatter, F., Saxena, N., Lerman, K., & Galstyan, A. (2019). A survey on bias and fairness in machine learning.
[2] Helwegen, R., Louizos, C., & Forré, P. (2020). Improving fair predictions using variational inference in causal models. arXiv preprint arXiv:2008.10880.
[3] Misztal-Radecka, J., & Indurkhya, B. (2021, May). Bias-Aware Hierarchical Clustering for detecting the discriminated groups of users in recommendation systems. Information Processing and Management, 58(3), 102519. doi: 10.1016/j.ipm.2021.102519
Author: Selma Muhammad
This article originates from: Auditing Algorithmic Fairness with Unsupervised Bias Discovery (amsterdamintelligence.com)