
Part 3: Analysis and conclusion: Analyzing Bias in Machine Learning: a step-by-step approach

This is the third of three blog posts about the work we have been doing at the City of Amsterdam on bias in machine learning. In part 3, we'll discuss the results, conclusion, and future considerations. If you've made it this far after reading part 1 and part 2, you should now have a decent understanding of some important concepts regarding bias in machine learning and the related methodology. Recall that in these blog posts, we are concerned with ensuring that different groups of people are treated fairly by our machine learning model when a suspicion of illegal holiday renting arises at their address.

Preliminary analysis

The data was split into two parts: a training set and a test set. There are multiple ways of performing this split. In this project, we used the following two (a code sketch follows the list):

  1. Temporal: the data is ordered chronologically before performing the split, obtaining a train set that contains cases that happened earlier than any of the cases in the test set.
  2. Shuffle: the data is shuffled randomly before the split, obtaining a mix of older and newer cases in both the train and the test set.
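As an illustration, here is a minimal sketch of the two splitting strategies in Python. The DataFrame and the `case_opened_at` timestamp column are hypothetical names, not the actual ones used in the project.

    import pandas as pd
    from sklearn.model_selection import train_test_split

    def temporal_split(cases: pd.DataFrame, test_size: float = 0.2):
        """Order cases chronologically; the most recent fraction becomes the test set."""
        ordered = cases.sort_values("case_opened_at")
        cutoff = int(len(ordered) * (1 - test_size))
        return ordered.iloc[:cutoff], ordered.iloc[cutoff:]

    def shuffle_split(cases: pd.DataFrame, test_size: float = 0.2, seed: int = 42):
        """Shuffle cases randomly, mixing old and new cases in both sets."""
        return train_test_split(cases, test_size=test_size, shuffle=True, random_state=seed)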

When the bias analysis started, the model was trained using a temporal split, with 20% of the data left as the test set. This is the most natural way of splitting the data, since the model in production will predict recent cases. The results were surprisingly biased in almost all the investigated attributes, with an average false positive rate ratio (FPRR) distance of 0.40 from the non-bias value. Such a negative result was suspiciously bad, so we investigated it further.

We analyzed the distribution of the values of all the features used by the model, comparing the training and the test set. This revealed that the feature counting how often an address was reported by citizens contained zero values in the training set, but never in the test set. In other words, all the cases in the test set were opened after a report was received, while in the past, cases were also opened without a report, through other processes.
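A minimal sketch of this kind of check, assuming the report count lives in a hypothetical column `n_citizen_reports`:

    import pandas as pd

    def compare_distributions(train: pd.DataFrame, test: pd.DataFrame) -> pd.DataFrame:
        """Summary statistics of every feature, for train and test side by side."""
        return pd.concat({"train": train.describe().T, "test": test.describe().T}, axis=1)

    def zero_report_share(df: pd.DataFrame, col: str = "n_citizen_reports") -> float:
        """Fraction of cases opened without any citizen report."""
        return float((df[col] == 0).mean())

    # Under the temporal split, zero_report_share(test_set) came out as 0.0,
    # while zero_report_share(train_set) did not.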

We suspected that this was the reason why our results were so biased in every aspect. To confirm this hypothesis, we split the data once again, this time using the shuffle technique. The idea is that the shuffle technique, by mixing old and new cases, creates a new test set containing cases both with and without reports.

One could object that this could lead to artificially better performance, because the model would learn information from the future and use it to classify cases from the past. This concern was addressed by checking whether the model performed equally well under both splitting techniques, which was indeed the case.

The bias analysis of the model trained on the shuffled split performed noticeably better, reducing the average false positive rate ratio distance from 0.40 to 0.16. This result strongly suggests that our intuition was correct.

To confirm that it was the presence of cases without any report (and not the shuffling itself) that reduced the bias, we ran a test in which we removed from the shuffled data the cases where the number of reports was zero. The resulting model again showed high bias, with an average FPRR distance of 0.31.
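For reference, here is a minimal sketch of how a false positive rate ratio and the "average distance from the non-bias value" can be computed. The exact group definitions and aggregation used in the project may differ; this is an illustration.

    import numpy as np

    def false_positive_rate(y_true, y_pred) -> float:
        """FP / (FP + TN): share of truly negative cases that the model flags."""
        y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
        negatives = y_true == 0
        return float((y_pred[negatives] == 1).mean())

    def fpr_ratio(y_true, y_pred, unprivileged_mask) -> float:
        """FPR of the unprivileged group divided by FPR of the privileged group."""
        y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
        unprivileged_mask = np.asarray(unprivileged_mask)
        fpr_unpriv = false_positive_rate(y_true[unprivileged_mask], y_pred[unprivileged_mask])
        fpr_priv = false_positive_rate(y_true[~unprivileged_mask], y_pred[~unprivileged_mask])
        return fpr_unpriv / fpr_priv

    def average_fprr_distance(fprr_values) -> float:
        """Mean absolute distance of the per-attribute FPRRs from 1 (no bias)."""
        return float(np.mean([abs(r - 1.0) for r in fprr_values]))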

From this analysis, we can conclude that it's important to create cases through processes other than just relying on reports from citizens. These extra cases are necessary to reduce the bias of the investigations. These other processes were temporarily halted during the coronavirus period, because of the impact the virus had on the hospitality industry and to protect both investigators and those being investigated against infection. The processes have since resumed and are generating non-report cases again. For the analysis of the features, the shuffle split test set will be used.

Results from analyzing direct bias

For direct bias, we had a few features that map directly to one of two sensitive attributes: sex and age.

Sex

Gender is used to describe the characteristics of women and men that are socially constructed, while sex refers to those that are biologically determined. Most people are born as what we define as female or male, and learn to be girls and boys who grow into women and men. This learned behavior makes up gender identity and determines gender roles. We analyzed only female and male, and no other sexes, since these were the only sexes present in the data. However, if this information becomes available in the future, we will consider other sexes as well.

The analysis of sex was done by analyzing the features directly derived from it. For the split, addresses where no women were living formed the privileged group, and addresses where at least one woman was living formed the unprivileged group, mapping potential discrimination against women.
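A minimal sketch of this split, assuming a hypothetical per-address feature `n_female_occupants` and the `fpr_ratio` helper sketched earlier:

    import pandas as pd

    def sex_unprivileged_mask(features: pd.DataFrame) -> pd.Series:
        """True for the unprivileged group: at least one woman living at the address."""
        return features["n_female_occupants"] >= 1

    # fprr_sex = fpr_ratio(y_test, y_pred, sex_unprivileged_mask(X_test))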

The analysis revealed discrimination against the privileged group: instead of women, in this case men were discriminated against. Given our main error metric, this means that the error made by the model on cases with only male occupants was larger than the error on cases with at least one female occupant.

The outcome of this analysis combined with the low feature importance led to the removal of the feature itself. After removing the feature from the model, the bias was checked again, and it was successfully mitigated.

Age

The bias on age was checked using features derived from age, such as the average age of the occupants. Since this is not a boolean feature like sex, and it also doesn't have a logical threshold, the decision of where to split was made by analyzing the distribution of the feature.

The split was chosen around the 75th percentile, with the purpose of dividing the younger 75% of the population from the older 25%. The false positive rate ratio for the average age of occupants was 1.11. Since this value lies between 0.80 and 1.25, it cannot be considered bias against age. To be sure, we analyzed a few more split values, all of which resulted in a non-bias score. Nevertheless, since age is a sensitive attribute and the feature importance was not very high, we decided to remove the age features.
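A minimal sketch of this split, assuming a hypothetical feature `avg_age_occupants`:

    import pandas as pd

    def age_unprivileged_mask(features: pd.DataFrame, quantile: float = 0.75) -> pd.Series:
        """True for addresses whose average occupant age lies above the chosen quantile."""
        threshold = features["avg_age_occupants"].quantile(quantile)
        return features["avg_age_occupants"] > threshold

    # The extra split values mentioned above can be checked by varying `quantile`.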

Results from analyzing indirect bias through underlying attributes

To analyze indirect bias through the underlying attributes, we established the legal grounds for access to the Basisregistratie Personen (BRP). This made it possible to retrieve the sensitive attributes nationality, country of birth, and civil status for the purpose of the bias analysis (never for training).

Nationality and country of birth

The analyses of nationality and country of birth were both done the same way. To avoid describing the same steps and logic multiple times, here we'll explain our approach with nationality, but the same was done for the country of birth.

The first split we created was between addresses where all the occupants were Dutch and addresses where this wasn't the case, to discover any discrimination against Dutch or non-Dutch families. However, discrimination is not only based on being Dutch or non-Dutch; there is also broader discrimination against people from specific countries. To investigate this kind of discrimination, the country list was divided into two groups: western and non-western countries. We followed the definition of the Dutch national statistics bureau (CBS): in their statistics, countries in Africa, Latin America and Asia (excluding Indonesia and Japan), and Turkey are counted as non-western. Based on this, we created a second split, mapping at which addresses all the occupants had a western nationality.

An extra check was made by extending the idea of the previous two features, but using a ratio instead of a boolean value, for instance the ratio of occupants with a western nationality relative to the total number of occupants. The split between the privileged and the unprivileged group was made at a ratio of 0.5, in this case identifying the addresses where at least half the occupants had a western nationality. The results are as follows:

Feature | False positive rate ratio
All Dutch nationality vs. not all Dutch | 0.95
All western nationality vs. not all western | 0.90
All born in the Netherlands vs. not all born in the Netherlands | 0.90
All born in western country vs. not all born in western country | 0.88
Ratio of Dutch nationality occupants relative to total number of occupants | 0.97
Ratio of western nationality occupants relative to total number of occupants | 0.90
Ratio of occupants born in the Netherlands relative to total number of occupants | 0.95
Ratio of occupants born in western country relative to total number of occupants | 0.90

Since all the results are between 0.80 and 1.25, the outcomes of the model cannot be considered biased against nationality and country of birth attributes.
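To make the construction of these splits concrete, here is a minimal sketch, assuming hypothetical per-address counts `n_occupants` and `n_western_nationality` derived from the BRP extract; the Dutch-nationality and country-of-birth splits follow the same pattern:

    import pandas as pd

    def all_western_mask(addresses: pd.DataFrame) -> pd.Series:
        """Boolean split: every occupant at the address has a western nationality."""
        return addresses["n_western_nationality"] == addresses["n_occupants"]

    def western_ratio_mask(addresses: pd.DataFrame, threshold: float = 0.5) -> pd.Series:
        """Ratio split: at least half of the occupants have a western nationality."""
        ratio = addresses["n_western_nationality"] / addresses["n_occupants"]
        return ratio >= threshold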

Civil status

Civil status indicates whether someone is, for example, married, divorced or single. It's a categorical column with multiple possible values. We mapped each of these categories to a feature indicating the ratio of occupants at an address belonging to that category relative to the total number of occupants. The split between groups was set at the value 0.5.

The categories are the following:

  • no marriage;
  • divorced;
  • widow;
  • married;
  • registered partner;
  • unknown;
  • registered partner left behind;
  • divorced registered partner.

This is the result of the analysis:

Feature | False positive rate ratio
Ratio of unmarried occupants relative to total number of occupants | 0.90
Ratio of married occupants relative to total number of occupants | 0.90
Ratio of occupants with a registered partnership relative to total number of occupants | -
Ratio of occupants divorced after marriage relative to total number of occupants | 1.13
Ratio of occupants divorced after registered partnership relative to total number of occupants | 1.34
Ratio of occupants widowed after marriage relative to total number of occupants | 0.86
Ratio of occupants widowed after registered partnership relative to total number of occupants | 1.72
Ratio of occupants with unknown civil status relative to total number of occupants | 0.95

Three groups deserve more investigation:

  1. Ratio of occupants with a registered partnership relative to total number of occupants
  2. Ratio of occupants divorced after registered partnership relative to total number of occupants
  3. Ratio of occupants widowed after registered partnership relative to total number of occupants

 

The first one was not scored at all, because there was no variation in this feature. The other two have only very few (1 or 2) observations with a value different from zero, leading to high, but also unreliable, metrics. In hindsight, they should have been grouped together with married, divorced, and widowed respectively, since marriage and a registered partnership are virtually equal under Dutch law. Because of this similarity, and because no bias was found for the marriage-related civil statuses, we think it is safe to assume this also isn't a problem for the categories related to a registered partnership. Of the other categories, none can be considered biased.
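A minimal sketch of the grouping suggested above, with illustrative category labels (the actual labels in the data may differ):

    # Map the sparse registered-partnership categories onto their marriage
    # counterparts before building the ratio features.
    CIVIL_STATUS_GROUPING = {
        "registered partner": "married",
        "divorced registered partner": "divorced",
        "registered partner left behind": "widow",
    }

    def group_civil_status(status: str) -> str:
        """Collapse registered-partnership statuses into the marriage-based ones."""
        return CIVIL_STATUS_GROUPING.get(status, status)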

Non-measurable sensitive attributes

The bias on the following sensitive attributes was not measured since the data was not available:

  • Social class
  • Religion
  • Skin color​
  • Ethnicity​
  • Sexual orientation​
  • Political view​
  • Pregnancy​
  • Health​
  • Genetics
  • Gender
  • Disability

Results from analyzing indirect bias through features

The remaining indirect bias hypotheses were analyzed by splitting on the features involved. The resulting fairness metrics were always between 0.80 and 1.25, so within the acceptable range.

Conclusion

Altogether, direct and indirect bias were analyzed, yielding useful insights about the model and the data it uses. The first important outcome is a confirmation of the intuition that it's important to create cases through different processes, not only by relying on reports from citizens. The analysis shows that these extra cases are necessary to reduce the bias of the investigations. Now that the creation of cases through other methods has resumed and these cases are merged with the cases opened after a report is received, the overall error made by the model can no longer be considered biased.

The second important outcome is that men were discriminated against when the model was trained with sex as a feature. The error made by the model on cases with only male occupants at the address was larger than the error on cases where there were also, or only, female occupants. This outcome, combined with the low feature importance, led to the removal of the feature sex.

Considerations for the future

Bias in model outcomes vs. underlying data

The purpose of our bias analysis was to ensure that our holiday rental model is not biased. For that reason, we focused on detecting bias in the model outcomes and not in the underlying data. Since we did obtain some results suggesting bias in the underlying data, it would be interesting in the future to analyze the cases opened as a result of citizens’ reports to better understand the bias that those reports carry.

Investigating unavailable sensitive attributes

It would also be interesting to investigate what can be done to better analyze bias on sensitive attributes that are not available within the municipality. We did take the missing attributes into account by including them when hypothesizing correlations that lead to indirect bias, but this is obviously not ideal. For instance, we could look for information about the missing attributes elsewhere, such as at the CBS, or see if we can approximate them by measuring other, correlated variables. On the other hand, some sensitive attributes are not available to the municipality (or anywhere, for that matter) for very good reason, and an argument can be made that we should not try to obtain or approximate them even for the purpose of a bias analysis. This is a broader discussion that has not yet been concluded.

Creating groups

The analysis is limited by the choices made for dividing cases into privileged/unprivileged groups. For example, we cannot be sure that no bias would have been found if the groups had been created differently. We deem this risk small, since the groups were carefully crafted based on theory and the distributions seen in the data. Moreover, multiple variants were tested for the features that did not have clear splits. However, since the groups were all based on a single feature, we cannot rule out that bias is present in a subgroup of cases defined not by one feature but by a combination of features. Another point to note is that the groups are influenced by the assumptions of the people making them. That's why it's important to carefully discuss them with a diverse group of people. Work is also being done on alternative methods that do not require groups to be specified upfront.
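As an illustration of such a subgroup check, here is a minimal sketch that intersects pairs of single-feature group masks and reuses the `fpr_ratio` helper sketched earlier; the mask names are hypothetical:

    from itertools import combinations

    def pairwise_subgroup_masks(masks: dict):
        """Yield (name, mask) pairs for every intersection of two group masks."""
        for (name_a, mask_a), (name_b, mask_b) in combinations(masks.items(), 2):
            yield f"{name_a} & {name_b}", mask_a & mask_b

    # for name, mask in pairwise_subgroup_masks({"women": sex_mask, "older": age_mask}):
    #     print(name, fpr_ratio(y_test, y_pred, mask))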

Selecting fairness metric

As for the fairness metric, selecting one main metric proved to be essential to make the analysis feasible, but the results should be interpreted accordingly; we cannot make definite statements about the fairness of the model according to other metrics. In our case, the type of error with the worst impact for a citizen was rather clear and the choice for a main metric followed from that easily. If that’s not the case, focusing so strongly on one metric becomes more problematic.

Bounds on fairness metric

Although rooted in historical studies, the use of the four-fifths rule is in the end a subjective decision. For applications with a particularly high impact, it may be too lenient. Likewise, uncertainty about the true values of our metrics due to small sample sizes could prompt tighter bounds, since a metric that falls just within the bounds may in reality lie well outside them. In this project, we concluded that the four-fifths rule is suitable. An alternative could be to estimate the uncertainty of the fairness metric by bootstrapping, and then check whether the non-bias value of the fairness metric lies within a certain confidence interval.
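A minimal sketch of that bootstrap idea, reusing the `fpr_ratio` helper sketched earlier (shown here with a percentile interval rather than one based on the estimated standard deviation; the resampling scheme and interval level are assumptions):

    import numpy as np

    def bootstrap_fprr_interval(y_true, y_pred, unpriv_mask,
                                n_boot: int = 1000, alpha: float = 0.05, seed: int = 0):
        """Percentile bootstrap confidence interval for the false positive rate ratio."""
        rng = np.random.default_rng(seed)
        y_true, y_pred, unpriv_mask = map(np.asarray, (y_true, y_pred, unpriv_mask))
        n = len(y_true)
        ratios = []
        for _ in range(n_boot):
            idx = rng.integers(0, n, n)  # resample cases with replacement
            ratios.append(fpr_ratio(y_true[idx], y_pred[idx], unpriv_mask[idx]))
        lower, upper = np.quantile(ratios, [alpha / 2, 1 - alpha / 2])
        return lower, upper

    # The model would be flagged as potentially biased on an attribute
    # if the interval does not contain the non-bias value of 1.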

Summed up

All in all, the methods described here are not perfect and are still under active development. Furthermore, they should be embedded in a broader framework of (legal, ethical, etc.) fairness interventions throughout the project. This is therefore not a clear-cut solution to all bias problems, but rather an example of what a bias analysis can look like in practice, from a technical standpoint.

There's an abundance of theory on bias in machine learning, but practical examples are hard to find. That's a shame, because the easier it is to deal with bias, the more data scientists will do it. We hope that this piece has contributed to that.

 

Authors: Sebastian Davrieux, Meeke Roet, Swaan Dekkers & Bart de Visser

This article originally appeared at: Part 3: Analysis and conclusion: Analyzing Bias in Machine Learning: a step-by-step approach (amsterdamintelligence.com)

Image credits

Header image: Taken from https://www.smartcitiesworld.net/news/news/ai-algorithm-capable-of-multi-task-deep-learning-2419

Icon image: Machine learning 6