Inclusive Machine Learning


Google seems to treat fairness and inclusivity as very important subjects in machine learning. The data used to train ML models, if not thoughtfully curated, can introduce bias into the model itself. The goals of this section are to learn how to:

  • identify the origins of bias in ML,
  • make models inclusive, and
  • evaluate ML models for bias.

Machine Learning and Human Bias

The objective of this section is to raise awareness of how human biases can create bias that is baked into the technology we create. Types of biases:

Interaction bias: the video presents a situation where humans asked to draw shoes repeatedly drew pictures resembling Converse sneakers, to the point that the model did not recognize high heels as shoes at all.

Latent bias: if an ML model is trained on images of past physicists, then it will tend to recognize bearded men. The model will skew toward men, at the expense of women.

Selection bias: a random selection of one individual’s photos may not be representative of all human faces.

Evaluating Metrics for Inclusion

One of the best means of evaluating a model's inclusiveness is to check the model's confusion matrix not only for the entire population of test data, but also for subgroups within the dataset. Recall the following two definitions.

False Positives (FP), also known as type 1 errors, occur when the model infers the presence of something that is not actually there. False Negatives (FN), also known as type 2 errors, occur when the model fails to infer the presence of something that is actually there.

Statistical Measurements and Acceptable Tradeoffs

Depending on the nature of the model, false positives may be more of an issue than false negatives. Or, vice versa. In particular, there are two important error rates.

$$\text{False Positive Rate}=\frac{\text{False Positives}}{\text{False Positives}+\text{True Negatives}}$$

$$\text{False Negative Rate}=\frac{\text{False Negatives}}{\text{False Negatives}+\text{True Positives}}$$

There are also two important success rates.

$$\text{Precision}=\frac{\text{True Positives}}{\text{True Positives}+\text{False Positives}}$$

$$\text{Recall}=\frac{\text{True Positives}}{\text{True Positives}+\text{False Negatives}}$$
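The four formulas above can be computed directly from raw confusion-matrix counts. A minimal sketch, using made-up counts for illustration:

```python
def rates(tp, fp, tn, fn):
    """Compute the four metrics defined above from raw confusion-matrix counts."""
    return {
        "false_positive_rate": fp / (fp + tn),
        "false_negative_rate": fn / (fn + tp),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }

# Hypothetical counts: 80 TP, 10 FP, 90 TN, 20 FN
m = rates(tp=80, fp=10, tn=90, fn=20)
print(m)  # recall = 80/100 = 0.8, FNR = 20/100 = 0.2
```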

Whether it is more important to optimize a model to minimize false positives or false negatives depends on the context the model operates in.

As an example, consider an ML model that determines whether an image should be blurred to preserve privacy. A False Positive in that situation (blurring part of an image that didn't need it) would be much less costly than a False Negative (failing to blur personal information in an image and, as a result, compromising someone's identity). This type of model should be optimized for a low False Negative rate.

Consider a different type of ML model, a spam filter. A False Negative would be a spam email that gets routed into the normal mailbox. This is a minor inconvenience. On the other hand, a False Positive, where a potentially important personal email is classified as spam, would be a bigger loss. So, this type of ML system should be optimized for a low False Positive rate.

I discuss the process of binary classification in a different note, but the gist is that it involves selecting a threshold value that strikes an appropriate balance between False Positives and False Negatives. This balance is found by assigning weights to False Positives and False Negatives, and then choosing the threshold that minimizes the weighted sum of the False Positive and False Negative counts.
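A simple sketch of that weighted threshold selection (the helper and the scores below are invented for illustration, not taken from the note referenced above):

```python
import numpy as np

def pick_threshold(scores, labels, fp_weight, fn_weight):
    """Sweep candidate thresholds and return the one that minimizes
    fp_weight * FP_count + fn_weight * FN_count."""
    best_t, best_cost = None, float("inf")
    for t in np.linspace(0, 1, 101):
        preds = scores >= t
        fp = np.sum(preds & (labels == 0))   # predicted positive, actually negative
        fn = np.sum(~preds & (labels == 1))  # predicted negative, actually positive
        cost = fp_weight * fp + fn_weight * fn
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t

# Spam-filter setting: false positives (real mail flagged) weighted heavier
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.95, 0.6])
labels = np.array([0, 0, 1, 1, 1, 0])
t = pick_threshold(scores, labels, fp_weight=5.0, fn_weight=1.0)
```

With the heavier False Positive weight, the sweep settles on a higher threshold than a symmetric weighting would, trading some missed spam for fewer misfiled personal emails.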

The added wrinkle that Google proposes is that in addition to testing the performance of the model globally, across the set of all test cases, the model’s performance should also be tested on each of the subgroups. Acceptable tradeoffs have to be considered for the global dataset, but the model’s performance should be acceptable for all of the subgroups as well.
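That subgroup check can be sketched in a few lines. The predictions and group labels below are made up; the point is that an acceptable overall error rate can hide a much worse rate within one subgroup:

```python
import numpy as np

def fnr_by_group(preds, labels, groups):
    """False Negative Rate for the whole test set and for each subgroup."""
    def fnr(p, y):
        fn = np.sum((p == 0) & (y == 1))
        tp = np.sum((p == 1) & (y == 1))
        return fn / (fn + tp)
    out = {"overall": fnr(preds, labels)}
    for g in np.unique(groups):
        mask = groups == g
        out[g] = fnr(preds[mask], labels[mask])
    return out

# Hypothetical predictions for two subgroups, A and B (all true labels positive)
preds  = np.array([1, 1, 0, 1, 0, 0])
labels = np.array([1, 1, 1, 1, 1, 1])
groups = np.array(["A", "A", "A", "B", "B", "B"])
print(fnr_by_group(preds, labels, groups))
# Overall FNR is 3/6 = 0.5, but group A sits at 1/3 while group B sits at 2/3
```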

Equality of Opportunity

Google presents an approach to model optimization called Equality of Opportunity. Many ML models’ goals are to determine whether a particular individual meets some minimum criteria for acceptance. As examples, think of credit cards or university applications.

In these examples, the test data may have the typical creditworthiness or student information, such as income or SAT scores. Beyond just this information, the test data may also have labels for group membership. Some special groups are known as “protected groups,” a legal term. As an example, federal law prohibits discrimination based on age.

Google suggests that members of protected groups should have the same opportunity to meet the criteria as nonmembers. They suggest creating situations where the true positive rate for both groups is equal, which seems to imply different thresholds, or thresholds that are a function of group membership.
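One way to sketch that idea (the helper and the score data below are invented for illustration): for each group, pick the lowest threshold that still achieves a target true positive rate. Equal target TPRs then generally produce different thresholds per group.

```python
import numpy as np

def threshold_for_tpr(scores, labels, target_tpr):
    """Smallest threshold whose True Positive Rate (recall) meets target_tpr.
    Calling this once per group equalizes TPR across groups, at the cost
    of group-dependent thresholds."""
    pos_scores = np.sort(scores[labels == 1])[::-1]  # positives, high to low
    # Accepting the top k true positives gives TPR = k / n
    k = int(np.ceil(target_tpr * len(pos_scores)))
    return pos_scores[k - 1]

# Two hypothetical groups with different score distributions
groups = {
    "group_a": (np.array([0.9, 0.8, 0.6, 0.4]), np.array([1, 1, 1, 0])),
    "group_b": (np.array([0.7, 0.5, 0.3, 0.2]), np.array([1, 1, 0, 1])),
}
thresholds = {g: threshold_for_tpr(s, y, target_tpr=2 / 3)
              for g, (s, y) in groups.items()}
print(thresholds)  # same 2/3 TPR for both groups, but different cutoffs
```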

This is an ethically thorny area that requires careful consideration. An alternative would be to pursue what is called a "group unaware" approach, wherein the thresholds are the same for all groups. Google argues that group-unaware approaches effectively discriminate against groups that collectively perform worse on the objective measurements, such as GPA or credit score.

The ideal that these techniques strive for is creating a situation where all users who qualify for a desirable outcome have equal chance of being correctly classified, irrespective of group membership. But, obviously, this is tricky. The takeaway here is that there are many ways of optimizing ML systems, and that training data can introduce biases if ML practitioners are not very conscientious and diligent about these issues.

And, it would be best if your facial recognition software did not "prefer" certain nationalities over others. That much, at least, everyone can agree on.

Finding Errors in your Dataset using Facets

Facets is an open source visualization tool developed at Google. It is intended to aid the development of data intuition by facilitating visualization of various aspects of complex datasets, some of which may have millions of data points, each with hundreds or thousands of features. Obviously, that level of complexity makes gaining a thorough understanding of the data very difficult.

Facets can highlight common data issues that can impede machine learning, such as high percentages of zeros, unexpected feature values, features with many missing values, or features with unbalanced distributions or distribution skew. It can visualize continuous as well as categorical and discrete features.
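Facets itself is a visualization tool, but the kinds of issues it surfaces can be approximated with a few lines of plain pandas. The sketch below (not the Facets API, and the toy dataset is invented) builds a per-feature summary of missing values, zero-heavy columns, and distribution skew:

```python
import numpy as np
import pandas as pd

def quality_report(df):
    """Per-feature summary of the issues Facets highlights: missing values,
    high percentages of zeros, and skewed numeric distributions."""
    report = pd.DataFrame({
        "pct_missing": df.isna().mean(),
        "pct_zero": (df == 0).mean(),
        "n_unique": df.nunique(),
    })
    # Skew is only defined for numeric columns; others get NaN
    report["skew"] = df.select_dtypes("number").skew()
    return report

# Toy dataset with a zero-heavy feature and a missing-heavy feature
df = pd.DataFrame({
    "income": [0, 0, 0, 52000, 61000, 0],
    "age": [34, np.nan, np.nan, 41, np.nan, 29],
    "group": ["A", "B", "A", "B", "A", "B"],
})
print(quality_report(df))
```

A report like this will not replace an interactive tool, but it catches the same headline problems (here, income is two-thirds zeros and age is half missing) before training begins.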

Here is a brief introduction to the tool.