Here at Civis, we build a lot of models. Most of the time we’re modeling people and their behavior because that’s what we’re particularly good at, but we’re hardly the only ones doing this. As we enter the age of “big data,” more and more industries are applying machine learning techniques to drive person-level decision-making. This comes with exciting opportunities, but it also introduces an ethical dilemma: when machine learning models make decisions that affect people’s lives, how can you be sure those decisions are fair?
A central challenge in building fair models is quantifying some notion of ‘fairness’. In the US, legal precedent establishes one particular definition, but this is also an area of active research. In fact, a substantial portion of the academic literature focuses on proposing new fairness metrics and proving that they have various desirable mathematical properties. This proliferation of metrics reflects the multifaceted nature of ‘fairness’: no single formula can capture all of its nuances. This is good news for academics (more nuance means more papers to publish), but for practitioners the multitude of options can be overwhelming. To address this, my colleagues and I focused on three questions that helped guide our thinking about the tradeoffs between different definitions of fairness.
For a given modeling task, do you care more about group-level fairness or individual-level fairness? Group fairness is the requirement that different groups of people should be treated the same on average. Individual fairness is the requirement that individuals who are similar should be treated similarly. These are both desirable, but in practice it’s usually not possible to optimize for both at the same time. In fact, in most cases it’s mathematically impossible. The debate around affirmative action in college admissions illustrates the conflict between individual and group fairness: group fairness stipulates that admission rates be equal across groups, such as those defined by gender or race, while individual fairness requires that each applicant be evaluated independently of any broader context.
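To make the tension above concrete, here is a minimal sketch that measures both notions on the same set of decisions. Everything in it is hypothetical: the toy applicant data, the function names, the 0.5 admission cutoff, and the 0.06 similarity radius are assumptions chosen for illustration, and the two measures are only simple stand-ins for the many definitions in the literature.

```python
import numpy as np

def selection_rate_gap(decisions, groups):
    """Largest difference in positive-decision rate between any two groups."""
    decisions = np.asarray(decisions, dtype=float)
    groups = np.asarray(groups)
    rates = [decisions[groups == g].mean() for g in np.unique(groups)]
    return max(rates) - min(rates)

def max_score_gap_for_similar_pairs(features, scores, radius):
    """Largest score difference among pairs whose feature distance is <= radius."""
    features = np.asarray(features, dtype=float)
    scores = np.asarray(scores, dtype=float)
    worst = 0.0
    for i in range(len(scores)):
        for j in range(i + 1, len(scores)):
            if np.linalg.norm(features[i] - features[j]) <= radius:
                worst = max(worst, abs(scores[i] - scores[j]))
    return worst

# Hypothetical toy data: one feature per applicant, two groups.
features = np.array([[0.90], [0.85], [0.40], [0.88], [0.45], [0.35]])
groups = np.array(["a", "a", "a", "b", "b", "b"])

# A hypothetical model that scores each applicant on the feature alone.
scores = features[:, 0]
decisions = (scores >= 0.5).astype(int)

group_gap = selection_rate_gap(decisions, groups)
individual_gap = max_score_gap_for_similar_pairs(features, scores, radius=0.06)

print(f"group-level gap in admission rates: {group_gap:.2f}")
print(f"largest score gap between similar applicants: {individual_gap:.2f}")
```

In this toy case similar applicants receive nearly identical scores, yet the two groups are admitted at different rates, so the decisions look fair by the individual measure and unfair by the group measure.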
Is the ground truth for whatever you are trying to model balanced between different groups? Many intuitive definitions of fairness assume that ground truth is balanced, but in real-world scenarios this is not always the case. For example, suppose you are building a model to predict the occurrence of breast cancer. Men and women have breast cancer at dramatically different rates, so a definition of fairness that required similar predictions between different groups would be ill-suited to this problem. Unfortunately, determining the balance of ground truth is generally hard because in most cases our only information about ground truth comes from datasets which may contain bias.
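The sketch below illustrates why an imbalanced ground truth matters. It uses made-up numbers (not real incidence rates) and a hypothetical perfect classifier, then compares two candidate fairness measures: the gap in positive prediction rates between groups and the gap in true positive rates. The variable and function names are assumptions for this example only.

```python
import numpy as np

# Hypothetical ground truth: the condition is more common in group "w" than in group "m".
groups = np.array(["w"] * 10 + ["m"] * 10)
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0,   # group w: 3 of 10 positive
                   1, 0, 0, 0, 0, 0, 0, 0, 0, 0])  # group m: 1 of 10 positive

# A hypothetical classifier that is right every time.
y_pred = y_true.copy()

def positive_prediction_rate(group):
    """Share of a group's members predicted positive."""
    return float(y_pred[groups == group].mean())

def true_positive_rate(group):
    """Share of a group's actual positives that the model catches."""
    mask = (groups == group) & (y_true == 1)
    return float(y_pred[mask].mean())

parity_gap = abs(positive_prediction_rate("w") - positive_prediction_rate("m"))
tpr_gap = abs(true_positive_rate("w") - true_positive_rate("m"))

print(f"gap in positive prediction rates: {parity_gap:.2f}")
print(f"gap in true positive rates: {tpr_gap:.2f}")
```

Even a classifier with no errors fails the "similar predictions across groups" test here, while a measure that conditions on the true outcome reports no gap. Which of the two is appropriate depends on whether you believe the imbalance in the labels reflects reality.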
Speaking of, what types of bias might be present in your data? In our thinking we focused on two types of bias that affect data generation: label bias and sample bias. Label bias occurs when the data-generating process systematically assigns labels differently for different groups.
For example, studies have shown that non-white kindergarten children are suspended at higher rates than white children for the same problem behavior, so a dataset of kindergarten disciplinary records might contain label bias. Accuracy is often a component of fairness definitions, but optimizing for accuracy when data labels are biased can perpetuate biased decision-making.
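A small simulation can show how label bias hides behind a good accuracy score. The data below are invented for illustration and are not drawn from the studies mentioned above: both groups exhibit the behavior at the same rate, but the recorded label is applied more often to group "b". The model is a hypothetical one that reproduces the recorded labels exactly.

```python
import numpy as np

groups = np.array(["a"] * 10 + ["b"] * 10)

# Underlying behavior (normally unobserved): 2 of 10 in each group.
behavior = np.array([1, 1] + [0] * 8 + [1, 1] + [0] * 8)

# Recorded labels: group "a" is labeled for half of those incidents, group "b" for all of them.
label = np.array([1, 0] + [0] * 8 + [1, 1] + [0] * 8)

# A hypothetical model that fits the recorded labels perfectly.
y_pred = label.copy()

accuracy_vs_labels = float((y_pred == label).mean())

def predicted_rate_given_behavior(group):
    """Share of a group's members with the behavior who are predicted positive."""
    mask = (groups == group) & (behavior == 1)
    return float(y_pred[mask].mean())

rate_a = predicted_rate_given_behavior("a")
rate_b = predicted_rate_given_behavior("b")

print(f"accuracy against recorded labels: {accuracy_vs_labels:.2f}")
print(f"predicted positive given same behavior: a={rate_a:.2f}, b={rate_b:.2f}")
```

The model scores perfectly against the labels it was given, yet for the same behavior it flags one group twice as often as the other. Accuracy cannot reveal the problem because it is measured against the biased labels themselves.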
Sample bias occurs when the data-generating process samples from different groups in different ways. For example, an analysis of New York City’s stop-and-frisk policy found that Black and Hispanic people were stopped by police at rates disproportionate to their share of the population (while controlling for geographic variation and estimated levels of crime participation). A dataset describing these stops would contain sample bias because the process by which data points are sampled is different for people of different races. Sample bias compromises the utility of accuracy as well as ratio-based comparisons, both of which are frequently used in definitions of algorithmic fairness.
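The following back-of-the-envelope sketch shows how sample bias can distort a ratio-based comparison. All numbers are invented and are not taken from the stop-and-frisk analysis: the two groups are assumed to have identical underlying rates, to be stopped at different rates, and, as a simplifying assumption, to be stopped independently of behavior within each group.

```python
# Hypothetical populations with the same underlying rate of the behavior of interest.
population = {"a": 1000, "b": 1000}
true_rate_pct = {"a": 10, "b": 10}

# Hypothetical sampling process: group "b" is stopped four times as often as group "a".
stop_rate_pct = {"a": 5, "b": 20}

# Number of records each group contributes to the dataset.
stops = {g: population[g] * stop_rate_pct[g] // 100 for g in population}

# Positive records per group, assuming stops are independent of behavior within a group.
positives_in_data = {g: stops[g] * true_rate_pct[g] // 100 for g in population}

# A naive ratio of positive counts between groups.
count_ratio = positives_in_data["b"] / positives_in_data["a"]

# The rate of positives within each group's sampled records.
hit_rate = {g: positives_in_data[g] / stops[g] for g in population}

print(f"records per group: {stops}")
print(f"positive records per group: {positives_in_data}")
print(f"ratio of positive counts (b / a): {count_ratio}")
print(f"hit rate within sampled records: {hit_rate}")
```

Group "b" contributes four times as many positive records even though the underlying rates are identical, so any comparison built on raw counts from this dataset would reflect the sampling process rather than the behavior.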
Based on our reading, my colleagues and I came up with three recommendations for our coworkers:
Whether ground truth is balanced between the different groups in your dataset is a central question, but there is often no way to know for sure one way or the other. Absent any external source of certainty, it is up to the person building the model to establish a prior belief about an acceptable level of imbalance between the model’s predictions for different groups.
To paraphrase a famous line, “All datasets are biased, some are useful.” It is usually impossible to know exactly what biases are present in a dataset, so the next best option is to think carefully about the process that generated the data. Is the dependent variable the result of a subjective decision? Could the sampling depend on some sensitive attribute? Failing to consider the bias in your datasets can at best lead to poorly performing models and at worst perpetuate a biased decision-making process.
There is an understandable temptation to assume that machine learning models are inherently fair because they make decisions based on math instead of messy human judgements. This is emphatically not the case: a model trained on biased data will produce biased results. It is up to the people building the model to ensure that its outputs are fair. Unfortunately, there is no silver-bullet measurement that is guaranteed to detect unfairness: choosing an appropriate definition of model fairness is task-specific and requires human judgement. This human intervention is especially important when a model can meaningfully affect people’s lives.
Machine learning is a powerful tool, and like any powerful tool it has the potential to be misused. The best defense against misuse is to keep a human in the loop, and it is incumbent on those of us who do this kind of thing for a living to accept that responsibility.