On privacy and algorithmic fairness of machine learning and artificial intelligence

As user data collected on an industrial scale continues to raise constant privacy concerns, the need to seriously address problems of privacy and data protection in data processing is more important than ever. Data is increasingly fed into machine learning models (i.e. “artificial intelligence” facilitating automated decision making), potentially raising many concerns, including whether the decisions made by such models are fair to users. Indeed, research indicates that machine learning models may leak the user data they were trained on, including personal data. Concerns over biased outputs and fairness (how I view “fairness” is discussed below) are also increasingly apparent, and they undoubtedly contribute to the mounting concerns over potential discrimination risks.

Deep learning and privacy

The pace of adoption of deep learning-based methods may also soon highlight the relation between privacy and data protection on the one hand, and fairness on the other - thanks to the rising popularity of differential privacy (originally introduced in this seminal work). Adopted for specific purposes by some of the biggest companies, the technique has become widely recognised in the industry. In response to identified issues, such as the risk of reconstructing user profiles, the US Census Bureau adopted it for the 2020 census. Indeed, when applied to the right problem, the technique can provide real benefits. But what is the nature of the method?

Differential privacy is actually not a one-size-fits-all method. “Differential privacy is not a product”. Rather, differential privacy is a statistical property that methods such as algorithms or protocols can satisfy. Satisfying the property ensures that data is processed in a way that puts stringent limits on the risk of data leakage. A process (i.e. an algorithm) is differentially private when its output is nearly indistinguishable whether or not any single individual’s data is included in the input. Differential privacy thus aims to prevent an adversary from drawing conclusions about an individual’s personal data based on the outcome of the algorithm; if differential privacy is applied properly, this becomes practically impossible. Differential privacy is a very strong method of privacy protection. It is the only existing model where privacy guarantees can be proven with mathematical precision. To put it simply, the technique allows learning useful information about a population without exposing data relating to individual users, in a way that makes inferring data about any particular individual next to impossible.
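For readers who want the formal statement behind this intuition, the standard definition of ε-differential privacy from the academic literature (paraphrased here, not a product-specific formulation) reads:

```latex
% A randomised mechanism M satisfies \varepsilon-differential privacy if,
% for every pair of datasets D and D' differing in the data of a single
% individual, and for every set S of possible outputs:
\[
  \Pr[\, M(D) \in S \,] \;\le\; e^{\varepsilon} \cdot \Pr[\, M(D') \in S \,]
\]
% The smaller \varepsilon, the less the output can depend on any one person's data.
```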

Privacy vs Accuracy

To satisfy the constraints of differential privacy, carefully calibrated noise is added during the computation of the result, in a way that meets the guarantees while preserving as much data utility as possible. This means that the algorithm’s outcome is not exactly precise. Assuming the technique is applied to the right problem and the dataset is big enough, the actual results can still be “good enough” in practice. Differential privacy is therefore about the trade-off between accuracy (data utility) and privacy (i.e. the risk of inference about users). Imagine a simple example of an algorithm providing data about the number of users with a specific attribute (e.g. income, gender, preferred emoji). For a more direct example, let’s say we publish statistics about the number of people at the European Data Protection Supervisor’s (EDPS) office who like cheese, and this number is 39. Let’s say that an employee joins the EDPS, and this statistic is then updated to 40. It becomes very simple to conclude that the new employee does in fact like cheese. A differentially private computation would, however, not output 39 in the first case and 40 in the second, but different numbers that include mathematically carefully tuned random noise; for example, the differentially private count could be 37 before the new person joins and 42 afterwards, making it impossible to conclude that any particular individual at the EDPS likes cheese - so although I like cheese, my individual contribution to this statistic is well hidden. Practical applications can be much more complex, of course.
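To make the cheese example concrete, here is a minimal sketch of the classic Laplace mechanism for such a counting query (an illustration only, not any organisation’s actual tooling); the noise scale is the query’s sensitivity (1, since one person changes the count by at most 1) divided by the privacy parameter epsilon:

```python
import numpy as np

def dp_count(true_count: int, epsilon: float, rng: np.random.Generator) -> float:
    """Return a differentially private count using the Laplace mechanism.

    For a counting query the sensitivity is 1 (adding or removing one person
    changes the result by at most 1), so noise is drawn from Laplace(0, 1/epsilon).
    """
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

rng = np.random.default_rng(seed=42)

# Hypothetical cheese statistics before and after a new colleague joins.
before = dp_count(39, epsilon=0.5, rng=rng)
after = dp_count(40, epsilon=0.5, rng=rng)
print(f"noisy count before: {before:.1f}, after: {after:.1f}")
```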

Differential privacy finds its use in many applications; it can also be applied to provide privacy guarantees for user data collected to train deep learning models (a subset of machine learning), giving rise to privacy-preserving learning. Privacy in machine learning is important, for example in light of research demonstrating the ability to recover images from facial recognition systems, or even personal data such as social security numbers, from trained models. But a focus on security, privacy and accuracy alone might not be enough.
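To give a flavour of how differential privacy is typically combined with deep learning, here is a highly simplified sketch of the DP-SGD idea (clip each example’s gradient, then add noise), written for a plain linear model; real deployments rely on dedicated libraries and careful privacy accounting, so treat this only as an illustration of the mechanism:

```python
import numpy as np

def dp_sgd_step(weights, X_batch, y_batch, lr=0.1,
                clip_norm=1.0, noise_multiplier=1.0, rng=None):
    """One DP-SGD step for linear regression with squared loss.

    Each example's gradient is clipped to `clip_norm` (bounding its influence),
    the clipped gradients are summed, Gaussian noise scaled to the clipping
    norm is added, and the result is averaged over the batch.
    """
    rng = rng or np.random.default_rng(0)
    clipped_grads = []
    for x, y in zip(X_batch, y_batch):
        g = 2 * (weights @ x - y) * x                    # per-example gradient
        g = g / max(1.0, np.linalg.norm(g) / clip_norm)  # clip to bound sensitivity
        clipped_grads.append(g)
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=weights.shape)
    noisy_grad = (np.sum(clipped_grads, axis=0) + noise) / len(X_batch)
    return weights - lr * noisy_grad
```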

The risks of biased data

It is increasingly apparent that biases in the data used to train machine learning models may be reflected, and even reinforced, in the predictive output, allowing biased input data to influence future decisions. This problem is especially acute when such machine learning applications result in unfair decisions for small subpopulations within otherwise larger datasets. In such cases, algorithms optimise their answers towards the larger, better represented groups. To put that in a practical context, instead of the rather academic notion of a “small subpopulation”, think of actual traits such as gender, medical conditions or disability. A generalising algorithm might inadvertently come to “prefer” the better represented data records. These concerns are important from the digital ethics point of view, making the issue of fairness an area of focus.

What is fairness?

But what is fairness in the first place? In this post, I do not consider fairness in the meaning of Article 8 of the Charter of Fundamental Rights or Article 5(1)(a) GDPR. Rather, I use fairness as a technical term. Machine learning is a mathematical discipline, so the notion of fairness must be defined within this realm. Many mathematical metrics of fairness exist, but discussing this fascinating area is outside the scope of this post - let’s just say that there is no standardised “ideal” metric of fairness. One particular metric to consider is equal opportunity, also known as equality of true positive rates. Under this metric, fairness means that decision outcomes are not conditioned on specific, perhaps hidden, traits in a way that treats protected, underrepresented groups unfavourably. In practice, one could imagine these groups being described in demographic terms, based on traits such as gender, age or disability. A concrete example is hiring decisions: among equally qualified candidates, the chances of being hired should be equal regardless of attributes such as gender or age.
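To show what such a metric looks like in practice, here is a small sketch (with invented data and group labels) that measures the gap in true positive rates between two groups; a large gap would indicate a violation of equal opportunity:

```python
import numpy as np

def true_positive_rate(y_true, y_pred):
    """Fraction of actual positives that the model predicts as positive."""
    positives = y_true == 1
    return np.mean(y_pred[positives] == 1)

def equal_opportunity_gap(y_true, y_pred, group):
    """Largest difference in true positive rates across groups."""
    rates = [true_positive_rate(y_true[group == g], y_pred[group == g])
             for g in np.unique(group)]
    return max(rates) - min(rates)

# Invented hiring example: y = 1 means qualified (y_true) or hired (y_pred).
y_true = np.array([1, 1, 1, 1, 0, 1, 1, 0, 1, 0])
y_pred = np.array([1, 1, 0, 1, 0, 1, 0, 0, 0, 0])
group  = np.array(["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"])
print(f"equal opportunity gap: {equal_opportunity_gap(y_true, y_pred, group):.2f}")
```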

When one has direct access to the data, methods to compensate for model unfairness can be devised and applied (a simple example is sketched below). But what if one does not actually have access to the data, as in the case of differentially private methods? The matter may become complicated.
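One simple pre-processing approach, sketched here under the assumption that the protected attribute is available at training time, is to reweight training examples so that underrepresented groups carry proportionally more weight; many machine learning libraries accept such per-sample weights in their training routines:

```python
import numpy as np

def inverse_frequency_weights(group: np.ndarray) -> np.ndarray:
    """Weight each example inversely to its group's size, so that every
    group contributes equally to the training loss."""
    values, counts = np.unique(group, return_counts=True)
    weight_per_group = {v: len(group) / (len(values) * c)
                        for v, c in zip(values, counts)}
    return np.array([weight_per_group[g] for g in group])

# Invented example: 8 records from group "A", 2 from group "B".
group = np.array(["A"] * 8 + ["B"] * 2)
print(inverse_frequency_weights(group))
# Group A records get weight 0.625, group B records get weight 2.5,
# so each group contributes the same total weight (5.0).
```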

Privacy vs Fairness?

But it turns out that in reality the matter is much more complicated, as pointed out by recent research highlighting an inherent relationship between privacy and fairness. In fact, it becomes apparent that guaranteeing fairness under differentially private AI model training is impossible when one also wants to maintain high accuracy. Such an incompatibility between data privacy and fairness would have significant consequences. Compared with the potential unfairness of some standard deep learning models, current differentially private learning methods fare even worse when it comes to fairness, reinforcing biases and being considerably less fair. Results like these should not come as a surprise to implementers and deployers of the technology: hiding the data of small groups is among the features of differential privacy. In other words, it is not a bug but a feature of differential privacy. However, this feature, which leads to a loss of precision for small groups, might not be desirable in all use cases.
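A back-of-the-envelope simulation (with invented numbers) illustrates why small groups are hit harder: the same amount of noise that barely perturbs a large group’s statistic can overwhelm a small group’s:

```python
import numpy as np

rng = np.random.default_rng(1)
epsilon = 0.1
noise_scale = 1.0 / epsilon  # Laplace scale for a counting query

for true_count in (10_000, 20):  # a large group vs a small group
    noisy = true_count + rng.laplace(0.0, noise_scale, size=10_000)
    rel_error = np.mean(np.abs(noisy - true_count)) / true_count
    print(f"true count {true_count:>6}: mean relative error {rel_error:.1%}")
# The absolute noise is identical in both cases, but relative to the group
# size it is negligible for the large group and substantial for the small one.
```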

The problem is fortunately gaining traction, and one may expect that further work will shed more light on the potential consequences and the possible impact on fairness. There are already works exploring the possible trade-offs further, such as the paper published in June 2019 (Differentially Private Fair Learning), which explores the trade-offs between privacy and fairness and introduces methods addressing some of the issues, though at a price: having access to some of the sensitive traits of the potentially affected small subgroups. But if those traits are truly sensitive (i.e. special categories of data), this would mean that the system in question would actually process such data.

How to tackle the problem

As with any system designed for specific use cases, all the trade-offs need to be carefully considered. Indeed, to deploy a differentially private system today, considering the use on a case-by-case basis might be the way to follow anyway, since differential privacy is not a technique for universal use (it is not a magical solution to all problems). When analysing the use case, experts and deployers should be wary of the full consequences for fundamental rights and freedoms with respect to the processing of personal data.

Specifically, we could envision:

  • Deployers considering problems holistically, understanding their own requirements as well as the needs of users. Considering the latest state of the art when improving privacy protection is already an essential privacy-by-design requirement. While we do not know what undesirable impact, if any, already deployed differentially private methods might have, it may be a good idea to include considerations such as fairness in the risk assessment when considering the impact on fundamental rights and freedoms.
  • Some cases, for example the processing of large datasets or the application of artificial intelligence methods in which the innovative use of differential privacy may be useful, may already warrant carrying out a data protection impact assessment (DPIA) anyway. In such cases, the DPIA would currently be the right place for a detailed technical analysis explaining the rationale behind the chosen methods and their configurations (including the configuration parameters).
  • When designing AI systems, assessing and understanding the algorithmic impact should be seen as beneficial; the DPIA is a good place for such considerations.

Summary

To summarise, when considering any complex system, many aspects need to be weighed, and this analysis is often done on a case-by-case basis. Differential privacy can help in the development of systems that trade off privacy against data utility with provable guarantees. But in some settings, such as the training of neural networks, whether differential privacy is used or not, other aspects also deserve attention. Exacerbating the unfair treatment of disadvantaged groups should not become a feature of modern technology. Fortunately, these challenges are starting to be explored. In the case of the potential implications of differentially private deep learning for fairness, the most important piece of initial work is underway: potential problems have been identified, and work should now continue to address them. In the meantime, big deployers of differentially private methods may consider it justified to explain whether the specific ways in which they use the technology might have an undesirable impact, if any at all. That would simply be a matter of transparency.

New, exciting technologies can help process data with respect for privacy. Differential privacy offers a significant improvement for privacy and data protection. It is still a nascent technology, and by its very nature its use warrants case-by-case analysis. This is fortunate, because organisations or institutions considering the use of differential privacy have the opportunity to see the big picture and address their actual, specific situation and needs. Improving the privacy and data protection of data processing with new technologies, and conforming to the latest state of the art, should be the standard. But it is worth keeping in mind what the overall implications may be. Where needed, explaining these technical considerations and the rationale behind them in a document such as a data protection impact assessment might be helpful.

This analysis first appeared publicly on the blog of the European Data Protection Supervisor (2019)