Privacy of London Tube Wifi Tracking

Users of public transportation are mainly interested in one thing: getting to the right place conveniently and fast. So am I. Public transportation systems around the world struggle to keep their systems as efficient as possible. Transport for London (TfL) is perhaps in the avant-garde here. They are at the forefront, coping with sharp edges and an ever-increasing commuter base. This is impressive. Believe me, I also want to get to South Kensington for an eclair (or five) as soon as possible, without unnecessary delays. But I also care about privacy and data protection.

TfL’s latest idea was to better understand their commuters’ mobility patterns using wifi tracking. These technologies are becoming ubiquitous at shopping malls (e.g. at BHV, Paris) or at airports. The current regulatory landscape essentially requires one thing: a visible notice that wifi tracking takes place, plus some assorted protections. This will change with GDPR and ePrivacy.

First of all, let me be clear that it is notoriously difficult to anonymize mobility/location data. There are ways of doing so, but these often need to be crafted for specific applications.

Generally speaking, every smartphone has a unique ID, the MAC address. The network infrastructure we use can see it when we’re using wifi networks, but also when the smartphone scans for available networks prior to connection. That’s how it’s possible for network infrastructures to know that a particular user is around. When you have many wifi access points, you can precisely track moving individuals. This is exactly the thing TfL tested, and plans to deploy for regular operation.
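To make the tracking idea concrete, here is a minimal sketch of how sightings of the same MAC address at several access points turn into a movement trace. The record layout, timestamps and station names are my own illustrative assumptions, not TfL's actual data format.

```python
from collections import defaultdict

# Hypothetical sightings: (mac, unix_time, access_point_location).
# Values are placeholders chosen for illustration only.
sightings = [
    ("79:28:fa:ee:60:a4", 1000, "Victoria"),
    ("02:00:00:aa:bb:cc", 1010, "Victoria"),
    ("79:28:fa:ee:60:a4", 1180, "Sloane Square"),
    ("79:28:fa:ee:60:a4", 1320, "South Kensington"),
    ("02:00:00:aa:bb:cc", 1400, "Pimlico"),
]

def trajectories(records):
    """Group sightings by device identifier and order each group by time."""
    by_device = defaultdict(list)
    for mac, ts, place in records:
        by_device[mac].append((ts, place))
    return {mac: [place for _, place in sorted(seen)]
            for mac, seen in by_device.items()}

paths = trajectories(sightings)
print(paths["79:28:fa:ee:60:a4"])
```

Any stable identifier works in place of the raw MAC here, which is why the choice of the processing function discussed later matters so much.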

To simplify, let’s just say that TfL was collecting the wifi MAC addresses of tube users’ smartphones. MAC addresses are private data (there are many ways this conclusion can be reached, here’s just one), and the sensitivity of course depends on the context.

I will make some points clear upfront.

One could see indications that TfL would not comply with GDPR.
I am also afraid that - despite what they attempt to say - they could comply with Subject Access Requests (the tool in Europe that lets anyone ask what a data controller holds on them; to be strengthened in GDPR).
I’ve seen their Data Protection Impact Assessment (DPIA; more about these here: 1, 2). In my opinion it does not exactly comply with GDPR. On the other hand, TfL says it complies with the (current) UK Data Protection Bill, but the UK Data Protection Bill does not use the name “data protection impact assessment” (in the UK, it’s more common to call it a privacy impact assessment), so I reckon TfL attempted to go with GDPR, but it all ended up kind of fuzzy.

Now to make some other matters clear upfront, too:

TfL did in fact do something to protect privacy (as opposed to “they did nothing”)
They even communicated that to some extent, although their privacy communication is inconsistent. At the very least, the consistent and visible lack of technical strictness in their final report is worrying. It should make the reader wonder if they really did what they say they did.
They conducted a privacy impact assessment (or as they say, a DPIA)! That’s good because UK law does not require them to do that. GDPR would definitely require TfL to conduct one.
Some facts indicate that their DPIA was a process, which is good.
I've seen their DPIA - they even held public consultations, which is, again, good (second view: they did not ask for independent advice - in deployments of this kind, that would be welcome to add credibility to their assessment; they attempted to do it all in-house).
They say they won’t share the data they got, which is good. I have concerns that they would not be able to guarantee data privacy. That said, I believe TfL is exploring ways of monetising user data through dynamic advertisements based on wifi tracking data. If this were done, we should expect more details to be sure that data is not shared.

The results of the pilot study are here. In the rest of this post, I’ll write about my concerns.

I’m involved in the business of conducting privacy reviews (some public examples: 1, 2, 3, ...) and impact assessments. Reviewing a DPIA conducted by an organisation - to validate it - is not an unusual engagement; it’s standard practice and should be understood as a good Privacy by Design practice.

This example is very interesting to me. Moreover, since TfL is probably the first large-scale public transportation system in the world to take advantage of new tracking technologies, it’s important to say some things. Those operators closely watching TfL should make sure they don’t make the same mistakes.

TfL’s data protection, or not?

First of all, TfL says they do not store raw data such as (to simplify):

MAC address, time, location

They say they do anonymise or pseudonymise (first point of TfL’s communication inconsistency) data, so that the following is stored:

function(MAC), time, location

The nature of the “function” is the key. Details aren’t known, but I can easily deduce them from what is public alone.

Encryption or Hash?

Raw MAC addresses are not stored. What is stored are identifiers “processed” with a special function.
TfL says data is "irreversibly encrypted". Even if the technical details of the encryption were made public, it would supposedly still be impossible to recover the MAC of an individual user. I believe them. But let’s put that aside. One way I can imagine that could provide “irreversibility” would be to employ a scheme taking advantage of a hash function.

But that’s not what TfL says they do (they say it’s encryption).

Is it a good idea to use encryption (ciphers) in pseudonymization schemes? It can absolutely be used. However, ciphers require encryption keys. So I could now go on and explain how one could deploy a pseudonymization scheme using encryption (or how to do better, using another approach), but I won’t.

I assume this is another (second) indication of TfL’s inconsistency and a worrying example of a lack of technical clarity. Instead of encryption, I assume that TfL indeed employed a scheme based on hash functions, since they speak of “irreversibility” and “salts”, but not “keys”.

They use salts

Consequently, I assume TfL used a scheme employing a non-keyed hash function with a salt.

Having a MAC such as 79:28:fa:ee:60:a4, a salt such as “fish”, and a hash function (I choose SHA-256) they would end up with the following identifier: 984568e0ab67a6f3a7d08b4e2cf34dec725ead4c451f4241dbb41effd1d28624.
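The construction described above can be sketched in a few lines. This is a minimal illustration, assuming the salt is simply prepended to the lowercase, colon-separated MAC string before hashing. How the salt and MAC are actually combined is not public, so the digest printed here need not match the identifier quoted above. The function name pseudonymise is my own.

```python
import hashlib

def pseudonymise(mac: str, salt: str) -> str:
    """Return the hex SHA-256 digest of salt + normalised MAC.

    The concatenation order and the MAC text format are assumptions
    made for illustration; the real scheme's details are not known.
    """
    normalised = mac.lower().replace("-", ":")
    return hashlib.sha256((salt + normalised).encode("utf-8")).hexdigest()

pseudonym = pseudonymise("79:28:fa:ee:60:a4", "fish")
print(pseudonym)
```

The important property is that the function is deterministic: the same MAC and the same salt always give the same identifier, which is what makes journey analysis possible in the first place.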

Salt is a good thing here - if their database got stolen or leaked and the hashes were unsalted, it would be pretty easy to recover the MACs. With the addition of a salt, it is slightly more difficult. That said, I could imagine ways of overcoming this in the specific case of TfL, but this post is devoted to something else.
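Why are unsalted hashes of MAC addresses easy to reverse? A MAC address is only 48 bits long, and the first 24 bits identify the vendor, so for a known vendor prefix there are just 2^24 candidates to try. The sketch below is an illustration under my own assumptions (MAC hashed as a lowercase text string, hypothetical helper names), and it fixes one more octet so that the demo runs quickly.

```python
import hashlib
from itertools import product

def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def recover_mac(target_hash: str, vendor_prefix: str, fixed_octet: str):
    """Try every value of the last two octets for a known prefix.

    A real attack would enumerate all three device octets (2**24
    candidates per vendor prefix); the fourth octet is fixed here
    only to keep the demonstration fast.
    """
    for a, b in product(range(256), repeat=2):
        candidate = f"{vendor_prefix}:{fixed_octet}:{a:02x}:{b:02x}"
        if sha256_hex(candidate) == target_hash:
            return candidate
    return None

leaked = sha256_hex("79:28:fa:ee:60:a4")  # an unsalted hash from a leaked table
print(recover_mac(leaked, "79:28:fa", "ee"))
```

With an unknown random salt mixed in, the same enumeration finds nothing, because the attacker no longer knows what exactly was hashed. Anyone who does know the salt can still run it.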

From TfL’s final report, I also assume that they used a constant salt, so for every MAC address of every commuter, the salt would be identical (“fish” in my example, in practice this should be a random value).

This is a valid pseudonymization scheme, albeit probably not the strongest I could imagine in their case. I would like to learn what threat model TfL has used when choosing their scheme, but having read their DPIA - I did not find it.

TfL scheme is not irreversible!

In fact, functionally, it is definitely reversible, or at least it is very easy for TfL to recover the MAC. Despite what TfL says in their report, they also give us all the information to conclude it is reversible (i.e. they likely said something is true and false at the same time). But to understand its strength we would need to know:

how the salt is generated
how other components of the scheme (i.e. “encryption”) work

This information should be included in the DPIA. Specifically, it should explain why the weak version of easily reversible pseudonymization was used (i.e. the salt was never rolled, only purged after the whole study), and what the threat model is. One rationale I could come up with is that TfL needed the data for their studies, but then the DPIA would be far better if this were written down in it, in line with the proportionality principle. It should also be weighed against their privacy threat models.
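To show what “rolling” the salt would change, here is a small sketch comparing a constant salt with one that is regenerated per period (say, daily). The RollingSalt class, the period labels and the 16-byte salt length are my own illustrative choices, not anything taken from TfL’s report.

```python
import hashlib
import secrets

def pseudonymise(mac: str, salt: bytes) -> str:
    return hashlib.sha256(salt + mac.encode("utf-8")).hexdigest()

class RollingSalt:
    """Keeps one random salt per period and forgets the previous one."""

    def __init__(self):
        self._period = None
        self._salt = None

    def salt_for(self, period: str) -> bytes:
        if period != self._period:
            # A new period begins: the old salt is overwritten and lost.
            self._period = period
            self._salt = secrets.token_bytes(16)
        return self._salt

mac = "79:28:fa:ee:60:a4"

# Constant salt: the same device keeps one identifier for the whole study.
constant = secrets.token_bytes(16)
same_across_days = pseudonymise(mac, constant) == pseudonymise(mac, constant)

# Rolling salt: identifiers match within a period, not across periods.
rolling = RollingSalt()
day1 = pseudonymise(mac, rolling.salt_for("day-1"))
day1_again = pseudonymise(mac, rolling.salt_for("day-1"))
day2 = pseudonymise(mac, rolling.salt_for("day-2"))
print(same_across_days, day1 == day1_again, day1 == day2)
```

The trade-off is obvious: a rolled salt limits how long a device can be followed and how long the mapping can be reversed, at the cost of losing multi-day analysis. That is exactly the kind of proportionality reasoning a DPIA should spell out.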

TfL’s words again

Let’s now cite TfL’s words from their report:

The salt is not known by any individual and was destroyed on the day the data collection ended. Therefore, we consider the data to be anonymous and are unable to identify any specific device.

In other words:

TfL says the data is anonymized, which is pretty much in direct contradiction with their own words (from the same page 22 of the report) that it was pseudonymized!
The salt they used was supposedly not known (how about accessed, or generated?) to any individual, which is hard to believe because someone had to technically design, implement and deploy the system. Should we believe that eyes were closed? It’s impossible to know without the knowledge of salt generation details.

Now that we know all this, let’s think about whether TfL really could not comply with Subject Access Requests (they declined a few).

Summing some of the points:

TfL says their scheme is anonymisation and pseudonymisation, so it’s not anonymisation
TfL says the scheme is irreversible, and - at the same time - that it’s not irreversible

Subject Access Request (SAR)

TfL says they were unable to comply with SARs because they don’t have the data:

As we cannot process known MAC addresses in the same manner as we did in the pilot, we are unable to complete any Subject Access Request for the data we collected.

My understanding is they maintained the same stance during the experiment, and when they held the data. TfL has refused to comply with a SAR and maintained that in a reply to a number of FOI requests (as well as for “logics of processing” requests). But we should now be able to judge whether TfL could or could not comply (reasonably, feasibly).

TfL has had the data
TfL had full control over the technical infrastructure
TfL had full knowledge of the used pseudonymization scheme
TfL has even admitted publicly that the “salt” they used was removed after data collection

I leave the reader to conclude whether TfL was unable - or in fact was able - to reverse the scheme.
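For illustration only, this is roughly what answering a SAR would involve under the salted-hash scheme I assumed above, for as long as the salt still exists: take the MAC supplied by the requester, apply the same function, and look up the matching rows. The storage layout, the placeholder salt and the function names are hypothetical.

```python
import hashlib

def pseudonymise(mac: str, salt: str) -> str:
    # Assumed construction: SHA-256 over salt + lowercase MAC string.
    return hashlib.sha256((salt + mac.lower()).encode("utf-8")).hexdigest()

SALT = "fish"  # placeholder; in practice a random value held by the operator

# Hypothetical stored rows: (pseudonym, unix_time, location).
stored = [
    (pseudonymise("79:28:fa:ee:60:a4", SALT), 1000, "Victoria"),
    (pseudonymise("02:00:00:aa:bb:cc", SALT), 1010, "Victoria"),
    (pseudonymise("79:28:fa:ee:60:a4", SALT), 1320, "South Kensington"),
]

def answer_sar(requester_mac: str, salt: str, records):
    """Return the (time, location) rows belonging to the requester's device."""
    wanted = pseudonymise(requester_mac, salt)
    return [(ts, place) for pid, ts, place in records if pid == wanted]

result = answer_sar("79:28:FA:EE:60:A4", SALT, stored)
print(result)
```

Note that no “decryption” of the stored identifiers is needed; it is enough to recompute the identifier for one known MAC.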

Summary

London is an example to the world. So is their public transportation system. It’s easy to imagine operators of public transportation systems in other cities or countries would follow TfL’s lead in using new ways of understanding how their customers use the tube network. That’s why it is good that they would also learn from TfL about the importance of privacy protection.

What is less good is that they may learn not exactly the right things. It seems that the privacy protection section in TfL’s report was written by a person who did not exactly know what he was speaking of (that’s a justified impression). Indeed it can undermine the security and privacy protection guarantees TfL says were in place - or give such an impression. Privacy-conscious customers of TfL now have good reasons to worry.

Furthermore, I point out that some technical measures that TfL says they have/had are inadequate, and that they could, in fact, recover the data they (in some places) said were anonymous.

Lastly, I would be happy if TfL could disclose the details of the technical function they use to process the data. According to their public communication, the scheme they use is strong enough and won’t be affected by making it public.

I would say that I challenge TfL to disclose the details, but this is not my role.

I would also like to hope that TfL will deploy a meaningful opt-out - maybe even better, an opt-in? - scheme that would not simply require users to turn off wifi or their phones. If TfL wants to increase customer trust, making the DPIA with technical details available to the public would be a good step. Actually, a DPIA is also required for the new deployment, especially if the system is to work in a continuous manner.

Finally, let’s hope that the ePrivacy Regulation, currently in the works in the European Union, will provide adequate protection (more on this here, here and here).
