Organizations that voluntarily create big public data breaches are rare. Recently it became widely known that Public Transport Victoria (PTV) published a dataset covering possibly over 15 million users. The data was “anonymized”, yet PTV may still face a $336,000 data protection fine. How did this happen?

Data Science Melbourne organizes an annual data science hackathon devoted to finding innovative uses of data. In 2018 PTV provided real, longitudinal data on the use of the “myki” transport card (Londoners: think Oyster card). The data consisted of “touch on / touch off” events with high granularity (one-second timestamps) and contained location information (vehicle IDs, route/stop numbers), a unique identifier substituted for the actual myki card number (which allowed all trips made with a single card to be linked together), card type, and so on.
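To make the granularity concrete, a single released record plausibly looked something like the sketch below (the field names are illustrative assumptions, not the actual schema):

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class TouchEvent:
    """Illustrative shape of one 'touch on'/'touch off' record (assumed field names)."""
    card_pseudonym: str   # unique ID substituted for the real myki card number
    card_type: str        # e.g. full fare, concession
    event_type: str       # "touch_on" or "touch_off"
    timestamp: datetime   # one-second accuracy
    vehicle_id: str       # tram/bus/train vehicle identifier
    route_id: str         # route number
    stop_id: str          # stop or station identifier

# Example record (made up for illustration)
event = TouchEvent(
    card_pseudonym="A1B2C3D4",
    card_type="full_fare",
    event_type="touch_on",
    timestamp=datetime(2017, 6, 12, 8, 3, 41),
    vehicle_id="T-3021",
    route_id="96",
    stop_id="stop_2145",
)
```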

What went wrong and lessons learned

  1. Participants of the hackathon were told they could use the data in any way they pleased (no NDA was signed; one participant even posted the dataset online later on). This should not happen.

  2. The dataset was stored in an unprotected, public AWS S3 bucket; anyone could download it. This should not happen, but it is a common pitfall (a minimal remediation sketch follows this list).

  3. The protection method, replacing the unique “myki card id” with a different unique number, was unsurprisingly found to be flawed. It could not be sufficient: the location data is still present and conveys exactly the same information as before. As a result, it was possible to reidentify travelers, co-travelers, and even local politicians (members of the Victorian parliament), for example using auxiliary information (a toy demonstration follows this list). The research itself comes from respected researchers and offers valuable insight. Find it here.

  4. To validate the sharing of the allegedly “anonymized” data, PTV first conducted a privacy impact assessment (PIA; in Europe now often called a data protection impact assessment). This exercise was unfortunately severely flawed, yet it was the basis for sharing the data. The PIA was conducted by an internal “data scientist” (not a privacy expert) and then happily signed off by the CIO and the legal team as a formality. No subject matter experts were consulted. Furthermore, no internal deidentification policy even existed.

  5. The PIA was found to be a mere checklist exercise, another flaw. Following this pattern significantly increases the risk of shooting yourself in the foot; especially for unorthodox uses of data, a custom-designed assessment process should be followed.
    Specifically, the PIA document said: “No personal information capable of identifying [an] individual will be disclosed”. This created a false sense of the level of privacy guarantees, one that was later seen to persist inside the organization, including during the investigation. Fortunately, the flaw was observed by the local data protection authority.
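For point 2, the standard remediation is to block public access at the bucket level and hand out time-limited links instead of public objects. A minimal sketch using boto3, assuming a hypothetical bucket name, a hypothetical object key, and valid AWS credentials:

```python
import boto3

s3 = boto3.client("s3")
bucket = "hackathon-data"  # hypothetical bucket name

# Block all forms of public access (ACLs and bucket policies) at the bucket level.
s3.put_public_access_block(
    Bucket=bucket,
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": True,
        "RestrictPublicBuckets": True,
    },
)

# Share the file with participants via a short-lived presigned URL instead of a public object.
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": bucket, "Key": "myki_extract.csv"},  # hypothetical object key
    ExpiresIn=3600,  # link valid for one hour
)
print(url)
```

Presigned URLs expire, so even if a link leaks, the dataset is not left permanently exposed.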
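Why point 3 fails is easy to demonstrate: even after the card number is swapped for a new identifier, a handful of trips known from elsewhere (a tweet, travelling together once) is typically enough to single out one pseudonym, because fine-grained location traces are close to unique. A minimal sketch of the idea in pandas, on made-up data:

```python
import pandas as pd

# Toy "anonymized" release: card numbers replaced by pseudonyms,
# but full (stop, timestamp) traces are kept.
events = pd.DataFrame([
    {"card_pseudonym": "P1", "stop_id": "stop_12", "timestamp": "2017-06-12 08:03"},
    {"card_pseudonym": "P1", "stop_id": "stop_45", "timestamp": "2017-06-12 17:32"},
    {"card_pseudonym": "P2", "stop_id": "stop_12", "timestamp": "2017-06-12 08:03"},
    {"card_pseudonym": "P2", "stop_id": "stop_07", "timestamp": "2017-06-13 09:15"},
    {"card_pseudonym": "P3", "stop_id": "stop_99", "timestamp": "2017-06-12 08:03"},
])

# Auxiliary knowledge about a target person: two trips we happen to know about.
known_trips = [
    ("stop_12", "2017-06-12 08:03"),
    ("stop_45", "2017-06-12 17:32"),
]

# Keep only pseudonyms whose trace contains every known (stop, time) observation.
candidates = None
for stop, ts in known_trips:
    match = set(events[(events.stop_id == stop) & (events.timestamp == ts)].card_pseudonym)
    candidates = match if candidates is None else candidates & match

print(candidates)  # {'P1'} -- the target is singled out, and their full travel history is exposed
```

This mirrors, in miniature, the use of auxiliary information described in the research.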

Although it is difficult to express the privacy loss in monetary units, PTV now risks a US$336,000 data protection fine (6,000 times the penalty unit rate of 165.22 Australian dollars, as per the investigation report).

Summary

Anonymizing location data is hard. If you absolutely need to do it, consult someone knowledgeable first.

Privacy impact assessments should not be reduced to fixed templates. They should be rigorous, technical analyses.

But there is also another, less pleasant, potential issue. If the decision to share data has already been made on business grounds, the PIA may become a mere formality: the result is known in advance, so there may be a temptation to arrive at the “right” PIA outcome (“no risk”). If this is the case, no methodology will save you.

PS. This is also a good example of why banning reidentification research is a bad idea.

PPS. Did you like my work? Do you have comments, or are you perhaps interested in another type of analysis? me@lukaszolejnik.com.