Jigsaw “attack”

This should probably be “jigsaw re-identification”, since “jigsaw attack” is now more often associated with a particular computer malware attack.

What I am talking about is the possibility of identifying one or more individuals’ data within a database that its owners had presumed was safely pseudonymised. The term comes from the way that, as you put different pieces of a jigsaw together, the overall picture emerges.

Details #

“Jigsaw identification” is well recognised in law, at least in the UK. See a good online article here if you want that angle on this (I do recommend reading beyond the short initial legal/historical block).

The principle is that even when data have been carefully pseudonymised it may still be possible to identify someone if you have some outside knowledge about the data. This is why I don’t routinely make any of the datasets arising from clinical or therapy situations open. If someone knew that a person of very short height attended the service, then having height data in the dataset (very normal if the service is working with people with eating disorders) would probably be enough to identify that one person from the dataset. If that person has a public profile and wants to keep their engagement with the service private, then my releasing the dataset breaches their right to confidentiality.

I sometimes use the (extremely probable) identification of the skeleton of King Richard III (of England) as a rather wild but memorable example of jigsaw identification: it needed the date (roughly) of his death, its rough location and the fact that he had a fairly marked scoliosis (“hunchback” in his case) to make the identification probable (see https://le.ac.uk/richard-iii/identification for a lot of detail on this).
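The height example can be sketched in a few lines of Python. Everything here is invented for illustration: the records, the pseudonymised IDs, the 145 cm threshold and the hypothetical `candidates` helper are all assumptions, not taken from any real dataset.

```python
# Invented, pseudonymised records: the IDs reveal nothing on their own.
records = [
    {"id": "a91f", "height_cm": 162, "score": 1.8},
    {"id": "c07d", "height_cm": 171, "score": 2.4},
    {"id": "e5b2", "height_cm": 139, "score": 2.9},
    {"id": "f3a8", "height_cm": 168, "score": 1.1},
]


def candidates(rows, max_height_cm):
    """Return the rows consistent with the outside knowledge 'very short'."""
    return [row for row in rows if row["height_cm"] <= max_height_cm]


# Outside knowledge (assumed): the person is known to be under about 145 cm.
matches = candidates(records, max_height_cm=145)

if len(matches) == 1:
    # Only one row fits, so that row's other values are now attributable
    # to a named person even though the ID itself was pseudonymised.
    print("Unique match: pseudonym", matches[0]["id"], "score", matches[0]["score"])
else:
    print(len(matches), "rows fit; no unique identification from height alone")
```

The point of the sketch is that the hashed or pseudonymised ID plays no part in the identification: the single unusual height value is the jigsaw piece that does the work.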

See also https://www.theguardian.com/science/2026/mar/14/confidential-health-records-exposed-online-uk-biobank for what strikes me as a depressing example of a highly funded research project being incredibly naïve about the potential for deanonymisation of research data in the modern era, in which so much data about us is available on the internet (legally or not) and can be “jigsawed” together to identify us.

Paying attention to the “smallest identical subset >= 5” rule when considering the risks of sharing a dataset you own gives some protection against jigsaw re-identification.
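As a rough sketch of how such a check might be automated, the Python below counts how many rows share each combination of values on the potentially identifying variables and reports the size of the smallest group. It assumes the rule means that every such combination should be shared by at least five rows. The function name `smallest_identical_subset`, the variable names and the data are all invented for illustration.

```python
from collections import Counter


def smallest_identical_subset(rows, identifying_vars):
    """Size of the smallest group of rows with identical values on the chosen variables."""
    groups = Counter(tuple(row[var] for var in identifying_vars) for row in rows)
    # An empty dataset has no groups at all, so report 0 rather than raising.
    return min(groups.values()) if groups else 0


# Invented data: 5 rows, 6 rows and 2 rows in three gender/age-band combinations.
rows = (
    [{"gender": "F", "age_band": "20-29"}] * 5
    + [{"gender": "M", "age_band": "20-29"}] * 6
    + [{"gender": "F", "age_band": "60-69"}] * 2
)

smallest = smallest_identical_subset(rows, ["gender", "age_band"])
safe_to_share = smallest >= 5

print("Smallest identical subset:", smallest)
print("Meets the >= 5 rule:", safe_to_share)
```

Here the two women aged 60 to 69 form a subset of only two, so this invented dataset would fail the rule until that group was merged into a broader category or those variables were dropped. Which variables count as potentially identifying is a judgement call that code cannot make for you.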

Try also #

Anonymity & anonymisation
Confidentiality
Data protection
Dataset
Open data
Pseudonymisation
Re-identification
Research ethics
Smallest identical subset

Chapters #

Not covered in the OMbook.

Online resources #

One of my Shiny apps, https://shiny.psyctc.org/apps/Hashing_IDs/, will hash pretty much any set of ID codes for you. However, as you can see, that’s only part of good data protection.
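For readers who want to see what hashing ID codes looks like in code, here is a minimal Python sketch. It is not how the app above works: it is one common approach (a keyed SHA-256 hash via the standard library `hmac` module), and the `hash_id` function, the secret key and the ID codes are all invented for illustration.

```python
import hashlib
import hmac


def hash_id(raw_id: str, secret_key: bytes) -> str:
    """Return a keyed SHA-256 hash of an ID code as a hexadecimal string."""
    return hmac.new(secret_key, raw_id.encode("utf-8"), hashlib.sha256).hexdigest()


# Assumption: the key is kept private by the data owner and never shared
# with the dataset; without it the hashes cannot be recomputed from the IDs.
secret_key = b"replace-with-a-long-random-secret"

raw_ids = ["P001", "P002", "P003"]
hashed_ids = [hash_id(raw_id, secret_key) for raw_id in raw_ids]

for raw_id, hashed in zip(raw_ids, hashed_ids):
    print(raw_id, "->", hashed)
```

The same ID always gives the same hash, so records can still be linked across tables, but the hash only protects the ID itself. It does nothing about other values in the row, such as an unusual height, that could be jigsawed together with outside knowledge.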