Smallest identical subset

This has become an important idea in relation to confidentiality, anonymity and pseudonymisation of data.

Details #

The basic idea is that in an anonymised dataset, perhaps containing a number of demographic variables, the smallest identical subset is exactly what it says: the smallest group of people who all share exactly the same values on every variable in the anonymised data. What matters is the size of that group. If it contains only one person, it is in principle possible for that person to be identified. I give the silly and very culturally located example of a dataset of medieval people where the fields included:
nobility: Y/N
gender: M/F (it probably would have been binary)
physical disability: Y/N
nature of disability: free text
The smallest identical subset might be well above one for the first three variables, but add “hunchback” as the value of the last variable and you are probably down to n = 1 and able to map the other data (Machiavellianism score?) to Richard III (2 October 1452 – 22 August 1485)!

It’s a silly example and I’m sure much more 21st century ones can be created, but it illustrates the issue. It also shows how free text values can rapidly reduce the size of the smallest identical subset, often to just one, sometimes even when very few variables are stored.
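To make the counting concrete, here is a minimal sketch in Python of how the size of the smallest identical subset could be found for a small dataset. The function name `smallest_identical_subset`, the field names and the six toy records are all invented for illustration. It simply counts how many records share each combination of values and reports the smallest count.

```python
from collections import Counter

def smallest_identical_subset(records, fields):
    """Return (size, values) for the smallest group of records that
    share identical values on every one of the given fields."""
    # Count how many records carry each distinct combination of values.
    counts = Counter(tuple(r[f] for f in fields) for r in records)
    values, size = min(counts.items(), key=lambda kv: kv[1])
    return size, values

# Hypothetical toy data, invented purely for illustration.
records = [
    {"nobility": "Y", "gender": "M", "disability": "N", "nature": ""},
    {"nobility": "Y", "gender": "M", "disability": "N", "nature": ""},
    {"nobility": "Y", "gender": "M", "disability": "Y", "nature": "hunchback"},
    {"nobility": "Y", "gender": "M", "disability": "Y", "nature": "lame"},
    {"nobility": "N", "gender": "F", "disability": "N", "nature": ""},
    {"nobility": "N", "gender": "F", "disability": "N", "nature": ""},
]

size_three, _ = smallest_identical_subset(
    records, ["nobility", "gender", "disability"]
)
size_four, values_four = smallest_identical_subset(
    records, ["nobility", "gender", "disability", "nature"]
)
print(size_three)  # 2: every combination of the first three fields occurs twice
print(size_four)   # 1: the free text field splits a group into single records
```

In this toy data the first three fields never single anyone out, but adding the free text field immediately produces groups of one.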

A common rule of thumb is to keep the size of the smallest identical subset at five or larger if datasets are public. However, even then, a complicating issue is how much collateral information someone may have. For example, staff, ex-staff or former fellow clients may be able to combine a smallest identical subset with other information they have in order to identify, say, a public figure’s data in a publicly shared routine clinical dataset.
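The rule of thumb can be checked mechanically. The sketch below lists every combination of values shared by fewer than five records. The function name `groups_below_threshold`, the default `k=5` and the toy records are assumptions made for illustration. It says nothing about collateral information, which no count of this kind can capture.

```python
from collections import Counter

def groups_below_threshold(records, fields, k=5):
    """Map each combination of values shared by fewer than k records
    to the number of records that share it."""
    counts = Counter(tuple(r[f] for f in fields) for r in records)
    return {combo: n for combo, n in counts.items() if n < k}

# Hypothetical toy data: six clients in one group, two in another.
records = (
    [{"age_band": "30-39", "gender": "F"}] * 6
    + [{"age_band": "80-89", "gender": "M"}] * 2
)

risky = groups_below_threshold(records, ["age_band", "gender"], k=5)
print(risky)  # {('80-89', 'M'): 2}
```

An empty result would mean that every identical subset has at least `k` members on the chosen fields. Any combination that is returned might need to be recoded into broader categories or suppressed before the data are shared.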

With large, multi-variable datasets, it can be a significant computational challenge, and way beyond human wetware, to identify the smallest identical subsets in data. There is computer software to take on the task though. I’ll link to some examples at a later date.

Try also #

Anonymity & anonymisation
Confidentiality
Dataset
De-anonymisation (points to anonymisation!)
Open data
Pseudonymisation
Re-identification
Research ethics

Chapters #

The general issues of the ethics of managing data well are mentioned in various chapters but these specifics were way beyond the scope of the book … but so important that I am trying to put a good collection of pertinent terms into this glossary.

Online resources #

None currently

Dates #

First created 24.xi.23.
