Pseudonymisation

Pseudonymisation is one part of thorough anonymisation of data. It is basically the same as using a pseudonym, but there is a bit more to it than that for datasets that may be shared (“open data”) or that may be entirely public.

Details #

Generally, in qualitative case reports, as in news reportage, people are given a name with a footnote saying that the name, and perhaps other details, have been changed to protect the confidentiality of the person described. Changing the name is literally pseudonymising the report. However, the term is more associated with quantitative reports where data from multiple participants are analysed, presented and perhaps made available to others as open data. Clearly the names of the participants are removed, as are dates of birth, addresses, telephone numbers and email addresses. However, pseudonymisation of quantitative data is a term particularly associated with the situation in which there might be multiple lines of data per participant, say repeated completions of measures. In such situations it is vital for the data analysis that an individual’s data can be tracked across the different records and never gets muddled up with data from another person.

In this situation, often with hundreds or even millions of participants, no-one makes up plausible, perhaps gender appropriate, pseudonyms for each participant, and we know that names aren’t unique to individuals (in my 20s I discovered, from a letter advising me that my National Insurance number had changed, that there were two people named Christopher David Evans born in the UK on 25.ii.57!). What is done is to make sure that each participant has some ID code unique to them, and then usually to replace that ID code with a “hash code” on a 1:1 basis: each participant has only one hash code instead of their name/ID, the same hash code is never given to more than one participant, and the original ID codes never go outside storage controlled by the researchers.
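As a rough illustration of replacing ID codes with hash codes, here is a minimal Python sketch. It is not the method used for the example further down: it assumes a keyed hash (HMAC with SHA-256) from Python's standard library, and the key, the IDs and the scores are all hypothetical. With a hash like this, two participants sharing a code is so improbable that it can be treated as 1:1 in practice.

```python
import hashlib
import hmac

# Hypothetical secret key: in real use it would be long, random and kept
# only in storage controlled by the researchers.
SECRET_KEY = b"replace-with-a-long-random-secret"


def pseudonymise(participant_id: str, key: bytes = SECRET_KEY) -> str:
    """Return a keyed SHA-256 hash (HMAC) of an ID as 64 hex characters."""
    return hmac.new(key, participant_id.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical records: (participant ID, occasion, score), with one
# participant completing the measure twice.
records = [
    ("P001", 1, 23),
    ("P002", 1, 31),
    ("P001", 2, 18),
]

# Swap each ID for its hash code; the original IDs do not appear in the
# version that would be shared.
shared = [(pseudonymise(pid), occasion, score) for pid, occasion, score in records]

for row in shared:
    print(row)
```

The first and third rows carry the same hash code, so that participant's repeated completions can still be linked, while the second row carries a different one.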

A good hash code, something at the heart of modern cryptography as well as good data management, is not just 1:1 but also “one way”, that is to say that it is practically impossible, given the hash code, to decode it back to the name/ID that created it.

Here is an example. Suppose we had a set of names that started:
“Chris Evans”
“Jo-anne Carlyle”
then hash coding those might result in two lines of 512 characters:
“0cfc0dd72718d366aea003dbc738c6557e280121b57cf1182d9815a4708f369e6537e2a79948ac2de90165492987f441ac1b719052b6f36b8eab650b6f1950bbb19d062a004ad70c7b9dabd08635376cc9…”
and
“88f6305230531ac5c6a9b87dfcc9a60c690e877497295ed00dd67deb391438b06ff4ae6721f59d5f685f72a1bc0dfffd17e20ca9eee77080543c8a3f5645bf5926af80762591821e23f7fb079cee0f4d49…”
where the “…” indicate that the characters continue to the full 512 characters.
It’s essentially impossible to get back from those 512-character hash codes to the names without knowing the two “keys” and the password that were used in that encoding.

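This is strong pseudonymisation. This example used RSA encoding, but there are a number of good, strong ways to do this, and there are also bad systems that can be decoded given enough data, computer power and hacking know-how. Strictly, RSA is keyed encryption rather than a one-way hash, so anyone holding the keys and the password can reverse it, which is why those must stay in storage controlled by the researchers.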
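One way a weak system can be decoded can be sketched in Python. This assumes a release that hashed names with plain SHA-256 and no secret key, and an attacker who has a list of candidate names; the candidate list is hypothetical and the function name is made up for illustration.

```python
import hashlib


def unkeyed_hash(name: str) -> str:
    """Plain SHA-256 with no secret key: anyone can recompute it."""
    return hashlib.sha256(name.encode("utf-8")).hexdigest()


# What a careless release might contain: hashes of names, no key involved.
released = [unkeyed_hash("Chris Evans"), unkeyed_hash("Jo-anne Carlyle")]

# The attacker hashes every candidate name (hypothetical list) and looks
# for matches with the released codes.
candidates = ["Alex Smith", "Chris Evans", "Jo-anne Carlyle", "Sam Jones"]
lookup = {unkeyed_hash(name): name for name in candidates}

# Names come back for every released code that matches a candidate.
recovered = [lookup.get(code) for code in released]
print(recovered)
```

Nothing is "decoded" here in the mathematical sense: the attacker simply guesses inputs and compares outputs, which is why a secret key that never leaves the researchers' storage matters.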

Pseudonymisation is only one part of good protection of confidentiality: for more on this, see “smallest identical subset”, “re-identification” and “jigsaw re-identification”. You might want to look at https://www.theguardian.com/science/2026/mar/14/confidential-health-records-exposed-online-uk-biobank for a depressing example of a highly funded research project being incredibly naïve about the potential for deanonymisation of research data.

Try also #

Confidentiality
Data protection
Open data
Smallest identical subset

Chapters #

Not covered explicitly in the OMbook.

Online resources #

None currently, nor are any likely from me: I only use tools from experts to do these things and wouldn’t want to risk messing up those tools by some mistake of my own!