Pseudonymisation

Part of thorough anonymisation of data. It is basically the same as using a pseudonym, but there’s a bit more to it than that for datasets that may be shared (“open data”) or made entirely public.

Details #

Generally, in qualitative case reports the person is given a name, with a footnote saying that the name, and perhaps other details, have been changed to protect the confidentiality of the person described. Changing the name is literally pseudonymising the report, but the term is more associated with quantitative reports where multiple people’s data is presented and perhaps made available to others as open data or even as public data. Clearly the names of the participants must be removed. However, there may be multiple lines of data per participant, in which case it’s vital that the data from one individual never gets muddled up with data from another person, as it would be if the names column, or the primary ID column, were simply deleted. In this situation no-one makes up plausible, perhaps gender appropriate, pseudonyms for each participant. Instead the names are replaced with a “hash code” on a 1:1 basis, so each participant has only one hash code instead of their name/ID and the same hash code is never given to more than one participant.
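The 1:1 replacement described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a recommended tool: the function name `pseudonymise_ids`, the `name` and `score` fields and the toy rows are all hypothetical, and it uses a random lookup table rather than a true hash code, simply to show that every row from one participant keeps the same code and no two participants share a code.

```python
import secrets


def pseudonymise_ids(rows, id_field="name"):
    """Replace id_field in every row with a random code, 1:1 per participant.

    Hypothetical helper for illustration: returns the new rows and the
    lookup table (which would itself have to be kept secret or destroyed).
    """
    codes = {}      # original name/ID -> code
    used = set()    # codes already handed out, to guarantee uniqueness
    new_rows = []
    for row in rows:
        original = row[id_field]
        if original not in codes:
            code = secrets.token_hex(16)
            while code in used:  # a collision is vanishingly unlikely, but check anyway
                code = secrets.token_hex(16)
            codes[original] = code
            used.add(code)
        new_row = dict(row)      # copy so the input rows are not modified
        new_row[id_field] = codes[original]
        new_rows.append(new_row)
    return new_rows, codes


# Toy data: two participants, one of whom has two lines of data.
rows = [
    {"name": "Chris Evans", "score": 3},
    {"name": "Jo-anne Carlyle", "score": 5},
    {"name": "Chris Evans", "score": 4},
]
new_rows, lookup = pseudonymise_ids(rows)
```

After this, the first and third rows carry the same code, so that participant’s two lines of data can still be linked, which would not be possible if the names column had simply been deleted.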

A good hash code, something at the heart of modern cryptography as well as good data management, is not just 1:1 but also “one way”: that is to say, it is practically impossible, given the hash code, to decode it back to the name/ID that created it.

Here is an example. Suppose we had a set of names that started:
“Chris Evans”
“Jo-anne Carlyle”
then hash coding those might result in two lines of 512 characters:
“0cfc0dd72718d366aea003dbc738c6557e280121b57cf1182d9815a4708f369e6537e2a79948ac2de90165492987f441ac1b719052b6f36b8eab650b6f1950bbb19d062a004ad70c7b9dabd08635376cc9…”
and
“88f6305230531ac5c6a9b87dfcc9a60c690e877497295ed00dd67deb391438b06ff4ae6721f59d5f685f72a1bc0dfffd17e20ca9eee77080543c8a3f5645bf5926af80762591821e23f7fb079cee0f4d49…”
where the “…” indicates that the characters continue to the full 512 characters.
It’s essentially impossible to get back from those 512-character hash codes to the names without knowing the two “keys” and a password that were used in that encoding.

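This is strong pseudonymisation. This example used RSA encoding, but there are a number of good, strong ways to do this, and there are also bad systems that can be decoded given enough data, computer power and hacking know-how. Strictly speaking, RSA is encryption rather than a truly one-way hash: anyone holding the keys and password can reverse it, so those must be kept secure.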
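As a rough sketch of one of those other strong approaches, the code below uses a keyed hash (HMAC with SHA-512) from Python’s standard library. This is not the RSA method used for the example above, and the function names `make_key` and `pseudonym` are hypothetical. It assumes the secret key is generated once and then stored securely or destroyed: without the key, someone cannot simply hash a list of likely names and compare the results. Codes produced this way are 128 hexadecimal characters long rather than 512.

```python
import hashlib
import hmac
import secrets


def make_key():
    """Generate a random 32-byte secret key (hypothetical helper)."""
    return secrets.token_bytes(32)


def pseudonym(identifier, key):
    """Return a keyed, one-way hash code for a name/ID as a hex string."""
    return hmac.new(key, identifier.encode("utf-8"), hashlib.sha512).hexdigest()


if __name__ == "__main__":
    key = make_key()  # keep this secret, or destroy it once the data is coded
    for name in ["Chris Evans", "Jo-anne Carlyle"]:
        print(pseudonym(name, key))
```

The same name with the same key always gives the same code, so multiple lines of data from one participant stay linked, while a different key gives completely different codes.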

Pseudonymisation is only one part of good protection of confidentiality: see “smallest identical subset” for more on this.

Try also #

Confidentiality
Data protection
Open data
Smallest identical subset

Chapters #

Not covered explicitly in the OMbook.

Online resources #

None currently nor likely from me: I only use tools from experts to do these things and wouldn’t want to risk messing up those tools by some mistake of my own!

Dates #

First created 25.xi.23, latest tweaks 17.iv.23.
