This sounds obvious, doesn’t it? It also seems obvious that clients, and research participants, have rights to confidentiality that involve anonymisation of their data. The law and ethics are actually a bit more complex than that: it’s not just that we all have rights to confidentiality, it’s also that we should have rights to decide how our “personal data”, i.e. data from or about us from which we can be identified, is used for as long as it remains identifiably ours. We should be able to give fully “informed consent” to what happens to that data (or to forbid it).

The catches are:
a) anonymising data may not be as easy as it sounds, and
b) going to extremes, whether by protecting all data that started out as identifiable or by allowing processing of that data only with “informed consent”, can cripple important ways to learn from experience to the benefit of future clients.

So the process of anonymisation is not always trivially easy, and deciding when and how data that started as identifiable but has been anonymised or, more usually, “pseudonymised” can be used is not easy either. The issues take us into national and international law as well as tough areas of ethics and of our own moral thinking.

In general psychology, health and medicine the 21st century has seen an explosion in “open data” and in journals requiring that “open data” is submitted with papers. This has many very good aspects, not least in reducing risks of fraud and in opening up opportunities for re-analyses and the application of new methods to existing data. However, the issues for the mental health and therapy world are serious, as a failure to protect confidentiality by anonymising open data completely could do great harm. The issues overlap with the age-old issues of protecting confidentiality in case reports and qualitative research: how do you remove information or change information without rendering the data misleading?

Details #

This is such a huge issue that I won’t go into it in much detail, but I will touch on the crucial distinction between anonymisation and pseudonymisation, and then on deanonymisation.


The name “pseudonymisation” comes from the principle of replacing names with pseudonyms but it has become much more than that. The point is to be able to identify, for every participant, that data came from that one participant and not from any of the others. Why is this not simply anonymisation?

Imagine you had repeated data from a lot of clients with only an ID and scores on a change measure at varying intervals. You could fully anonymise the data by ripping out the ID code: the scores that remained could never (realistically) be used to recover the identity of the participants. The catch is that you would no longer know what data was from whom: you would have a lot of occasion 1 data, a lot of occasion 2 … etc. You could not connect the scores across the occasions. This is why we almost always need pseudonymisation, not radical removal of IDs.

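So we replace the clients’ ID values with new values using a one-to-one mapping, so that each participant remains distinct from every other (no two participants share the same value), and with the replacement values chosen so that no “reverse mapping” from the new code back to the original ID is possible other than by having the translation table. This is usually done now using a “hash” function, which in practice gives the 1:1 mapping (two IDs colliding on the same code is vanishingly unlikely with a modern cryptographic hash) and makes reversing the mapping infeasible, at least when the hash is combined with a secret key or “salt” so that nobody can simply hash every plausible ID and compare the results.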
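To make the idea concrete, here is a minimal sketch of hash-based pseudonymisation in Python using the standard library’s `hmac` and `hashlib` modules. It is only an illustration under stated assumptions: the function name `pseudonymise`, the example IDs and scores, and the choice of a keyed SHA-256 hash truncated to 16 characters are all mine, not a prescribed method.

```python
import hashlib
import hmac
import secrets


def pseudonymise(client_id: str, secret_key: bytes, length: int = 16) -> str:
    """Return a keyed-hash pseudonym for one client ID (hypothetical helper).

    The same ID and key always give the same pseudonym, so repeated
    measurements from one client stay linked. Without the key, the
    pseudonym cannot feasibly be turned back into the ID.
    """
    digest = hmac.new(secret_key, client_id.encode("utf-8"), hashlib.sha256)
    # Truncating to 16 hex characters (64 bits) is an assumption made for
    # readability; collisions remain very unlikely for typical dataset sizes.
    return digest.hexdigest()[:length]


# Invented example data: (client ID, occasion, score on a change measure).
records = [
    ("C001", 1, 23),
    ("C002", 1, 31),
    ("C001", 2, 17),
    ("C002", 2, 28),
]

# The key plays the role of the translation table: store it securely and
# never share it alongside the pseudonymised data.
secret_key = secrets.token_bytes(32)

pseudonymised = [
    (pseudonymise(client_id, secret_key), occasion, score)
    for client_id, occasion, score in records
]
```

In this sketch the two rows for “C001” end up with the same pseudonym, so scores can still be connected across occasions, while the original IDs no longer appear anywhere in `pseudonymised`.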


It used to be suggested that just replacing names and reducing dates of birth to a year of birth was sufficient anonymisation of data. Certainly it’s minimal good practice to do those things even within teams working on data. However, it has been recognised that where there are a number of other sociodemographic variables it can be possible to identify someone from those data: deanonymisation. This is a particular issue where data might be aligned with other data, something increasingly possible when more and more data about people is entirely public or can be found by determined people unconcerned about the law or ethics. This is a rapidly developing field. However, if putting pseudonymised data into open access, a good rule is to ensure that no combination of variables that might be publicly recognised is shared by fewer than five people in the data: the “n>5 rule”. Ethnicity, gender and age group can easily be combined to create groups sharing the same values whose cell size is smaller than five. Processes to try to deanonymise data are sometimes called “jigsaw attacks”.
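A check of this kind is easy to automate. The sketch below counts how many people share each combination of potentially recognisable variables and reports the combinations with fewer than five people. The function name `small_cells`, the variable names and the data are invented for illustration, and the threshold follows the “fewer than five” wording above; set `min_size` higher if you read the rule more strictly.

```python
from collections import Counter


def small_cells(rows, keys, min_size=5):
    """Return the combinations of `keys` shared by fewer than `min_size` rows.

    `rows` is a list of dicts, one per person; `keys` names the variables
    that someone outside the team might plausibly recognise.
    """
    counts = Counter(tuple(row[key] for key in keys) for row in rows)
    return {combo: n for combo, n in counts.items() if n < min_size}


# Invented example data: one dict per participant.
rows = (
    [{"ethnicity": "A", "gender": "F", "age_group": "18-25"}] * 6
    + [{"ethnicity": "A", "gender": "M", "age_group": "18-25"}] * 5
    + [{"ethnicity": "B", "gender": "F", "age_group": "66+"}] * 2
)

risky = small_cells(rows, ["ethnicity", "gender", "age_group"])
for combo, n in risky.items():
    print(f"Only {n} participant(s) share {combo}: consider merging categories")
```

Here only the third group, with two participants, is flagged. In real data the usual remedies are to merge categories (for example, wider age groups) or to drop a variable before sharing.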

Try also #

Hash functions/codes/mapping
Jigsaw attacks
n>5 rule
Open data

Chapters #

Chapter 6.

Online resources #

None at the moment.

Dates #

First created 10.viii.23.
