Sometimes n=4 is enough

Back in 2002, with two colleagues, I published a paper:

Evans, C., Hughes, J., & Houston, J. (2002). Significance testing the validity of ideographic methods: A little derangement goes a long way. British Journal of Mathematical and Statistical Psychology, 55(2), 385–390.

Contact me if you’d like a copy!

In the paper we used a bit of old maths to show a simple way of validating idiographic/ideographic data. The data in question were principal component plots of personal repertory grids created by six patients in a prison psychotherapy group (who gave permission for this). These plots are completely personal to each individual. They depend on the elements the patients chose for the roles in the grid (e.g. victim, brother, boss, ideal self), the constructs on which they chose to rate them (e.g. aggressive, caring, selfish), and the ratings they gave each element.

Those principal component plots represent the best two-dimensional representation of all the ratings in each grid. Julia (Houston) had asked the two therapists if they could map the six plots back to the patients. As we said in the paper:
Both therapists matched four of the six pre-therapy grids successfully; one therapist matched all six post-therapy grids and the other matched three of the six. We sought to assess the probabilities that these matchings might have been achieved by chance alone. This paper reports the logic which shows that this probability was lower than a conventional criterion of significance ( p < 0.05) where four or six grids were matched correctly.

This is a completely general method; the steps are:

  • Take the data: the idiographic information you have from n individuals (n >= 4).
  • Shuffle the data.
  • Present them to someone who knows the people who created the data.
  • Ask the judge to match the data to the people.
  • The score is the number of correct matches.
  • If the score is 4 or more, regardless of n, the chance of achieving this by random matching alone is p < .05, i.e. statistically significant at the usual criterion.
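The counting behind that last step can be sketched in a few lines of code. This is a minimal Python sketch (not the paper's own code, and the function names are mine): it uses derangement numbers, i.e. counts of permutations with no correct matches, which is the "little derangement" of the paper's title.

```python
from math import comb, factorial

def derangements(m):
    """Number of ways to shuffle m objects so that none lands in its own place.
    D(0) = 1, D(1) = 0, and D(m) = (m - 1) * (D(m - 1) + D(m - 2))."""
    d = [1, 0]
    for i in range(2, m + 1):
        d.append((i - 1) * (d[i - 1] + d[i - 2]))
    return d[m]

def p_score_at_least(n, score):
    """P(a purely random matching of n objects gets `score` or more correct).
    Permutations with exactly k correct matches: C(n, k) * D(n - k)."""
    ways = sum(comb(n, k) * derangements(n - k) for k in range(score, n + 1))
    return ways / factorial(n)

# a score of four or more stays significant at p < .05 whatever n is:
for n in range(4, 21):
    assert p_score_at_least(n, 4) < 0.05

print(round(p_score_at_least(6, 4), 3))  # 0.022, the value reported in the paper
```

The same two functions reproduce every entry in the tables below (e.g. 15 ways of getting exactly four of six correct: C(6, 4) × D(2) = 15 × 1).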

Here’s the same in slightly tongue in cheek cartoon format.

Steps 1 and 2

Step 3

Steps 4 and 5

In the cartoon example the only thing that distinguishes the six sets of idiographic data is actually their colour (yes, this is not a very serious example). The judge successfully mapped four of the six, which has a probability (of scoring four or better by chance alone) of p = .022 (see the lookup tables below).

That is clearly less than .05 so it meets the conventional criterion of “statistical significance”, i.e. by that convention we reject the null hypothesis that no information is contained in the data and accept the alternative that, though the data were idiographic, and the way the judge did the mapping may have been unique to that one judge and their particular knowledge of the six people (i.e. an idiographic judgement on idiographic data), it has some validity.

To most of us who are neither mathematicians nor statisticians it may seem utterly counter-intuitive that, regardless of the number of objects, a score of four or more is always good enough for p < .05. Perhaps it’s so counter-intuitive that we switch off our own judgement and either decide that because the method was published in a good peer-reviewed journal it must be correct (it is!), or simply believe it cannot be correct.

However, it’s not as counter-intuitive as it may first seem: as n goes up, the number of ways of mapping just four of them correctly goes up rapidly, as this table shows.

Number of ways of getting four correct from n objects

 n    Number of ways of getting exactly four correct from n
 4    1
 5    0
 6    15
 7    70
 8    630
 9    5,544
10    55,650
11    611,820
12    7,342,335
13    95,449,640

However, the total number of ways of permuting the n objects is also rocketing up, and faster:

Total number of ways of permuting n objects

 n    Possible ways (n!)
 1    1
 2    2
 3    6
 4    24
 5    120
 6    720
 7    5,040
 8    40,320
 9    362,880
10    3,628,800
11    39,916,800
12    479,001,600
13    6,227,020,800
14    87,178,291,200

The two accelerations pretty much cancel out and so keep the probability of getting four or more correct by chance alone below .05 for any n as shown below.

Significance of scoring four or more

 n    Total possible permutations    Score    Ways of getting exactly four correct    p (four or more correct)
 4    24                             4        1                                       0.04
 5    120                            4        0                                       0.01
 6    720                            4        15                                      0.02
 7    5,040                          4        70                                      0.02
 8    40,320                         4        630                                     0.02
 9    362,880                        4        5,544                                   0.02
10    3,628,800                      4        55,650                                  0.02
11    39,916,800                     4        611,820                                 0.02
12    479,001,600                    4        7,342,335                               0.02
13    6,227,020,800                  4        95,449,640                              0.02
14    87,178,291,200                 4        1,336,295,961                           0.02

This shows how the p value for various scores (on the y axis) stabilises as the number of objects, n, goes up (x axis).

Here’s the same data but with the p values on a log10 scale on the y axis.
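There is a tidy reason why the curves flatten (a standard result about random permutations, not spelled out in the post): as n grows, the number of correct matches under purely random matching converges to a Poisson distribution with mean 1, so

\[
P(\text{score} \ge 4) \;\to\; 1 - e^{-1}\left(\frac{1}{0!} + \frac{1}{1!} + \frac{1}{2!} + \frac{1}{3!}\right) \;=\; 1 - \frac{8}{3}e^{-1} \;\approx\; 0.019,
\]

which is why the p value for a score of four or more settles just below .02 whatever n is.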

Why kappa? or How simple agreement rates are deceptive

Created 24.i.22

I did a peer review of a paper recently and met an old chestnut: the claim that the inter-rater agreement reported was good because the simple agreement rates were “good”. This is nonsense; it has been written about for probably a century and alternative ways of summarising agreement rates have been around for a long time. Jacob Cohen invented his “chance corrected” “coefficient of agreement for nominal scales”, kappa, in 1960 (Cohen, 1960). That made me think it might be useful to have a blog post here, perhaps actually several, linking to demonstrations of the issues in my “R SAFAQ” (“Self-Answered Frequently Asked Questions”, a.k.a. Rblog).


The issue is very simple: if the thing being rated is not split around 50:50 in the ratings, then agreement even by chance is going to be high. Let’s say two raters are asked to rate a series of photos of facial expressions for the presence of “quizzically raised eyebrows”, that only 10% of the photos they are given look even remotely quizzical, and that they are told the rate is about 10% and use that information.

Now suppose they have absolutely no agreement, i.e. only chance agreement, about what constitutes a “quizzically raised eyebrow”: they may well still each rate about 10% of the photos as quizzical (and 90% as not). In that case, by chance alone, rater B will rate as quizzical 10% of the photos that rater A rated as quizzical: rate of agreement 10% * 10% = one in a hundred, 1%. However, rater B will also rate as not quizzical 90% of the 90% of photos that rater A rated as not quizzical: rate of agreement 90% * 90% = 81%. So their raw agreement rate is 82%, which sounds pretty good until we realise that it arose by pure chance. Here’s an aesthetically horrible table of that for n = 100 at the perfect chance level of agreement. (In real life, sampling vagaries mean it wouldn’t be quite as neat as this, but it wouldn’t be far off.)

n = 100                           Rated quizzical by rater B    Rated NOT quizzical by rater B    Row totals
Rated quizzical by rater A        1                             9                                 10
Rated NOT quizzical by rater A    9                             81                                90
Column totals                     10                            90                                100
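Cohen’s kappa rescales observed agreement against that chance agreement: kappa = (po − pe) / (1 − pe), where po is the observed proportion of agreement and pe the proportion expected by chance from the marginal totals. A minimal sketch (in Python here rather than the R I usually use; the function name is mine):

```python
def cohens_kappa(table):
    """Cohen's kappa for a k x k table of counts.
    Rows are rater A's categories, columns are rater B's."""
    n = sum(sum(row) for row in table)
    # observed agreement: proportion of counts on the diagonal
    po = sum(table[i][i] for i in range(len(table))) / n
    # chance-expected agreement from the marginal (row and column) totals
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    pe = sum(r * c for r, c in zip(row_totals, col_totals)) / n ** 2
    return (po - pe) / (1 - pe)

# the "quizzical eyebrows" table above: 82% raw agreement, kappa of zero
print(cohens_kappa([[1, 9], [9, 81]]))  # 0.0
```

Here po and pe are both .82, so kappa is exactly zero: the 82% raw agreement is all chance.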

That’s why Cohen invented his kappa as a “chance corrected” coefficient of agreement. It actually covers ratings with any number of categories, not just binary “quizzical/not-quizzical”, and although there are arguments that it’s an imperfect way to handle things, it is easy to compute (look it up: the wikipedia entry, as so often for stats, takes some beating). Pretty much any statistics package or system will compute it for you and there are online calculators that will do it too.

The arguments against it are sound but fairly fine print, and it’s orders of magnitude better than raw agreement. Kappa for the chance agreement in that table is zero, as it should be.

See it for different rates of the rated quality from R

This plot illustrates the issue pretty clearly. The x axis has the prevalence of the quality rated (assuming both raters agree on that). The red line shows that raw agreement does drop to .5, i.e. random, 50/50 agreement, where the prevalence is 50%, but that it rises to near 1, i.e. near perfect agreement, as prevalence tends to 0% or 100%. By contrast, and as a sensible agreement index should, kappa remains on or near zero across all prevalence rates.
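The red line in that plot is just the chance-agreement arithmetic from earlier applied across prevalences: with independent raters who both use the base rate p, chance raw agreement is p² + (1 − p)². A quick check (Python here rather than the R behind the plot):

```python
# chance-level raw agreement when both raters independently
# rate "present" with probability equal to the prevalence p
def chance_raw_agreement(p):
    return p * p + (1 - p) * (1 - p)

for p in (0.01, 0.10, 0.50, 0.90, 0.99):
    print(p, round(chance_raw_agreement(p), 4))
# at p = 0.5 the value is 0.5 (truly 50/50);
# at p = 0.01 or 0.99 it is 0.9802, i.e. near "perfect" agreement by chance alone
```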

See my “Rblog” or “R SAFAQ” entry about this for more detail and plots.


Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37–46.

Against score levels?

Created 19.x.21

This comes out of receiving intermittent requests at my CORE site for “the graph with the colours” or “the graph with the levels” and for “the scoring levels”, most recently for the GP-CORE and the LD-CORE. I always explain that I don’t provide these. I’m posting about the issue here not on the CORE site as the issues are general.

People are looking for things like this:

YP-CORE blank graph

Or this:

CORE-10 blank graph

The first was a second page of early YP-CORE forms; the other is from Connell, J., & Barkham, M. (2007). CORE-10 User Manual, Version 1.1. CORE System Trust & CORE Information Management Systems Ltd. I think I’m within my rights to post both here as a CST trustee; however, I wasn’t involved in creating either of them (as will become clear!)

They’re obviously appealing, so why am I against them? It’s partly because, while I do understand ways we can dichotomise scores by defining cutting scores that separate “clinical” from “non-clinical” scores, the finer splits are another matter. There are a number of ways of doing the dichotomising but the CORE approach has always been the Jacobson et al. “Clinically Significant Change (CSC)” method c. There are arguments for other methods but that one is fairly easy to understand.

Part of my problem is that I have no idea how we can establish four other cutting points to get a six-level split (“sexotomy”?) of the possible scores. In the manual Connell and Barkham say:

“A score of 10 or below denotes a score within the non-clinical range and of 11 or above within the clinical range. Within the non-clinical range we have identified two bands called ‘healthy’ and ‘low’ level distress. People may score on a number of items at any particular time but still remain ‘healthy’. Similarly, people may score in the ‘low’ range which might be a result of raised pressures or particular circumstances but which is still within a non-clinical range. Within the clinical range we have identified the score of 11 as the lower boundary of the ‘mild’ level, 15 for the ‘moderate’ level, and 20 for the ‘moderate-to-severe’ level. A score of 25 or over marks the ‘severe’ level.”

Connell & Barkham, 2007, p.10.

I like the honesty of “we have identified” but I can find nothing in the manual to say how those cutting points were identified.

So what’s going on here? I am becoming uneasy just explaining to people that I don’t provide such levels or those graphs, as I suspect the cutting points are essentially arbitrary. I think it’s time to wonder why they appeal: why do authors and publishers of measures provide them? (It’s not just CORE; many other measures do this too.)

I think one useful answer is that, like “clinical/non-clinical” cutting points, they paper over a general unease about what we’re doing with these numbers. They appear to answer the question: what do these scores mean?

Well, of course, they’re just the numbers we create from whatever scoring system the measure uses to convert the response choices the person completing the measure made. However, that doesn’t answer what they “mean”.

We could start by asking what the person completing the measure meant them to mean: did that person know the number they were creating? For some measures (most but not quite all CORE measures) the person may be able to see the numbers allocated to the answer options. For fewer measures (most but not quite all CORE measures on paper) the person may be able to see the actual scoring system at the end of the measure, so it’s possible that some people consciously create their score: the person may mean us to see a score of 1.3. However, I suspect that’s very rare. I suspect it’s commoner that someone might calculate their own score, and increasingly app or online presentations of measures may do this automatically so the person completing the measure may see a score, say 1.4. Depending on the system they might or might not then be able to go back and change their score. The CORE-bots are one example of a system that shares the scores with the person completing the measure; providing scores is probably becoming the norm. (No, not these CoreBots, these CORE-bots!)

Even if the person creating the score knew their score, even in the very exceptional situation (usually forensic?) in which they knowingly created the score they wanted, is this a communication of a number in one mind to a number in another mind? Are we any nearer to what the score “means” to someone seeing it other than the person who created it?

I am going to sidestep the increasingly common situation in which there is no receiving mind: where the number just goes into a database with no wetware processing, no other mind giving it attention. I am also going to sidestep my own situation in which I receive scores: 99.999% of the scores I receive, I receive as a data processor. Most often, and for good data protection/confidentiality reasons, I have no idea who chose the answers, who created the item numbers and hence the score.

The requests I get for those empty graphs, for “the levels”, are I think all coming from settings in which there is a receiving mind who has some relationship with the person who created the scores. So why am I opposed to giving them levels or nice empty graphs like the ones above?

I entirely approve of graphing someone’s scores against time: that is one way of putting meaning on the scores, and I would love to provide ways people could do that easily and approve of the many systems that do provide it. To me such graphing retains the simple numbers, but for most of us converting the numbers to points on a graph makes it easier for us to process change. If I am shown 1.6, 0.9, 1, 1.2, 1, 0.5, 0.2, 2.8, 2, 1.2, 1.1, 1.6, 1.7, 1.9, 2, 1.5 I don’t find it has much “meaning” even if I know these are “clinical” scores on the CORE-10, i.e. the total of the item scores, each 0 to 4, across the ten items. However, that little run of numbers can create this.

All that’s happened there is that numbers have been converted into distances, with two additions: there is a red reference line and the subtitle tells me that the line marks the score of 15, which was the 90th centile of scores from a set of 197 people, i.e. the score that, as nearly as the 197 baseline scores allow, has 90% of the 197 scoring below it and 10% scoring above it. Now these numbers take on meaning for me and it seems to me that this person’s scores start fairly high for the group (volunteers in a support service during the early stages of the covid pandemic). Her/his scores vary to week 7 but have dropped markedly; then they rocket up … you read on.

For me this is reading meaning into the numbers: I can explain it, to me it’s plausible and if you don’t think it’s plausible you are completely at liberty to disagree with the mapping and read your own meaning into the data.

I entirely agree with the wish to do something with questionnaire score numbers that we hope will help us interpret, understand, the numbers. That’s what I try to do all the time with statistics or graphs or both. I just don’t agree with converting the scores to words like “mild”, “moderate” or “severe”, as for me there must always be a logic to the conversion, one that I think I understand and that I can try to explain.

I use “painting by numbers” provocatively. You could argue that converting a number to a colour is as logical as converting it to a distance. However, our visual system means it really isn’t the same. Here are those numbers as colours.

Even without the problem that the roughly 10% of the male population with red/green colour blindness won’t see what those of us with normal colour vision see there, it’s simply not interpretable as the earlier plot was. Had I given a colour gradient fill behind the first plot I would simply have added an implication, perhaps “safe” versus “dangerous”; actually, I would have added words even without using them.

That’s my real objection to these levels: converting numbers to words, “mild” and “severe” for example, or just giving numbers colours from green to red is covertly forcing a set of meanings onto the numbers. I understand the urge, I just think it’s trying to reassure us that the numbers are as simple as that suggests. I believe they’re not.

Hm, I can see that this post actually follows on from my last, about “blended and layered” research. I can see now that it leads into some others, here and on my CORE site, which are brewing, and these are issues that Jo-anne (Carlyle) and I develop in our book which is coming out through SAGE any day now, we are told.

The glorious daily Wikipedia feed introduced me to Daniel J. Boorstin. Well, he died in 2004 so sadly we didn’t get to have a drink together; however, I did learn of his glorious comment: “I write to discover what I think. After all, the bars aren’t open that early.” Genius! (If you’re not a genius, quote the people who are or were!!)

Warm acknowledgements

The data behind the graphs come from an excellent piece of work in Ecuador in which late-stage trainee and qualified psychologists volunteered to provide telephone support to families struggling with deaths and other direct effects of coronavirus and/or with the lockdown. Dr. Clara Paz’s university, UDLA (hm, my university now), as ever seemed to get things right and go beyond the minimum: it encouraged the volunteers to fill in the CORE-10 weekly and scores were shared with their supervisors to put meaning on changes like those in the graph above. There is more about the study in a post in the Spanish CORE subsite (and hence in Spanish!) and in the article about the work.

Oh, and the headline image is of sun rising here this morning: that’s just glorious colour!

Blended & layered research: Avdi & Evans 2020

Created 11.ix.21.

Yesterday I went, virtually, to Malta for the 8th conference of Qualitative Research in Mental Health, QRMH8 to its loyal friends. I was there because Professor Avdi, from Aristotle University in Thessaloniki, gave a presentation of the work she and I did for the paper in the title of this post: Avdi, E., & Evans, C. (2020). Exploring Conversational and Physiological Aspects of Psychotherapy Talk. Frontiers in Psychology, 11, 591124 (open access).

I’m very proud of that paper as it was a genuine attempt to do more than a “mixed methods” piece of work, i.e. a mix of qualitative and quantitative methods. The paper came out of work Evrinomy had done within the Relational Mind research project, a Finnish-led collaborative project using both qualitative and quantitative methods to explore that title: minds in relationship and perhaps constituted by relationships. I’ve been following their qualitative work, and, intermittently, the QRMH conferences, for some years now, and Evrinomy and I have known each other for many years, starting with the Greek translation of the CORE-OM co-led by myself and Dr. Damaskinidou: a good friend to both of us who introduced us through that work.

Evrinomy approached me some years back, 2015 or 2016 perhaps, as I think I was still in clinical work. At that point she was asking my views on work she was doing with colleagues in Thessaloniki trying to link physiological arousal indicators with the processes in couple and individual therapies in which therapists and clients wore heart and respiratory rate recorders. That led to me not being terribly useful to a very tolerant PhD student supervised by Evrinomy, Anna Mylona, on work she was doing linking “rupture” classification of the transcripts from 2018 to 2020, and that led, in turn, to this paper that Evrinomy and I got done last year.

While Evrinomy, with (I think) some genuinely useful input from me, worked up a fascinating conversation analytic (CA) unpicking of the session, we (well, probably “I” is more accurate) worked through a series of quantitative tools to look at the changes in ASV, the arousal measure, through the session we were dissecting. I taught myself a lot about time series analyses and got to understand PDC, partially directed coherence analysis, which her Thessaloniki colleagues (from neurophysiology) had advocated. In the end we agreed only to use very simple plots of the data against the eight “Topical Episodes” (TEs) that emerged from the CA. That led to plots like these. (Click on them to get the full plot.)

If you’re interested to read more, particularly the excerpts and CA, do look at the paper. As an example of truly blended, rather than simply mixed, research it’s not sophisticated but what I think did emerge was what happens when a largely qualitative researcher (Evrinomy is seriously experienced and skilled) and a quant geek like myself, but who both share a clinical background try to complement each other. It’s not particularly clear in the paper (it’s big and quite dense as it is!) but we each learned a lot about blending.

Three simple methodological things emerged for me:
1. one huge strength of statistical quantitative research is the ability to formulate “objective” tests to tell us whether we appear to have non-random things in our data;
2. however, very often the purity of those models is not really a good model of how the actual data arose, and sometimes “descriptive, exploratory and ‘estimating’” statistical methods may be more robustly useful;
3. if your methods are so sophisticated, complex and unfamiliar that practitioners will be essentially reduced to the role of audience at a display of magic we have an odd relational mind relationship being created between researchers/authors, readers (practitioners) and the data.

#2 was clearly the case for our data: a lot of the sophisticated things I had hoped might be useful were clearly stretching the relationship between data and model, and others (for me, the PDC method) fell into that “this is magical” #3, so we ended up with very simple plot methods but tried to keep a genuine blending of quantitative and qualitative data.

Perhaps more interestingly and importantly, this pushed us into a lot of thinking about the fact that methodological issues like those, or any of the many qualitative methodological choices, actually sit on top of epistemological choices. (Quick acknowledgement to Dr. Edith Steffen, now in Plymouth, who, when we overlapped in the University of Roehampton, challenged me to take epistemology seriously despite the brain ache that causes me!)

There is an odd polarisation that goes with the general qual/quant polarisation in research about minds: qualitative papers almost always have some statement of epistemological position and, largely implicitly, locate that in the mind of the author(s), exposing it for consideration by the readers; by contrast, epistemological position statements are hardly ever seen in quantitative papers. This has the effect of leaving the reader of quant papers to assume the paper isn’t arising from an authorial mind or minds but from some abstract “reality”: in fact the papers claim truth value in what seems to me a completely untenable “empirical positivist” position. I’m left wondering if we could blend our methods so much more usefully if we started to insist that all papers have at least a one line statement of epistemological position. I’m trying to make sure I put mine into all my papers now, and to insist that authors put that into their work when I’m peer reviewing. I think it’s going to be a long time before this might become a norm, and until it does I don’t think we’ll tap the real power of genuinely blended methods instead of often very tokenistic mixed methods.

Onwards! Oh, here’s a souvenir from my only non-virtual visit to Malta, in 2018.

I do like the blending of languages and a clear plot of the message! Malta is a fascinating place. Perhaps I’ll get round to doing the intended second blog post in my personal blog that was supposed to complement the first, rather negative one. If you’re wondering about the choice of header image: it’s a recent image from the terrace outside my current workplace and I thought the juxtaposition of the mountains through atmospheric haze and the 1970s brutalist balcony and wooden fences had something of the flavour of blending qual and quant! For more on living there you might try “Squatting (with my coffee)“!

NICE consultation 2021

[Written 9.vii.21]

NICE is having a consultation. As the Email I got says:

We have now embarked on the latest phase of this user research. I’d like to invite you to contribute so we can better understand your views and experiences of NICE. Your feedback is truly important and will help us continue our journey to transform over the next 5 years.
The survey is open until Friday 16 July 2021. So, please do take 10 minutes to share your views before then.
Complete our short survey
Gillian Leng CBE
Chief executive, NICE

So I started it and got to “Please explain why you feel unfavourably towards NICE.” which had a nice big free text box. So I typed in my set of, I think fair and carefully thought out criticisms (below) and hit the button to move on to the next question and got this.

We’re sorry but your answer has too much text. The current length of your answer is 3944 and the maximum length is 1024 characters. Please change your answer and try again.

Wonderful! No initial warning that only 1024 characters were allowed, no warning as you approach 1024, no block when you hit 1024. Terrible design!

For what it’s worth, these were my 3944 characters.

What was originally a system to provide information has morphed relentlessly into something that is used in the commoditisation of health care to dictate what practitioners should do. It is so preoccupied, to a large extent understandably, with containing exploding pharmaceutical costs, that it is very focused on RCT evidence used to assess cost effectiveness. That’s not bad for those pharmaceutical interventions that can be given double blind, but even there generalisability appraisal is poor, with a dearth of attention to post-marketing, “practice based evidence” to see how RCT findings do or do not generalise. For most interventions, all psychosocial interventions, where double blind allocation is impossible, this is crazy and leads almost all research funding to be diverted into RCTs “as they have political influence”, even though their findings are such that it is essentially impossible to disentangle expectancy/placebo/nocebo effects from “real effects” (there is an interesting argument about that separation but there is some meaning in it). This goes on to make it impossible with your methodologies to evaluate complex real world interventions, including psycho-social ones, impossible to compare those with pharmaceutical or surgical/technological ones, and impossible to evaluate mixed interventions.

Decisions are theoretically about quality of life but, at least in the mental health field, all work I have seen has been based on short term symptom data and makes no attempt to weight in what QoL and functioning data does exist. This is not a new issue: McPherson, S., Evans, C., & Richardson, P. (2009). The NICE Depression Guidelines and the recovery model: Is there an evidence base for IAPT? Journal of Mental Health, 18, 405–414, showed this clearly 12 years ago (yes, I contributed to that). In addition, foci are not always, but are generally, on diseases, leading to a neglect of the growing complexities of multi-diagnostic morbidity and of the whole complex interactions of mind and body, even when there are crystal clear, organic, primary disorders (Diabetes Mellitus and cancers are classic examples of clear organic pathologies where the complexities of how individuals and families handle the same organic pathology make huge differences in problem and QoL trajectories). In the mental health domain, to make a rather crude physical/mental distinction, there are crystal clear diagnoses of organic origin (Huntington’s Disease, and a tiny subset of depression, anxiety disorders, much but not all intellectual disabilities and some psychotic states), but the disease model, certainly in a simple “diagnosis is all and dictates treatment to NICE guidelines” form, is often more of a handicap than an aid.

That focus also leaves NICE almost irrelevant when it has to address “public health attitude” issues like obesity, diet more generally, smoking, alcohol and other substance abuse and spectacularly at the moment, attitudes to vaccination and social interventions to minimise cross-infection. Again, cv-19 has exposed this, and the slowness of NICE, horribly, but all the warnings have been there for decades.

In addition, NICE processes come across as increasingly smug (routine Emails I get from NICE long ago lost any sense that there could be any real doubts about your decisions) and the history of the recent depression guideline should be a marker that the good law project should turn from the government to NICE processes. From what I see of that, NICE has come across as opaque and more concerned to protect its processes than to recognise the huge problems with the particular emerging guideline but really more generally.

Why waste time typing all this: this is all so old and has so consistently developed to avoid and minimise problems that I suspect this will be another process of claiming to have been open and listening but changing little.

New developments here

Created 14.iv.21

Oh dear, about 16 months since I last posted here. Still, I hope that featured image above gives a sense of spring arriving! I guess some of those months have been pretty affected by the coronavirus pandemic; there’s a little bit more about how that impacted on me and how I spent a lot of those months high up in the Alps, very protected from cv-19. During this time I have been working very hard and been fairly successful getting papers I like accepted (see publication list and CV).

In the last month I have protected some work time away from Emails, data crunching and paper writing and come back to web things. That has resulted in:

  1. My SAFAQ or Rblog. This is a set of Self-Answered Frequently Asked Questions (hence SAFAQ) and is the best way I have found to present how I use R; it allows me to do that in a way that I can’t here in WordPress.

That is quite closely linked with:

  2. The CECPfuns R package. R is a brilliant, open source, completely free system for statistical computation that runs on pretty much any Linux, on Macs and on Windows. It is partly made up of packages of functions and I have written one that I hope will grow into a useful resource for people wanting to use R for therapy and psychology work but wanting a fairly “newbie friendly”, rather than R geeky, hand with that. It complements the SAFAQ/Rblog. There is a web site built out of the package documentation which holds all the geeky details.

Those are both developing quite fast with the latter getting updates daily, sometimes more often, and the former getting new items more often than once a week. I suspect that now I have announced them here, I may do intermittent posts here that just give an update about one or both of those and they will get to be linked more into pages here. There are two more key developments coming up:

  1. My own shiny server here that will provide online apps for data analysis and providing explanations of analytic tools.
  2. A book “Outcome Measures and Evaluation in Counselling and Psychotherapy” written with my better half, Jo-anne Carlyle, that should be coming out through SAGE in late November (and they have just sent us the first proofs exactly to their schedule so perhaps I should start believing that it will come out then.) That is aimed at demystifying that huge topic area and, we hope, making it easier for practitioners both to understand, and critique, it, and to start doing it. That will lean heavily on SAFAQ pages and the apps on the shiny server.

And now, in radical contrast to the featured/header image, something completely different:

An 8GB Raspberry Pi 4

That’s sitting on my desk, about the size of a small book. It’s a local version of the system that, I hope, will host the shiny server!

Ethics committees and the fear of questionnaires about distress

Created 10.xii.19

Perhaps this should be a post, or even an FAQ, on my CORE web site, but then I fear it would be taken too formally so it’s here. However, I’ll put a link to this from the CORE blog: this thinking started with one of a sudden slew of Emails I’ve had coming to me via the CORE site.

I won’t name names or Universities but I will say that it came from the UK. I think it could easily have come from other countries, but my general experience is that many countries still have less of this problem: they seem less fearful of asking people about unhappiness or even self-destructiveness than many UK ethics committees seem to be.

The specific problem is the idea that if a research project asks people, particularly young people (teenagers or university students), about distress, and particularly about thoughts of self-harm or suicide, then there’s a terrible risk involved and the project shouldn’t happen. This sometimes takes the form of saying that it would be safer only to ask about “well-being” (or “wellbeing”, I’m not sure any of us know if it needs its hyphen or not).

A twist of this, the one that prompted this post, is the idea that the risk might be OK if the researcher using the measure, offering it to people, is clinically qualified or at least in training on a clinical course. That goes with a question I do get asked fairly regularly about the CORE measures: “do you need any particular qualifications to use the measures?”, which has always seemed to me to be about the fantasy that if we have the right rules about who can do what, everything will be OK.

This post is not against ethics committees. I have worked on three ethics committees and have a huge respect for them. They’re necessary. One pretty convincing reading of their history is that their current form arose particularly out of horrors perpetrated by researchers, both in the US and the UK, and also in the concentration camps. Certainly the US and UK research horrors did lead to the “Institutional Review Boards (IRBs)” in the States and the “Research Ethics Committees (RECs)” in the UK. Those horrors that really were perpetrated by researchers, particularly medical researchers, but not only medical researchers, are terrifying, completely unconscionable. It’s clearly true that researchers, and health care workers, can get messianic: can believe that they have divine powers and infallibility about what’s good and what’s bad. Good ethics committees can be a real corrective to this.

Looking back, I think some of the best work I saw done by those ethics committees, and some of my contributions to those bits of work, were among the best things I’ve been involved with in my clinical and research careers, so I hope it’s clear this isn’t just about a researcher railing against ethics committees. However, my experience of that work brought home to me how difficult it was to be a good ethics committee and I saw much of the difficulty being the pressure to serve, in the Freudian model, as the superego of systems riven with desires, including delusional aspirations to do good through research. I came to feel that those systems often wanted the ethics committee to solve all ethical problems partly because the wider systems were underendowed with Freud’s ego: the bits of the system that are supposed to look after reality orientation, to do all the constant, everyday, ethics they needed done.

In Freud’s system the superego wasn’t conscience: a well functioning conscience is a crucial part, a conscious part, of his ego. You can’t have safe “reality orientation” without a conscience and, as it’s largely conscious, it’s actually popping out of the top of his tripartite model, out of the unconscious. His model wasn’t about the conscious, it was about trying to think about what can’t be thought about (not by oneself alone, not by our normal methods). It was about the “system unconscious”: that which we can’t face, a whole system of the unreachable which nevertheless, he was arguing, seemed to help understand some of the mad and self-destructive things we all do.

In my recycling of Freud’s structural, tripartite, model, only his id, the urges and desires, is unequivocally and completely unconscious; the superego has some conscious manifestations and these do poke into our conscious conscience; and the ego straddles the unconscious (Ucs from here on, to speed things up) and the conscious. (I think I’m remembering Freud, the basics, correctly, it was rather a long time ago for me now!)

I’m not saying that this model of Freud’s is correct. (After all, Freud with theories was rather like Groucho Marx with principles: they both had others if you didn’t like their first one …) What I am arguing (I know, I do remember, I’ll come back to ethics committees and questionnaires in a bit) is that this theory, in my cartoon simplification of it, may help us understand organisations and societies, even though Freud with that theory was really talking about individuals.

As I understand Freud’s model, it was a double model. He was combining his earlier exploration of layers of conscious, subconscious and Ucs with this new model with its id, superego and ego. They were three interacting systems with locations in those layers. With the id, ego and superego Freud was mostly interested in their location in the unconscious. Implicitly (mostly, I think) he was saying that the conscious (Cs) could be left to deal with itself.

That makes a lot of sense. After all, consciousness is our capacity to look after ourselves by thinking and feeling for, and about, ourselves. To move my metaphors on a century, it’s our debugging capability. Freud’s Ucs, like layers of protection in modern computer operating systems, was hiding layers of our functioning from the debugger: our malware could run down there, safely out of reach of the debugger.

The id, superego, ego model is surely wrong as a single metatheory, as a “one and only” model of the mind. It’s far too simple, far too static, far too crude. Freud did build some two person and three person relatedness into it, but it was still a very late steam age, one person model, desperately weak on us as interactional, relational, networked nodes; it was a monadic model really.

However, sometimes it fits! My experience on those committees, and equally over many more years intersecting with such committees, is that they can get driven rather mad by the responsibilities to uphold ethics. They become, like the damaging aspects of Freud’s model of the individual’s superego, harsh, critical, paralysing, sometimes frankly destructive. The more the “primitive”: rampant desire (even for good), anger and fears of premature death and disability gets to be the focus, the more they risk losing reality orientation, losing common sense and the more the thinking becomes rigid. It becomes all about rules and procedures.

The challenge is that ethics committees really are there to help manage rampant desire (even for good), anger and fears of premature death and disability. They were created specifically to regulate those areas. They have an impossible task and it’s salutary to learn that the first legally defined medical/research ethics committees were created in Germany shortly before WWII and theoretically had oversight of the medical “experiments” in the concentration camps. When society loses its conscience and gives in to rigid ideologies (Aryanism for example) and rampant desires (to kill, to purify, to have Lebensraum even) perhaps no structure of laws can cope.

OK, so let’s come back to questionnaires. The particular example was the fear that a student on a non-clinical degree course might use the GP-CORE to explore students’ possible distress in relation to some practical issues that might plausibly not help students with their lives, or, if done differently, might help them. The central logic has plausibility. I have no idea how well or badly the student was articulating her/his research design, I don’t even know what it was. From her/his reaction to one suggestion I made about putting pointers to health and self-care resources at the end of the online form, I suspect that the proposal might not have been perfect. Ha, there’s my superego: no proposal is perfect, I’m not sure any proposal ever can be perfect!

What seemed worrying to me was that the committee had suggested that, as someone doing a non-clinical training, s/he should leave such work and such questionnaires to others. To me this is hard to understand. S/he will have fellow students who self-harm, some who have thoughts of ending it all. One of them may well decide to talk to her/him about that after playing squash, after watching a film together.

Sure, none of us find being faced with that easy: we shouldn’t. Sure, I learned much in a clinical training that helped me continue conversations when such themes emerged. I ended up having a 32 year clinical career in that realm and much I was taught helped (quite a bit didn’t but we’ll leave that for now!) It seems to me that a much more useful, less rule bound, reaction of an ethics committee is to ask the applicant “have you thought through how you will react if this questionnaire reveals that someone is really down?” and then to judge the quality of the answer(s). The GP-CORE has no “risk” items. It was designed that way precisely because the University of Leeds, which commissioned it to find out about the mental state of its students, simply didn’t want to know about risk. (That was about twenty years ago and it’s really the same issue as the ethics committee issue.)

One suggestion from the committee to the student was only to use a “well-being” measure. Again, this seems to me to be fear driven, not reality orienting. There is much good in “well-being work”, in positive psychology, and there is a very real danger that problem focus can pathologise and paralyse. However, if we only use positively cued items in questionnaires and get a scale of well-being then we have a “floor effect”: we’re not addressing what’s really, really not well for some people. We specifically designed all the CORE measures to have both problem and well-being items to get coverage of a full range of states. The GP-CORE is tuned not to dig into the self-harm realm but it still has problem items; the CORE-OM, as a measure designed to be used where help is being offered to people who are asking for it, digs much more into self-harm.

Many people, many younger people, many students, are desperately miserable; many self-harm; tragically, a few do kill themselves. Yes, clinical trainings help some people provide forms of help with this. However, improving social situations and many other things that are not “clinical” can also make huge differences in Universities. (In the midst of industrial action, I of course can’t resist suggesting that not overworking, not underpaying academics, not turning Universities into minimum wage, temporary contract, degree factories, might help.)

Misery, including student misery, is an issue for all of us, not just for some select cadre thought to be able to deal with it by virtue of a training. So too, ethics is everyone’s responsibility. Perhaps we are institutionalising it into ethics committees, into “research governance” and hence putting the impossible into those systems. We create a production line for ethics alongside the production lines for everything else. Too often perhaps researchers start to think we just have to “get this through ethics” and not really own our responsibility to decide if the work is ethical. Perhaps too many research projects now are the production line through which our governments commission the research they want, probably not the research that will question them. Perhaps that runs with the production line that produces docile researchers. It’s time we thought more about ethics ourselves, and both trusted ourselves and challenged ourselves, and our peers, to engage in discussions about that, to get into collective debugging of what’s wrong. Oops, I nearly mentioned the UK elections … but it was a slip of the keyboard, it’ll get debugged out before Thursday … or perhaps it would if I still needed to be on the right conveyor belts, the right production lines.

Oh, that image at the top: commemoration ‘photos, from family albums I would say, of the “disappeared” and others known to have died at the hands of the military, from Cordoba, Argentina. From our work/holiday trip there this summer. A country trying to own its past and not fantasize.

I too was a medical student in 1975. Would I have been brave? Ethical?

Data entry: two out of range item scores can really affect Cronbach’s alpha

This little saga started over a year ago when I helped at a workshop a psychological therapies department held about how they might improve their use of routine outcome measures. They were using the CORE-OM plus a sensible local measure that added details they wanted and for which they weren’t seeking comparability with other data.

In the lunch break someone told me s/he had CORE-OM data from a piece of work done in another NHS setting (with full research governance approval!) The little team that had put a lot of work into a small pragmatic study felt stymied because the Cronbach alpha for their CORE-OM data was .65 and they were worried that this meant that perhaps the CORE-OM didn’t work well for their often highly disturbed clientèle. They had stopped there but thought of asking me about it.

My reaction was that I shared the concern about self-report measures, not just the CORE-OM, perhaps not having the same psychometrics, not working as well, in severely disturbed client groups as in the less disturbed or non-clinical samples in which they’re usually developed. However, I hadn’t thought that would bring the alpha down that low and wondered if they had forgotten to reverse score the positively cued items.

As everyone’s crazily busy I didn’t hear anything for a long while, but then got a message that they had checked and the coding was definitely right, and asking whether I would have a look at their data in case it really was about the client group, as they knew I was interested in how severity, chronicity and type of disturbance may affect clients’ use of measures.

I agreed and received the well anonymised data. About 700 participants had completed all the items and the alpha was .65 (not that I really doubted them, I just like to recheck everything!) So I checked the item score ranges, though I hadn’t really thought there was likely to be much by way of data entry errors. There wasn’t much: just two out of range items in over 23,000. One was 11 and the other 403. Changing them to missing, and hence dropping two participants, resulted in an alpha of .93 with a parametric 95% confidence interval from .93 to .94, i.e. absolutely typical for CORE-OM data.

I would never have believed that just 0.008% of items being incorrect could affect alpha that much, even if one was 403 when the item upper score limit is 4: I was wrong! Well, perhaps it’s not quite that low a percentage. If that 11 was a 1 for the one item (item 22) and another 1 which should have gone into item 23, then perhaps many of the remaining items for that client were wrong; same for the 403 for item 28, after all 1, 4, 0 and 3 are all possible item scores on the CORE-OM. That would take the incorrect entries up to 0.08%. However, if something like failure to hit the carriage return is the explanation then there should have been one or more missing items at the end of the entries for that client and their data would never have made it into the computation of alpha. Perhaps a really badly out of range item at a rate of just 0.008% is enough to bring alpha down this much. Only checking back to the original data will tell. I hope they still have the original data.
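
To see how one wild value can do this, here is a small simulation, a sketch in Python rather than the R I actually use, with everything hypothetical except the rough dimensions (700 respondents, 34 items scored 0–4, and a single 403) taken from the story above:

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha for an (n_respondents, n_items) score matrix:
    alpha = k/(k-1) * (1 - sum(item variances) / variance of total scores)."""
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1).sum()
    total_var = scores.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars / total_var)

rng = np.random.default_rng(1)
n, k = 700, 34                              # roughly the dimensions reported
trait = rng.normal(2.0, 0.8, size=n)        # shared severity per respondent
noise = rng.normal(0.0, 1.3, size=(n, k))   # item-specific noise
scores = np.clip(np.round(trait[:, None] + noise), 0, 4)

alpha_clean = cronbach_alpha(scores)

corrupted = scores.copy()
corrupted[0, 0] = 403.0                     # one wildly out-of-range entry
alpha_bad = cronbach_alpha(corrupted)

print(f"clean alpha:     {alpha_clean:.2f}")
print(f"corrupted alpha: {alpha_bad:.2f}")
```

The single 403 inflates both that one item’s variance and the variance of the total scores, and alpha is built from exactly the ratio of those two quantities, which is why one cell can drag alpha from the .90s down towards the .60s.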

OK, but does this merit a blog post? (Well, I’ve got to start somewhere!) I think there are some points of interest.

  • it shows just how influential a few out of range scores can be
  • it shows that alpha can sometimes detect this and hooray for the people involved that they did calculate alpha and sensed that something was so wrong that they couldn’t just go ahead with the analyses they had planned
  • it does show though that simple range checks on items were a quicker and more certain way of detecting what was at root here
  • it shows that though I think you should always do all the range and coherence checks that make sense for the data …
  • … it’s stronger to have duplicate data entry but which of us can afford this?
  • even if you can do duplicate entry (assuming that the clients complete the measures on paper) you should use a data entry system that as far as possible detects impossible or improbable data at the point of entry
  • (and if you do have direct entry by clients please make sure it does that entry checking and in a user-friendly way)
  • but while absurd sums of money are put into healthcare data systems and into funding psychological therapy RCTs, where is the money to fund good data entry, clinician research and practice based evidence?
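
As a sketch of how little a range check asks of you, here is a hypothetical Python version (the function name and example data are mine, not from the project; CORE-OM items score 0–4):

```python
def check_item_ranges(rows, valid=range(0, 5)):
    """Return (row, item, score) triples for scores outside the valid range,
    skipping missing values (None)."""
    return [
        (r, i, score)
        for r, row in enumerate(rows)
        for i, score in enumerate(row)
        if score is not None and score not in valid
    ]

# The two real errors were 11 and 403 among scores that should be 0-4:
data = [[1, 2, 3], [0, 11, 4], [2, 403, 1]]
print(check_item_ranges(data))   # -> [(1, 1, 11), (2, 1, 403)]
```

A few lines like this, run before any psychometrics, would have caught both errors in the dataset above immediately.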

To finish on a gritty note about data entry: at least twenty years ago, before I discovered S+ and R, I mainly used SPSS for statistics and back then, for a while, SPSS had a “data entry module”. It was slow, which was perhaps why they dropped it, but it was brilliant: you could set up range checks and all the coherence checks you wanted (pregnant male: I think not).

After that died I tended to enter my data into spreadsheets and until about a year ago I was encouraging colleagues I work with around the world to use Excel (yes, I tried encouraging them to use Libre/OpenOffice but everyone had and knew Excel and often weren’t allowed to install anything else). They or I would write data checking into the spreadsheets to the extent that Excel allows and I wrote data checking code in R to double check that and to catch things we couldn’t in Excel. I still use that for one huge project but it’s a nightmare: updates of Windoze and Excel seem to break backwards compatibility, M$’s way of handling local character sets seems to create problems, its data checking seems to break easily and I find it almost impossible to lock spreadsheets so that people can enter data but not change anything else. I’m sure that there are Excel magicians who can do better but I’m equally sure there are better alternatives.

At the moment, with Dr. Clara Paz in Ecuador, we’re using the open source LimeSurvey survey software hosted on the server that hosts all my sites (thanks to Mythic Beasts for excellent open source based hosting). If you have a host who gives you raw access to the operating system, LimeSurvey is pretty easy to install (and I think it runs on nasty closed source systems too!) Its user interface isn’t the easiest but so far we’ve been able to do most things we’ve wanted to with a bit of thought, and the main thing is that it’s catching data entry errors at entry and has proved totally reliable so far.