Reliability

It’s doing what you say you’re going to do, isn’t it? Yes, and I like that sense, but here it’s the psychometric use of the word: its application to measures and measurement.

There is an overlap with that everyday sense in test-retest reliability: a reliable measure of something that shouldn’t change (at least across the test-retest interval) should show little change in scores when the same things are measured twice (or more often). If we really can assume that the true score hasn’t changed across the test-retest interval then we can treat the variation in scores as unreliability. Mind you, test-retest stability/reliability is just one form of reliability estimation.
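
To make that concrete, here is a minimal sketch (in Python, with invented numbers) of the simplest test-retest estimate: correlating two administrations of the same measure for the same people, assuming the true scores really don’t change across the interval.

```python
# Minimal sketch of a test-retest reliability estimate: the same (simulated)
# people complete the same measure twice and we correlate the two sets of scores.
import numpy as np

rng = np.random.default_rng(12345)
n = 200
true_score = rng.normal(50, 10, n)          # assumed stable across the interval
time1 = true_score + rng.normal(0, 5, n)    # observed score = true score + random error
time2 = true_score + rng.normal(0, 5, n)    # fresh random error at retest

retest_r = np.corrcoef(time1, time2)[0, 1]
print(f"Test-retest correlation: {retest_r:.2f}")
# With error SD 5 against true-score SD 10 this should come out near
# 100 / (100 + 25) = 0.80, the classical reliability of each administration.
```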

Details #

More generally, reliability is one of the two primary qualities of a measure, the other being validity. Unreliability is contamination of scores by random error of measurement; invalidity is contamination by systematic error. To take a silly but easy-to-understand example (sketched in code after the following list), say we have an old bathroom scales:

  • If it’s badly calibrated and weighs everything about 5% heavier than it should, then it has poor validity and a systematic bias. This is not a random error of measurement; it’s a ratio bias of +5%.
  • If it’s badly calibrated so that even with nothing on it it shows 200g, and it always weighs everything 200g heavier than it should, then again it has poor validity: it has a constant additive bias of 200g (and you’d have hoped to notice this by checking what it said with nothing on it).
  • If it is inconsistent, perhaps because the dial/pointer is sticky, so that no two measurements of the same thing are exactly the same but vary by a few grams every time something is taken off and put back on, then the scales (probably) have unreliability. You might index this using the standard deviation or variance of the repeated measurements or, if you know the correct weights you are putting on the scales, with the mean squared error.
  • However, you might notice that the variation isn’t random: in fact it’s pretty consistent that the first weighing is about 3% higher than the second. Now we have a systematic effect, not a random one, and we’re back to poor validity rather than unreliability.
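
For anyone who finds a toy simulation helpful, here is a sketch of those four scenarios in Python. The numbers are invented purely for illustration and, unlike our measures, the true weights are known here.

```python
# A toy simulation of the four bathroom-scales scenarios above (all numbers invented).
import numpy as np

rng = np.random.default_rng(42)
true_weight = np.array([70_000, 65_000, 80_000], dtype=float)  # grams

ratio_biased = true_weight * 1.05                 # 1: +5% ratio bias (systematic)
additive_biased = true_weight + 200               # 2: constant +200g bias (systematic)

# 3: random error: repeated weighings of the same objects jitter by a few grams
repeats = true_weight + rng.normal(0, 3, size=(10, true_weight.size))
print("SD of repeated weighings (g):", repeats.std(axis=0, ddof=1).round(1))
print("Mean squared error (g^2):", ((repeats - true_weight) ** 2).mean().round(1))

# 4: an order effect: the first weighing runs about 3% high: systematic, not random
first, second = true_weight * 1.03, true_weight * 1.00
print("First vs second weighing ratio:", (first / second).round(3))
```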

That’s all very neat and actually very helpful in creating “measurement models” of what is going on when we use measures. The measurement models for a bathroom scales are nice and simple as the scales give a single value/score every time something is measured. Measurement models for multi-item questionnaire measures are more complicated as such measures work by adding up multiple different indicators of the thing we want to measure.
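
As a hedged illustration of that multi-item situation, here is a sketch in which each simulated item is treated as an indicator of the same underlying state, the items are summed to a total score, and one common (and much debated) internal consistency index, Cronbach’s alpha, is computed from the item and total variances (see the “Try also” entries below).

```python
# Sketch of a simple measurement model for a multi-item measure: each item is
# the latent state plus item-specific error, the total is the sum of the items,
# and Cronbach's alpha is computed from item variances and the total variance.
# Data are simulated purely for illustration.
import numpy as np

rng = np.random.default_rng(7)
n, k = 300, 10
latent = rng.normal(0, 1, n)                        # the state we want to measure
items = latent[:, None] + rng.normal(0, 1, (n, k))  # item = latent + item-specific error
total = items.sum(axis=1)

item_vars = items.var(axis=0, ddof=1)
alpha = (k / (k - 1)) * (1 - item_vars.sum() / total.var(ddof=1))
print(f"Cronbach's alpha: {alpha:.2f}")
```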

The other complication is that sometimes possible biasing variables/effects are fairly clear, like the difference between the first and second weighing in the example of the scales above. However, often we have many potential, or even highly plausible, biasing effects. A plausible one for bathroom scales might be air temperature. If higher temperatures tend to reduce the weight the scales report then we have a systematic biasing effect. However, if we didn’t know that this was a possible bias and tested the scales with the same range of weights in many settings without recording the air temperatures, we would have rolled the effect of temperature in with the random error and it would be reported as unreliability. If we suspect an effect of temperature we might measure it with each weighing and put it into our measurement model as a potential biasing effect. Pragmatically, this only really matters if the biasing variable is something we could control: so really precise weighing (not with bathroom scales) might be done in a temperature-controlled setting.
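
A small sketch of that point, with invented numbers: if the temperature effect is not recorded it is rolled into the apparent random error and inflates the estimated unreliability; once temperature is measured and modelled, the residual spread shrinks back towards the genuinely random error.

```python
# Sketch: an unrecorded biasing variable (temperature) masquerades as random error;
# modelling it separates the systematic effect from the genuinely random part.
import numpy as np

rng = np.random.default_rng(99)
n = 500
true_weight = 70_000.0                                  # grams, same object every time
temperature = rng.uniform(10, 30, n)                    # degrees C at each weighing
readings = (true_weight
            - 20 * (temperature - 20)                   # systematic temperature effect
            + rng.normal(0, 5, n))                      # genuinely random error

print("SD ignoring temperature (g):", readings.std(ddof=1).round(1))

# Put temperature into the model and look at the residual spread instead
slope, intercept = np.polyfit(temperature, readings, 1)
residuals = readings - (intercept + slope * temperature)
print("Residual SD with temperature modelled (g):", residuals.std(ddof=1).round(1))
```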

Apart from generally using multiple indicator measures, the other complexity in our field is that we almost never have “correct” measures: we don’t have a set of people of known and constant well-being whom we could ask to complete our measures. The difference from calibrating scales is fairly clear and it has a big impact on what is often called “validation” of our measures but which I think should be called “psychometric exploration” (or perhaps just “psychometric evaluation”). Because we don’t have “correct” measures of internal states, not only do we not know in advance whether some things are biasing or not, it may also be that they are valid impacts on the correct scores, not sources of invalidity in the measure.

The commonest example is probably gender. A lot of work on MH/WB/psychological measures aspires to show that they have “measurement invariance” across gender, i.e. that gender doesn’t bias the scores. But what if gender, at least in some cultures and countries, really does affect well-being? Is it likely in most contemporary cultures and countries that people rejecting binary gender classification will have the same well-being as those accepting either the male or the female category? If we choose items for a new measure of well-being so that their combined score shows no effect of gender, is it that we have produced a new, gender-unbiased measure, or is it that we have reduced the domains of well-being that the measure covers, reducing its content validity because we removed items that tapped real effects of gender on well-being (= effects that impact on “real” well-being)? (Please: don’t go creating new measures of well-being for the general or most help-seeking populations, we have enough already … of course if you think we need a new one for, say, a neuro-atypical population, perhaps you have a case.)

A similar issue is that of changes in the real state of participants in test-retest explorations: many issues that concern us are not fixed but likely to vary, perhaps differently, for different people. This means that test-retest reliability explorations are going to confound “real” change and unreliability of measurement. How serious this is will vary with what is being measured and with the test-retest interval.
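
Here is a sketch of that confound with invented numbers: even a perfectly reliable measure would show an imperfect test-retest correlation if true scores really change, and measurement error lowers it further; from two administrations alone we cannot cleanly separate the two.

```python
# Sketch of the confound: if true scores change between occasions then the
# test-retest correlation reflects both that real change and measurement error.
import numpy as np

rng = np.random.default_rng(2024)
n = 300
true_t1 = rng.normal(50, 10, n)
true_t2 = true_t1 + rng.normal(0, 6, n)      # real change across the interval
err_sd = 5                                   # unreliability proper

obs_t1 = true_t1 + rng.normal(0, err_sd, n)
obs_t2 = true_t2 + rng.normal(0, err_sd, n)

r_perfect_measure = np.corrcoef(true_t1, true_t2)[0, 1]   # real change only
r_observed = np.corrcoef(obs_t1, obs_t2)[0, 1]            # real change + error

print(f"Retest r with a perfectly reliable measure: {r_perfect_measure:.2f}")
print(f"Retest r actually observed:                 {r_observed:.2f}")
```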

One final caveat about reliability estimation for our measures, related to the complexities of biasing variables like, potentially, gender, is that our measures are not like scales, bathroom or laboratory. For scales, the only active involvement of the person in creating the score is standing on the scales and remaining fairly still until the measurement stabilises. By contrast, participants completing our measures, interview as well as questionnaire, are actively involved in deciding how to answer every item. This can mean that reliability and validity explorations may come up with markedly different values for the same measure when it is explored in different groups of participants. There is strong evidence that some of us vary our answers more, week on week say, than others, making a single test-retest reliability estimate a bit of a compromise; equally, many other things, surely including forms of “neurodiversity”, distraction, and conscious or subconscious wishes to present ourselves in particular ways, all impact on explorations of our measurement models.

Despite these very real issues, exploration of the reliability of measures, along with exploration of the even more complicated area of validity, including the effects on scores, item and total, of gender and other variables, remains really important in telling us about the likely value to us of scores on the measure. But please let’s stop saying that a measure “has known reliability” or, worse, “has known reliability and validity”: validity is not just one thing and, come to that, neither really is reliability, with the two broad explorations through test-retest and internal reliability (and the third of inter-rater reliability for interview measures).

Try also #

Composite measures
Cronbach’s alpha
Factor analysis
Internal reliability/consistency
Latent variables
McDonald’s omega
Test-retest reliability/stability
Validity

Chapters #

Chapters 3, 4, 7 and 10.

Online resources #

None yet.

Dates #

First created 6.iv.24.
