
Reliability

It’s doing what you say you’re going to do, isn’t it? Yes, and I like it when I meet people who are reliable! However, this is the psychometric use of the word: its application to measures and measurement. In this technical use, reliability covers a number of potential properties of measures and of ways of measuring things.

Details #

In fact that ordinary use of the word gets formalised in psychometrics as test-retest reliability: a reliable measure of something that shouldn’t change, at least not across the test-retest interval, should show little change in scores when the same things are measured twice (or more often). If we really can assume that the true score hasn’t changed across the test-retest interval then we can treat the variation in scores as unreliability. Mind you, test-retest stability/reliability is just one form of reliability estimation.
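
If it helps to see that as numbers, here is a minimal sketch in Python (the scores are invented for illustration, not from any real dataset) of the usual way this is indexed: the correlation between scores from the two occasions.

```python
# Minimal sketch of test-retest reliability: invented scores, not real data.
from statistics import correlation  # needs Python 3.10+

time1 = [12, 18, 7, 22, 15, 9, 19, 14]   # total scores at first completion
time2 = [13, 17, 8, 21, 16, 10, 18, 15]  # total scores at retest

# If we can assume no real change across the interval, the Pearson
# correlation between the two occasions is the usual test-retest coefficient.
print(f"test-retest reliability r = {correlation(time1, time2):.2f}")
```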

More generally, reliability is one of two primary qualities of a measure, the other being validity. Unreliability is contamination of scores by random error of measurement; invalidity is contamination by systematic error. To take a silly but easy to understand example, say we have an old set of bathroom scales:

  • If it’s badly calibrated and weighs everything about 5% heavier than it should, then it has poor validity: a systematic bias. This is not a random error of measurement but a ratio, or multiplicative, bias of +5%, and it will still show zero when there is nothing on it. It actually has good reliability.
  • If it’s badly calibrated so that even with nothing on it it shows 200g, and it always weighs everything 200g heavier than it should, then again it has poor validity: a constant additive bias of 200g (and you’d hope to notice this by checking what it says with nothing on it). Again, it has good reliability.
  • If the scales are inconsistent, perhaps because the dial/pointer is sticky so that no two measurements of the same thing are exactly the same but vary by a few grams every time something is taken off and put back on, then they (probably) have unreliability: random variance contaminating your readings. You might index this using the standard deviation or variance of the repeated measurements or, if you know the correct weights you are putting on the scales, with the “mean squared error” (as in the sketch after this list).
  • However, you might notice that the variation isn’t random: in fact, it’s pretty consistent that the first weight is about 3% higher than the second. Now we have a systematic effect, not a random one, and we’re back to poor validity rather than unreliability. Similarly, it might be that the readings are affected by the ambient temperature and are lower the warmer the surroundings. That is a non-random source of error, so it’s invalidity, not unreliability, but if you don’t notice the relationship with temperature it will seem like unreliability.
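
Here is the sketch promised above: a few simulated weighings in Python showing the three kinds of error. The `weigh()` helper and all the numbers are mine, purely for illustration.

```python
# Hedged sketch of the scales example above: simulated, not real, readings.
import random
import statistics

random.seed(1)
true_weight = 1000.0  # grams; the "correct" value we rarely have for our measures

def weigh(true_g, mult_bias=1.0, add_bias=0.0, noise_sd=0.0):
    """One reading: multiplicative bias, then additive bias, then random error."""
    return true_g * mult_bias + add_bias + random.gauss(0, noise_sd)

# 1. Multiplicative bias (+5%): poor validity but perfectly reliable.
print(weigh(true_weight, mult_bias=1.05))   # always 1050.0
# 2. Additive bias (+200g): again invalid but reliable.
print(weigh(true_weight, add_bias=200.0))   # always 1200.0
# 3. Random error (sticky pointer): unreliability.
readings = [weigh(true_weight, noise_sd=3.0) for _ in range(20)]
print("SD of repeated readings:", statistics.stdev(readings))
mse = statistics.fmean((r - true_weight) ** 2 for r in readings)
print("mean squared error:", mse)  # usable because we know the true weight
```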

OK, we’re not really interested in bathroom scales, but they do provide a clear introduction to “measurement models”: how we understand what is going on when we use measures. The measurement model for bathroom scales is nice and simple as the scales give a single value/score every time something is measured. Measurement models for multi-item questionnaire measures are more complicated as such measures work by adding up multiple different indicators of the thing we want to measure.
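
As a minimal sketch of the simplest such model, classical test theory treats each item score as the person’s true score plus item-specific random error, with the measure’s total being the sum over items. The `item_score()` helper and the numbers below are invented for illustration.

```python
# Sketch of the simplest multi-item measurement model (classical test theory):
# item score = true score + item-specific random error; total = sum over items.
import random

random.seed(2)

def item_score(true_score, error_sd=1.0):
    return true_score + random.gauss(0, error_sd)

def mean_item_score(true_score, n_items=10):
    return sum(item_score(true_score) for _ in range(n_items)) / n_items

# Averaging over items lets the random item errors partly cancel, which is
# why multi-item totals are usually more reliable than single items.
person_true = 5.0
print("one item: ", item_score(person_true))
print("ten items:", mean_item_score(person_true))
```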

The other complication is that sometimes possible biasing variables/effects are fairly clear, like the difference between the first and second weighing in the example of the scales above. However, often we have many potential or even highly plausible biasing effects. As noted above, temperature is one such example. If higher temperatures tend to reduce the weight the scales report then we have a systematic biasing effect. However, if we didn’t know that this was a possible bias and tested the scales with the same range of weights in many settings without recording the air temperatures, we would have rolled the effect of temperature into the random variation and it would be reported as unreliability. If we suspect an effect of temperature we might measure the temperature at the same time as doing each weighing and put it into our measurement model as a potential biasing effect. Pragmatically, this only really matters if the biasing variable is something we can measure or allow for: precise weighing might still be possible if we take ambient temperature into account even though it is distorting the direct readings.
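
A hedged sketch of what that looks like in practice, with entirely simulated numbers: regressing the readings on temperature separates the systematic effect from the genuinely random residue that would otherwise all be reported as unreliability.

```python
# Sketch of putting a suspected biasing variable (temperature) into the
# measurement model. All numbers invented for illustration.
import random
import statistics

random.seed(3)
true_weight = 1000.0

# Simulate readings where warmer surroundings lower the reported weight.
temps = [random.uniform(10, 30) for _ in range(50)]            # degrees C
readings = [true_weight - 0.8 * t + random.gauss(0, 2) for t in temps]

# If we never record temperature, its effect is rolled into "unreliability":
print("SD ignoring temperature:", statistics.stdev(readings))

# Regressing readings on temperature recovers the systematic effect
# (statistics.linear_regression needs Python 3.10+).
fit = statistics.linear_regression(temps, readings)
residuals = [r - (fit.intercept + fit.slope * t) for t, r in zip(temps, readings)]
print("slope per degree C:", fit.slope)             # ~ -0.8: systematic bias
print("residual SD:", statistics.stdev(residuals))  # the genuinely random part
```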

Apart from generally using multiple-indicator measures, the other complexity in our field is that we almost never have “correct” measures: we don’t have a set of people of known and constant well-being we could ask to complete our measures. The difference from calibrating scales is fairly clear and it has a big impact on what is often called “validation” of our measures but which I think should be called “psychometric exploration” (or perhaps just “psychometric evaluation”). Because we don’t have “correct” measures of internal states we not only don’t know in advance whether some things are biasing or not, it may also be that they are valid influences on the true scores, not sources of invalidity in the measure.

The commonest example is probably gender. A lot of work on MH/WB/psychological measures aspires to show that they have “measurement invariance” across gender, i.e. that gender doesn’t bias the scores. But what if gender, at least in some cultures and countries, really does affect well-being? Is it likely in most contemporary cultures and countries that people rejecting binary gender classification will have the same well-being as those accepting either the male or the female category? If we choose items for a new measure of well-being so that their combined score shows no effect of gender, have we produced a new, gender-unbiased measure, or have we reduced the domains of well-being that the measure covers, reducing its content validity by removing items that tapped real effects of gender on well-being (i.e. effects on “real” well-being)? (Please: don’t go creating new measures of well-being for the general or most help-seeking populations, we have enough already … of course if you think we need a new one for, say, a neuro-atypical population, perhaps you have a case.)

A similar issue is that of changes in the real state of participants in test-retest explorations: many things that concern us are not fixed but likely to vary, perhaps differently for different people. This means that test-retest reliability explorations will confound “real” change and unreliability of measurement. How serious this is will vary with what is being measured and with the test-retest interval.
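
A toy sketch of that confounding (all numbers invented): the same measure, with exactly the same random error of measurement, produces a lower test-retest correlation when the true state drifts between the two occasions.

```python
# Sketch of how real change deflates a test-retest "reliability" estimate.
# Invented data; statistics.correlation needs Python 3.10+.
import random
import statistics

random.seed(4)
n = 200
true1 = [random.gauss(20, 5) for _ in range(n)]

def observed(t):  # measure = true score + random error of measurement
    return t + random.gauss(0, 2)

# Stable construct: true scores unchanged between occasions.
stable = statistics.correlation([observed(t) for t in true1],
                                [observed(t) for t in true1])
# Changing construct: true scores drift between occasions.
drifted = [t + random.gauss(0, 4) for t in true1]
changing = statistics.correlation([observed(t) for t in true1],
                                  [observed(t) for t in drifted])

print(f"r when nothing really changed: {stable:.2f}")    # higher
print(f"r when true state changed too: {changing:.2f}")  # lower: confounded
```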

One final caveat about reliability estimation for our measures is that they are not like scales, bathroom or laboratory. For scales, the only active involvement of the person in creating the score is standing on the scales and remaining fairly still until the measurement stabilises. By contrast, participants completing our measures, interviews as well as questionnaires, are actively involved in deciding how to answer every item. This can mean that reliability and validity explorations may come up with markedly different values for the same measure when explored in different groups of participants. There is strong evidence that some of us vary our answers, week on week say, more than others, making a single test-retest reliability estimate a bit of a compromise; equally, many other issues, surely including forms of “neurodiversity”, distraction, and conscious or subconscious wishes to present ourselves in particular ways, all impact on explorations of our measurement models.

Despite these very real challenges, exploration of the reliability of measures, and of the even more complicated area of validity, remains really important in telling us about the likely value of scores on a measure. Those explorations should probably include the effects on scores, both item scores and total scores, of gender, age and a variety of other variables. But please let’s stop saying that a measure “has known reliability” or, worse, “has known reliability and validity”. These are mantra reassurances, not real information.

Try also #

Composite measures/scores
Cronbach’s alpha
Factor analysis
Internal reliability/consistency
Latent variables
McDonald’s omega
Test-retest reliability/stability
Validity

Chapters #

Chapters 3, 4, 7 and 10.

Online resources #

None yet.

Dates #

First created 6.iv.24, tweaked to add links 3.ix.25.
