
Stanines & the stanine transformation/scoring

I remember reading about these, hm, some time in the 1980s I suspect, when I thought they were a rather odd idea. I have just been reminded of them because I had been doing entries like the five number summary and seven number summary, and a note on Wikipedia pointed me back to stanines. I’ve never seen them used in our fields but I am now rather more sympathetic to the motivations behind them and felt they ought to be in the glossary.

Details #

I am drawing heavily on the excellent, and short, page on Wikipedia: https://en.wikipedia.org/wiki/Stanine but I’ll summarise it here in case it changes a lot!

  • It is a way of transforming a continuous score to a nine level score (hence stanine, from STAndard NINE).
  • It’s a mapping based on quantiles/percentiles
  • The stanines are created by ranking (referential) data (there is a small R sketch of the mapping after this list), so that:
    • the lowest 4% get a score of 1
    • the next 7%, i.e. up to the 11th percentile, get a score of 2
    • the next 12%, up to 23%, score 3
    • the next 17%, up to 40%, score 4
    • the next 20%, up to 60%, i.e. the middle 20% around the median, score 5
    • the next 17%, up to 77%, score 6
    • the next 12%, up to 89%, score 7
    • the next 7%, up to 96%, score 8
    • the last 4% score 9
  • The logic behind those funny percentages was that they pretty much divide a Gaussian distribution into .5 SD steps (except for the end ones that sort of mop up what’s left at the edges).
  • That means, as you can see from the percentages, that the nine scores won’t all contain the same proportion of the score distribution.
  • It was first used by the U.S. Army Air Forces in WWII “for test scores” (I’d be interested to know what tests).
  • There are two stories about the logic. One is that, as for any nine level transformation in a decimal number system, it gets you the most levels you can have from a single digit, so spares you using two digits. The other, which I love, is that it would “reduce the tendency to try to interpret small score differences” (apparently that came from Thorndike, one of the key figures in mid-20th century psychometrics).
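
For anyone who wants to play with this, here is a minimal R sketch of the mapping just described (my own code, not from any package): it cuts raw scores at the empirical quantiles marking the cumulative 4%, 11%, 23%, 40%, 60%, 77%, 89% and 96% points.

```r
## Minimal sketch (my own function, not from any package) of the stanine
## mapping: cut raw scores at the empirical quantiles marking the cumulative
## 4%, 11%, 23%, 40%, 60%, 77%, 89% and 96% points.
stanine <- function(x) {
  cuts <- quantile(x,
                   probs = c(.04, .11, .23, .40, .60, .77, .89, .96),
                   na.rm = TRUE)
  ## findInterval() returns 0 for scores below the lowest cut, so add 1
  ## to get stanines running from 1 to 9
  findInterval(x, cuts) + 1
}

set.seed(12345)
x <- rnorm(10000)                      # scores from a standard Gaussian
round(table(stanine(x)) / length(x), 3)
## proportions should come out close to
## .04, .07, .12, .17, .20, .17, .12, .07, .04
```

The findInterval() call is just one way of doing the binning; cut() with the same quantiles as breaks would do the same job.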

Looking at this now, at least thirty years since I first saw it, there are four interesting points about stanines for me, and they touch on why I think I’ve never seen them used in the therapy change measurement world.

  1. There is a logic that if you transform a score that has more than nine levels you are throwing away information. Even if the reliability of the measure is such that there is little informational value in, say, a score difference that would map to the same stanine score, it is still true that you lose a small amount of statistical precision, so if you really need to know whether a mean difference between two groups is important or not then you shouldn’t rescale/transform to a small number of levels. (The ultimate example of this is dichotomising scores, which we do so often and which can lose a huge amount of statistical power/precision; transforming to nine levels loses very little by comparison. There is a small simulation sketch of this after the list.)
  2. However, Thorndike’s point is important: they probably didn’t want air force personnel arguing because one was triumphant about scoring, say, 73 on some measure (scored, say, from 0 to 85) while another scored 71, when the difference between 73 and 71 was extremely likely to have been down to imprecision of measurement. I think our field often overvalues small differences between scores, particularly, going back to dichotomising, when one score is one point above a cut-off and the other one point below it.
  3. The other important point for our field is that we are often interested in changes in scores over time for just one person. Here all the psychometric and statistical tools that are based on the assumption that measures work in exactly the same way for any individual (as, say, a reasonable weighing scale, height measure or blood test will) may be wrong, and we might be throwing away a moderately serious amount of meaningful information if we transformed even down to nine levels. But that opens up a lot of issues that go beyond this glossary entry!
  4. The Wikipedia entry is written as if the stanine principle were a data transform, i.e. as if you use the centiles in your own data to convert raw scores to stanine scores. However I suspect that in the educational world, where stanines are used quite a lot, the mapping will be based on quantiles from large referential samples/datasets. For the scores that mark the 4%, 11%, 23% … 96% percentile boundaries not to fluctuate much from one sample/dataset to the next, you need your referential sample/dataset to be very large, into many thousands. Our field very, very rarely has such datasets even if we were keen to have stanine mappings for our measure scores.
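
To put some illustrative numbers on point 1, here is a small simulation sketch (entirely my own, with made-up figures: two groups of 50 differing by half an SD on a Gaussian score) comparing how often a simple t-test detects the difference when the score is analysed raw, stanine transformed, or dichotomised at the median.

```r
## Sketch (my own, illustrative numbers only): two groups of n = 50 differing
## by 0.5 SD on a Gaussian score, analysed three ways over many replications.
set.seed(12345)
n <- 50
stanineProbs <- c(.04, .11, .23, .40, .60, .77, .89, .96)
oneRep <- function() {
  scores <- c(rnorm(n, mean = 0), rnorm(n, mean = 0.5))
  group <- rep(0:1, each = n)
  st <- findInterval(scores, quantile(scores, probs = stanineProbs)) + 1
  dich <- as.numeric(scores > median(scores))   # median split dichotomisation
  c(raw = t.test(scores ~ group)$p.value < .05,
    stanine = t.test(st ~ group)$p.value < .05,
    dichotomised = t.test(dich ~ group)$p.value < .05)
}
## proportion of 2000 replications giving p < .05 for each analysis
colMeans(t(replicate(2000, oneRep())))
```

The stanine analysis should give up only a little power relative to the raw score, while the median split typically gives up far more.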

Even small print: a sten transformation/score (Standardised TEN, I assume) is like a stanine but has ten levels (so losing the advantage that you can code it with a single digit).

Try also #

Chapters #

Not covered in the OMbook.

Online resources #

My rblog post about dichotomisation explains the issue about dichotomising noted above in more detail.

My shiny apps include one (ECDF plot with quantiles and CIs for quantiles) that allows you to upload data and shows the imprecision of estimating quantiles. If you put in:
.04, .11, .23, .40, .60, .77, .89, .96
for the quantiles you want, you are asking for a stanine mapping of your data. You could pull these CSV files, which contain datasets from a standard Gaussian distribution, to your machine and upload them to that app to see the effects of dataset size:
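
If you would rather generate such datasets yourself, here is a small R sketch (my own, and the file names are purely illustrative) that writes standard Gaussian samples of three different sizes to CSV files and prints the estimated stanine boundary quantiles for each, so you can see directly how much they wobble at small n.

```r
## Sketch (my own; the file names are purely illustrative): write standard
## Gaussian datasets of three sizes to CSV and show how the estimated
## stanine boundary quantiles fluctuate with dataset size.
set.seed(12345)
stanineProbs <- c(.04, .11, .23, .40, .60, .77, .89, .96)
for (n in c(100, 1000, 10000)) {
  x <- rnorm(n)
  write.csv(data.frame(score = x),
            file = paste0("gaussian_n", n, ".csv"),
            row.names = FALSE)
  cat("n =", n, "\n")
  print(round(quantile(x, probs = stanineProbs), 2))
}
```

With n = 100 the outer boundary estimates can easily be a fifth of an SD away from their population values; with n = 10,000 they are far more stable.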

I might create some more things touching on this, though I think they’d be explanatory rather than of real use for routine data … for the reasons above, including that I’ve never seen stanines used in our field!

Dates #

First created 12.iv.25, updated 1.v.25 adding a link to the sten score glossary entry.
