Mokken scaling

Mokken scaling is one of a number of “Item Response Theory” (IRT) ways of analysing questionnaire responses.

Details #

Unlike “Classical Test Theory” ways of looking at questionnaire response data which assume that the numbers allocated to responses, i.e. 0 for “Not at all”, 1 for “Only occasionally”, 2 for “Sometimes” and so on, can be treated as interval scaling numbers i.e. can be treated as if the difference between 0 and 1 is the same as between 1 and 2 even if it can’t be assumed that the zero is a true zero or that 2 is twice the magnitude of 1. That is to assume that the numbers are like Centigrade or Fahrenheit temperature measurement which have those properties (unlike measuring temperature in Kelvin where zero is a true zero and unlike measurement of height or weight or blood concentrations all of which have the true zero and ration scaling).

Clearly assuming that these numbers have interval scaling is a dubious assumption. IRT proceeds differently simply assuming that across any items that form a unidimensional scale (i.e. only relate to one latent variable on which people differ though contaminated with “error”) can be each be seen to have different “item response” profiles (hence the “IR” in “IR”). The basic assumption in IRT is that people differ on the latent variable the scale seeks to measure: some have more of the variable, some less and that the items differ in how it becomes more likely that people with more of the latent variable are more (or less if the item is cued the other way) likely to agree to the item. (That’s the model for binary response, i.e. “yes”/”no” items, the mappings are more complex where there are more than one response category for each item but the theory is similar.)

Different IRT models have different assumptions about those profile shapes, i.e. about how the likelihood of a positive response increases in relation to the person’s position on the latent variable. (More positive response if the scale has ordinal rather than binary response categories). Mokken scaling differs from other IRT models in having a non-parametric expectation of the response curve rather than one defined by assumptions about the shape and one, two or three defining parameters for each item. The only assumption in Mokken scaling is that the response curve is monotonic, i.e. for any increase in how much of the latent variable the person has, it becomes more likely that they will respond positively (more positively) to any item (assuming all items have been scored to the cued the same way).

Mokken scaling has two forms but both share that “monotonic” response curve expectation. In the “double monotonicity” model the assumption is that the response curves for the different items never cross. Here’s a simulated example:

In the other, the “monotone homogeneity” model the model still requires that the response curve be monotone: always increasing between people having stronger scores on the latent variable. However, this model allows that some items may have a weaker relationships with the latent variable than other items. This means that response curves for items may cross with those of other items. This is shown here

There item two has a weaker relationship with the latent variable than item one though both still have the required monotone increasing probability of a person answering “yes” to the item the stronger the true score of the person on the latent variable.

[Both those examples are created using a parametric model for the response curves hence the nice ogive curves, in the Mokken model the curves wouldn’t have to be so smooth but they must be monotone.]

“Local stochastic independence” #

The one crucial requirement for estimation of the Mokken characteristics of a measure to be possible is that all the items of the measure “are locally stochastically independent”. This is essentially saying that the likelihoods of any pair of items being answered positively is only affected by the latent variable and not by any other aspect of the participants or the items, i.e. the items must be pure measures of the latent trait and uncontaminated by any other source of variance. I can see that this is necessary for the maths of Mokken scaling (and any IRT model I think) to work but it’s very unlikely to be true for any measure in our areas. This in no way invalidates using Mokken scaling or other IRT: they are clearly very useful if your measure is not far off undimensional (or if the items you analyse from it are nearing unidimensional) and if contributions to responding from other variables are small (which pretty much says the same thing in different words). When invalidating variance and deviations from unidimensinality are small then these methods can really help us explore how plausible the mapping of responses to scores looks. However, for me this is rather like seeing how implausible the assumptions of CTT look: IRT of all types is dependent on another set of assumptions, amounting to this “local stochastic independence” which are just as necessary for IRT to work as the assumptions of CTT are for it to work but all these assumptions are never going to fit perfectly for our measures and the realities of measuring change in psychosocial interventions. We need to use them with humility rather than hubristic excitement about their apparent power to give us precise measurement of things that are never directly measurable.