Statistical inference and estimation of internal reliability

Information from old site created 09.x.14, relocated and rephrased here 3.i.19, updated 4.i.19.  All content is made available under a Creative Commons License. Please feel free to reuse anything here but respect the licence, i.e. give attribution back to here. 

If you use the same multi-item instrument in two different samples you should never assume that reliability is the same in both: the whole idea that multi-item measures such as questionnaires and many rating scales have a fixed internal reliability is rubbish.  Sentences like “Faust’s measure of grandiose impulses has proven reliability, Cronbach alpha .93 (Machiavelli and Evans, 1992)” really ought to be banned from all respectable journals: they are close to nonsense and part of the general pretence that such measures are far more certain, and far more like the measures of the natural sciences, than they actually are.

As with any other sample statistic, that (fictional!) .93 comes from a sample.  If the sample was large and drawn in some sensible enough way it may reflect the population value quite precisely, but it should still carry a confidence interval like any other sample statistic.  Sentences like that are typical of papers in which the measure is being reused in a new sample.  If the new sample and the reference one come from the same population, or are not radically dissimilar in composition, then the internal reliability in the new sample may differ only trivially from that in the cited paper, though even a trivial difference can be statistically significant if the samples are huge.
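To make the point concrete, here is a minimal sketch of Feldt’s parametric confidence interval for a reported alpha (the approach summarised in Feldt, Woodruff & Salih, 1987).  The forms described below use R; this is Python purely for illustration, and the alpha of .93, n of 200 and 20 items are all hypothetical numbers:

```python
# Feldt's parametric CI for Cronbach's alpha, needing only the reported
# alpha, the sample size n and the number of items k.
# It uses the result that (1 - alpha_hat)/(1 - alpha) follows an F
# distribution with n - 1 and (n - 1)(k - 1) degrees of freedom.
from scipy.stats import f

def feldt_ci(alpha, n, k, level=0.95):
    """Feldt parametric confidence interval for a sample Cronbach alpha."""
    df1, df2 = n - 1, (n - 1) * (k - 1)
    p = (1 - level) / 2
    lower = 1 - (1 - alpha) * f.ppf(1 - p, df1, df2)
    upper = 1 - (1 - alpha) * f.ppf(p, df1, df2)
    return lower, upper

# hypothetical reported values: alpha = .93 from n = 200 on a 20-item scale
lo, hi = feldt_ci(0.93, n=200, k=20)
```

Even with a respectable n of 200 the interval is clearly not a point: the reported .93 is an estimate, not a property of the instrument.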

However, reliability differences can be quite large if you are using the measure in very different samples.  A good example is using a measure across cultures, as we showed rather nicely (I think) back in the last century (Evans, Dolan & Toriola (1997). Detection of intra- and cross-cultural non-equivalence by simple methods in cross-cultural research: evidence from a study of eating attitudes in Nigeria and Britain. Eating and Weight Disorders, 2, 67-78).  Another example is using a measure of something gender sensitive, like body image, in men and women.

In these situations we should always compare the sample internal reliability values.  A good diagnostic check that will detect problems is to test the Cronbach alpha values in each sample.  Pretty much all statistics packages will compute alpha, and SPSS has actually given it a 95% confidence interval for ages now.  (It’s the CI of the “two-way mixed, consistency” ICC in RELIABILITY.)  However, as far as I know, still none of the major packages offers a test of the difference between two alpha values, despite Feldt having worked out the parametric statistics of the test in 1969 (see the readable summary in Feldt, Woodruff & Salih (1987). Statistical inference for coefficient alpha. Applied Psychological Measurement, 11(1), 93-103).
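For reference, alpha itself is easy to compute from raw item data.  A minimal Python sketch of the standard variance formula (the function name and toy data here are mine, not from any particular package):

```python
# Cronbach's alpha: (k/(k-1)) * (1 - sum of item variances / variance of
# the total score), computed from an (n respondents x k items) matrix.
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for an (n respondents x k items) score matrix."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)       # per-item sample variances
    total_var = items.sum(axis=1).var(ddof=1)   # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# sanity check: three perfectly correlated "items" give alpha of exactly 1
x = np.random.default_rng(0).normal(size=(50, 1))
demo_alpha = cronbach_alpha(np.hstack([x, x, x]))
```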

To remedy this I’ve created an online form you can use to carry out Feldt’s test.  That covers formal testing of the statistical significance of a difference, but we also want confidence intervals around individual values; for those, use this form, which gives you the confidence interval.
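The test itself is simple.  A sketch in Python (not the code behind the forms, which use R), assuming the Feldt (1969) result that the ratio (1 − α̂₁)/(1 − α̂₂) from two independent samples is referred to an F distribution with n₁ − 1 and n₂ − 1 degrees of freedom; the alphas and sample sizes below are hypothetical:

```python
# Feldt's test of the difference between two Cronbach alphas from
# independent samples, needing only the two alphas and the two n values.
from scipy.stats import f

def feldt_test(alpha1, n1, alpha2, n2):
    """Two-sided Feldt (1969) test of two independent sample alphas."""
    w = (1 - alpha1) / (1 - alpha2)           # test statistic
    p_upper = f.sf(w, n1 - 1, n2 - 1)         # upper tail probability
    return w, 2 * min(p_upper, 1 - p_upper)   # two-sided p value

# hypothetical example: .93 from n = 200 against .80 from n = 150
w, p = feldt_test(0.93, 200, 0.80, 150)
```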

Both forms use R to do the calculations and use Feldt’s parametric methods, i.e. they assume Gaussian distributions.  For the last few years I have instead used R to get non-parametric bootstrap confidence intervals (and the bootstrap CI for the difference between alphas in two independent samples, if I have the data for both).  However, across quite a lot of those computations to date I have found the differences between the bootstrap and the parametric intervals to be small, and the great advantage of the parametric approach is that you only need the observed, or reported, alpha value, the n and the number of items.
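A minimal sketch of the sort of percentile bootstrap CI I mean, resampling respondents with replacement (in Python here rather than R, and with simulated data purely for illustration):

```python
# Percentile bootstrap CI for Cronbach's alpha: resample respondents
# (rows) with replacement, recompute alpha each time, and take the
# empirical percentiles of the bootstrap distribution.
import numpy as np

def bootstrap_alpha_ci(items, n_boot=2000, level=0.95, seed=1):
    """Percentile bootstrap CI for Cronbach's alpha from raw item data."""
    rng = np.random.default_rng(seed)
    items = np.asarray(items, dtype=float)
    n, k = items.shape

    def alpha(m):
        return (k / (k - 1)) * (1 - m.var(axis=0, ddof=1).sum()
                                / m.sum(axis=1).var(ddof=1))

    boots = [alpha(items[rng.integers(0, n, n)]) for _ in range(n_boot)]
    return np.percentile(boots, [100 * (1 - level) / 2,
                                 100 * (1 + level) / 2])

# simulated data: one latent trait plus item-level noise (illustrative only)
rng = np.random.default_rng(0)
latent = rng.normal(size=(60, 1))
items = latent + rng.normal(scale=0.5, size=(60, 4))
lo, hi = bootstrap_alpha_ci(items, n_boot=500)
```

Unlike the parametric approach, this needs the raw item-by-respondent data, which is exactly why the parametric interval remains so useful when all you have is a published alpha.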

Below are links to some things that are probably of only minor or historical interest, but which use Feldt’s methods.