Null hypothesis significance testing (NHST) paradigm

This, essentially the same as “inferential testing” (q.v.), is the paradigm which has dominated quantitative research in the fields of psychology, mental health (MH) and therapies generally. NHST has some great strengths but it’s generally used without any real thinking about what it involves. Since 2016, when the American Statistical Association published its “P-Value Statement”, it has become rather trendy to decry the NHST but, as so often, there’s nothing wrong in the maths: it’s what researchers have done with it over the century or so since it was invented/discovered that is the problem. Our fields seem ambivalent about tools that help us understand, but which don’t remove, uncertainties. Too often publications claim, or hint at, having established certainties, and hiding uncertainties is endemic. I don’t think we’re going to fix that for another century but understanding what the NHST does and what it doesn’t do may help!

Details #

The basic idea is beautifully simple: you have a question and you collect data to address the question. In the real world there’s randomness and complexity in what’s going on, and generally what is going on is more complex than in many physical science questions. However, you have some quantification, so your exploration is not purely qualitative, not purely narrative. (Tangentially, the misrepresentation of, and overvaluing of, the NHST has almost certainly been at the heart of much denigration of qualitative/narrative methods.)

This basic idea in the NHST can be a great tool to say whether the data support the interpretation that something systematic may be going on within the random noise and complexity.

How does it work? #

You start by framing the question well (it may or may not be particularly interesting or important). Let’s say you want to know whether a counselling intervention is associated with a fall in HbA1c, a measure of diabetic control, in a sample of people with poorly controlled diabetes mellitus. Of course you’re not just interested in that sample: you want to make principled generalisations beyond your sample, and that’s what the NHST should help with.

Let’s say the intervention is fortnightly over six months. You maybe have the resources to offer the intervention to 30 clients and to measure their HbA1c at the start and end of the intervention. You know that the measurement is fairly reliable and valid but, like all such measurements, not perfect: there will be some measurement unreliability, i.e. some random noise in the values. More importantly (probably) you know that the clients will vary considerably in their starting HbA1c and will show some variations in their HbA1c levels from month to month for many reasons that you can’t measure. Almost certainly there will also be individual differences in their responses to the intervention, so there is more noise entering the system there. However, in the perfect world of thought experiments, every client completes the intervention and you end up with 30 pairs of HbA1c values. As we’re creating a realistic thought experiment, though, there is a lot of variety in the changes in values. How do you decide what to make of these 60 numbers you have?

The NHST method starts with a null hypothesis: an assumption, a model, of what might be going on. The elegance of the “null” is to assume that nothing is going on in the (essentially theoretical) population of all clients who could ever have this intervention and have their HbA1c measured twice at that interval around the intervention. That null hypothesis is made precise by assuming that it applies across that infinite (I said this was theoretical) population, so across that population the mean change would be exactly zero, and that the mean change you saw in your sample of 30 differed from zero (it will, even in thought experiments that acknowledge noise and complexity) only because of sampling vagaries: you only had those particular 30 participants and of course their mean change isn’t exactly zero.

In this model you can use the data you have to say how likely it is that you would have seen a mean change differing from zero by as much as, or more than, yours did (in either direction: increase or decrease) had only randomness been affecting things, i.e. if the intervention actually has zero association with change in HbA1c over that period.

As well as the null hypothesis you need a few other key hypotheses to create a complete mathematical model here:
(1) independence of observations (q.v.)
(2) random construction of the sample: that these 30 really are a random sample from the infinite population (always untrue in our world, less so in some other experimental situations)
(3) that all people may vary in their HbA1c response to the intervention but that their differences are of degree, not of undergoing different processes such that you actually have multiple populations with different responses. In our world that may or may not be true, see “mixture models”. You can be a nihilist and argue that this is unlikely to be true for any psychological process of interest in real human beings; however, unless narrowed down a bit, that leaves you throwing out essentially any systematic exploration of anything to do with humans.
(4) that the effect and the randomness have distributions that you can build into the maths. Traditionally, that was the Gaussian (see parametric tests), or else you had to reframe your null hypothesis to be about something other than the mean change (see non-parametric tests and the Mann-Whitney or Wilcoxon tests, and bootstrapping, but that’s getting away from our focus here).

OK, sorry about the small print but it is vital. Having made all those other assumptions you can now do the maths and get the probability that, if the null model were true, you would have seen an absolute (i.e. regardless of sign) mean change as great as or greater than you did. For the model above this is the famous “paired t-test” (it’s paired because you are looking just at changes: the differences within pairs of values).

Now the final step: before you did any of this you said how small that probability would have to be for you to “reject the null hypothesis” and hence, of necessity, accept the alternative hypothesis. That brings us to the alternative (I think “alternate” in US English). This is simply the complement of the null hypothesis: the alternative model that, with the null, covers all possibilities. Hence the alternative is that there would be some non-zero mean change in HbA1c across the intervention if you could have the data from all the members of the infinite population undergoing the intervention.

Note that quantitatively, the null model is a “point null”: it is precisely that there is zero effect in the population. The alternative is the single logical alternative, i.e. that there is some population effect. However, that is not a point model quantitatively: it is all other possibilities from a tiny mean change positively to a huge mean change positively and from a tiny mean change negatively to a huge mean change negatively. Rejecting the null model is a logical decision but it doesn’t tell us anything quantitatively about the population mean.

So what probability do you set that is sufficiently small that you will reject the null hypothesis? In an intriguing convention, said to go back to the explanation of the t-test to the directors of the Guinness brewery in Dublin early in the 20th Century, this is almost always set at .05, hence “p < .05” (a one in twenty risk of rejecting the null when it is in fact the correct model for the population).
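To make the steps above concrete, here is a minimal Python sketch of the paired t-test decision for the thought experiment. Everything numeric in it is an assumption for illustration only: the HbA1c values (in % units) are simulated, the population mean change of -0.3 and the standard deviations are invented, and `paired_t` is a hypothetical helper, not part of any library. The value 2.045 is the two-sided 5% critical value of the t distribution with 29 degrees of freedom; in practice you would probably let a statistics package (for example `scipy.stats.ttest_rel`) compute the p value for you.

```python
import math
import random
import statistics


def paired_t(before, after):
    """Return (mean change, t statistic, degrees of freedom) for paired data."""
    changes = [a - b for b, a in zip(before, after)]
    n = len(changes)
    mean_change = statistics.mean(changes)
    # Standard error of the mean change, using the sample SD (n - 1 denominator)
    se = statistics.stdev(changes) / math.sqrt(n)
    return mean_change, mean_change / se, n - 1


# Simulated, purely illustrative data: 30 clients, invented parameters
rng = random.Random(1)
before = [rng.gauss(9.0, 1.2) for _ in range(30)]
after = [b + rng.gauss(-0.3, 1.0) for b in before]

T_CRIT_DF29 = 2.045  # two-sided 5% critical value of t, 29 degrees of freedom

mean_change, t_obs, df_obs = paired_t(before, after)
# Reject the null only if |t| is beyond the critical value set in advance
reject_null = abs(t_obs) > T_CRIT_DF29

print(f"mean change = {mean_change:.2f}, t({df_obs}) = {t_obs:.2f}")
print("reject the null" if reject_null else "do not reject the null")
```

Whether this particular simulated sample rejects the null depends on the random draw, which is rather the point: the decision is about the sample you happened to get.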

What’s the problem? #

There’s absolutely nothing wrong with the maths or the models: they give us a logical, mathematically correct way to decide if data suggest something systematic is going on in the population. I.e. they are helping us understand the risks we run in making the “generalisations” (q.v.) from the data we have that we choose to make.

However, the model never gives us certainties, it can’t: we only have a finite sample of data and complex processes and noise in the system and in the measures. It just allows us to state our risk of a “type I” error: of deciding to reject the null model when it was actually true in the population (which we had, following the Guinness brewery directors, set at a one in twenty risk).
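That one in twenty risk can be checked by simulation when all the model assumptions hold. The sketch below is only an illustration under invented settings: it draws many samples of 30 change scores from a Gaussian population whose true mean change is exactly zero (the SD of 1.0, the number of simulations and the helper name `type_i_error_rate` are my assumptions) and counts how often the paired t-test rejects the null at the 5% level.

```python
import math
import random
import statistics

N = 30               # clients per simulated study
T_CRIT_DF29 = 2.045  # two-sided 5% critical value of t, 29 degrees of freedom


def type_i_error_rate(sd_change=1.0, n_sims=5000, seed=42):
    """Share of simulated null studies in which the null is (wrongly) rejected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        # The null is true here: the population mean change is exactly zero
        changes = [rng.gauss(0.0, sd_change) for _ in range(N)]
        se = statistics.stdev(changes) / math.sqrt(N)
        if abs(statistics.mean(changes) / se) > T_CRIT_DF29:
            rejections += 1
    return rejections / n_sims


rate = type_i_error_rate()
print(f"proportion of false rejections: {rate:.3f}")
```

The proportion should come out close to .05, but only because the simulated data meet every assumption of the model, which real therapy data rarely do.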

As it stands there, the method doesn’t tell you your other risk, that of a “type II” error: of accepting the null even if, in the population, there is some effect. That risk can be estimated, using the same model, for whatever quantitative non-null model you assume for the population: so-called “statistical power” estimation (q.v.).
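A rough sketch of how that other risk can be estimated by simulation follows. The three population mean changes (-0.2, -0.5 and -0.8), the SD of 1.0 and the helper name `estimated_power` are all assumptions chosen for illustration; the point is only that power, and hence the risk of missing a real effect, depends on the non-null model you are willing to assume.

```python
import math
import random
import statistics

N = 30               # clients per simulated study
T_CRIT_DF29 = 2.045  # two-sided 5% critical value of t, 29 degrees of freedom


def estimated_power(true_mean_change, sd_change=1.0, n_sims=5000, seed=7):
    """Share of simulated studies rejecting the null for an assumed true effect."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        changes = [rng.gauss(true_mean_change, sd_change) for _ in range(N)]
        se = statistics.stdev(changes) / math.sqrt(N)
        if abs(statistics.mean(changes) / se) > T_CRIT_DF29:
            rejections += 1
    return rejections / n_sims


# Assumed (invented) population mean changes, in SD units of the change scores
power = {effect: estimated_power(effect) for effect in (-0.2, -0.5, -0.8)}
for effect, p in power.items():
    print(f"true mean change {effect}: power = {p:.2f}, type II risk = {1 - p:.2f}")
```

With only 30 clients a small true effect will usually be missed, so “we did not reject the null” is weak evidence that there is nothing going on.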

For some situations, where the sampling is theoretical, or in simulations, the model may be all we need to make robust (q.v.) decisions, robust generalisations, to suggest probably useful implications. However, as you can see from the list of requirements for the model to fit the exploration, we almost never have all of them for MH/therapy data so our interpretations of the test must always be given some narrative, and perhaps some quantitative (see “sensitivity analysis” and robustness) exploration and cautionary qualification.

This brings us to the two huge problems with the NHST: those for individual reports and, more seriously, the way the method became a dominant paradigm in our field and reinforced the promotion of one method, the randomised controlled trial, way beyond the robust interpretability of the findings.

Problems at the level of the individual study #

For individual reports the problem is that studies that used NHST methods, appropriately or less so, almost always ignore all the assumptions that build to the NHST decision. Perhaps unfairly, I’m going to select one example:
Carmichael et al. (2021). Short-term outcomes of pubertal suppression in a selected cohort of 12 to 15 year old young people with persistent gender dysphoria in the UK. PLOS ONE, 16(2), e0243894.
This is an important and perhaps influential look at routine change data. The results include this:

For the CBCL total t-scores, there was no change from baseline to 12, 24 or 36 months. Similarly for the YSR total t-score, there was no change from baseline to 12 or 24 months;

The pairs of change scores at 12, 24 and 36 months numbered 41, 20 and 11 of the original 44 participants. No-one will be surprised to know that it was not true that there was “no change” across these 144 pairs of scores: there was change. It would have been a miracle if all 144 pairs of scores had shown no change. This is an extreme example of claiming to have “proved the null”, where the authors have not just implied that they have shown there was no change in the mean (not actually true either, but the changes were small) but gone a step further to say something completely incompatible with the findings they themselves report. This should probably have been reported as
“For neither the CBCL nor the YSR total t-scores was there any statistically significant change in the mean score from baseline to 12, 24 or 36 months using paired t-tests.”

This is an extreme example, but less extreme and equally misleading examples are common. Papers carefully spelling out all the reservations about generalising from a single set of data and an NHST are actually a tiny minority.
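The difference between “no change” and “no statistically significant change in the mean” can be shown with a tiny sketch. The eleven change scores below are invented for illustration and are not the data from the study cited above; 2.228 is the two-sided 5% critical value of the t distribution with 10 degrees of freedom.

```python
import math
import statistics

# Invented change scores for 11 hypothetical participants (NOT the study data)
changes = [-9, -6, -4, -2, -1, 1, 2, 3, 5, 7, 8]

T_CRIT_DF10 = 2.228  # two-sided 5% critical value of t, 10 degrees of freedom

n = len(changes)
mean_change = statistics.mean(changes)
se = statistics.stdev(changes) / math.sqrt(n)
t_obs = mean_change / se

# How many individuals actually changed, versus what the test says about the mean
n_changed = sum(1 for c in changes if c != 0)
significant = abs(t_obs) > T_CRIT_DF10

print(f"{n_changed} of {n} participants changed")
print(f"mean change = {mean_change:.2f}, t = {t_obs:.2f}, significant: {significant}")
```

Here every one of the invented participants changed, some by quite a lot, yet the mean change is small and the test is nowhere near significant: the test speaks only about the mean, not about whether anyone changed.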

The systematic, paradigmatic problems #

This is really about the wider problem. Gosset, who invented the t-test, intended it to give the brewery directors a way of assessing quantitative findings and putting their uncertainties and generalisability in a statistical frame. That frame didn’t remove uncertainty but offered them a decision making process, to be used in full awareness that it was only going to be accurate in quantifying the uncertainty if the various assumptions of the model fitted the process that created the data. For the brewery the assumptions were probably not far wrong and appropriate experiments could be constructed fairly easily to match the method. For double blind randomised controlled trials (RCTs, q.v.) in the pharmaceutical realm the same is probably largely true: the NHST fits analysis of a single trial well.

For therapy exploration the assumptions of the NHST are very rarely true: our RCTs are inevitably not blind, our samples are never really random, and our populations are neither infinite nor homogeneous. Given this we should be extremely cautious in presenting our NHST analyses and their implications. However, I suspect that if we were systematic in that honest way the NHST simply wouldn’t have had the dominance it has and I suspect we might have made a lot more progress in the last fifty years or so of therapy research!

I am sure that will sound offensive and negative to many readers but think: how did that statement in Carmichael et al. (2021) get through peer review (q.v.)? I suspect that this is diagnostic of our systemic problem of overvaluing these methods: misusing them is so common that it would seem that probably at least two peer reviewers and the editor failed to notice such an extreme misrepresentation of the findings.

Try also #

Bootstrapping / bootstrap methods
Confidence intervals
Inferential tests
Non-parametric tests
Parametric tests

Chapters #

Touched on in Chapter 5, in many ways Chapter 10 is all about a different, less mad, more constructive way we should be thinking about our questions, our data and our analyses.

Online resources #

None yet. If there are any, they will present both NHST and estimation methods with some clarification of the issues with both.

Dates #

First created 20.viii.23, updated 9.i.24.
