The idea that a set of data is a “sample” is so ingrained in the research world that the term is applied automatically to any set of data. I’m sure it is even used that way in research journal style guides (but I ought to find examples!). This has to change.

Details #

Why am I trying to change this?! Because it conflates a very important idea, essentially an ideal research model or thought experiment, with real life. That’s dangerous: it’s a major contributor to our literature being full of overstatement and overconfident presentation of findings. Sometimes those are genuinely interesting findings, sometimes not very interesting at all, but they are presented as generalisable, or even as “facts”. Findings about a dataset should be “facts”: if they’re not then something has gone wrong. But they’re never facts about the wider world.

The problem is that the “random sampling from an (infinite) population” at the heart of inferential testing and the Null Hypothesis Significance Test is a very strong statistical model. It creates a very good model to give us some handle on our uncertainties generalising from “sample” data to a population. The danger arises when we overstate how well the model fits the real world in which the dataset arose, and make generalisations, extrapolations, from our analyses of our dataset, to the wider world.

We can’t do anything other than recognise that our sampling won’t be random. (I’m pretty sure I’ve never seen a report in which the data really were a random sample from a large population that was the real population to which we want to be able to generalise our findings.) We should acknowledge that a population that is homogeneous in whatever we’re interested in is rarely, if ever, realistic. We can be careful to acknowledge where other assumptions of our statistical analysis are violated, particularly independence of observations (usually far more important than distributional assumptions). What we should be doing is listing these issues and, where we can, estimating their possible impact on any generalisations we are making.

All this is almost never done. One reason, I am sure, is that it runs against all the pressures to claim far more than we should: it would make it obvious that our estimates of the impact of these issues on our generalisations may be little more than guesswork, or very depressing. Suppose only 10% of the people invited to participate in a survey responded, but that this was still 500 respondents of the 5,000 contacted (I like to make the arithmetic easy!). Suppose 100 of the 500 said they had had therapy, so we could report this as saying that 20% of the population have had therapy. We could be meticulous and say that the 95% confidence interval for 100 of 500 is from 17% to 24% (it is, see here if you want to check).
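For anyone who prefers to check the interval in code, here is a minimal sketch. It assumes the Wilson score method, which is my choice for illustration: the entry doesn't say which method lies behind the quoted figures, and other methods (e.g. Clopper-Pearson) give slightly different limits. The function name `wilson_interval` is hypothetical. Like any such interval it assumes random sampling and independence of observations.

```python
import math

def wilson_interval(successes, n, z=1.959964):
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95%).

    Assumes the n observations are an independent random sample,
    which is exactly the assumption this entry is questioning.
    """
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width

# The survey example: 100 of 500 respondents said they had had therapy.
lower, upper = wilson_interval(100, 500)
print(f"95% CI: {lower:.1%} to {upper:.1%}")
```

Rounded to whole percentages, this should agree with the 17% to 24% quoted above.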

That sounds pretty precise (having n = 500 gets you quite good precision!). But we don’t have a random sample. It’s very likely that some people who didn’t participate are antipathetic to therapy and would never have therapy. It’s also possible that some people had therapy, were left with very grim memories and won’t answer the question. If all the non-responders were “never have therapy” people then the number of the 5,000 who have had therapy is still 100, but now the proportion is 100 of 5,000: 2% not 20% (and the 95% CI is from 1.6% to 2.4%). On the other hand, if all 4,500 non-responders were people who have had therapy and left hating the memory, we have a prevalence of 4,600 of 5,000: 92%, with a 95% CI from 91.2% to 92.7%.

Of course this is unfair, and these models are even more implausible than the model that the 500 were a random “sample” of the 5,000. However, this very crude “sensitivity analysis” does underline the uncertainties and shows that extrapolations, generalisations, from this dataset are really guesses.
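The two extreme scenarios are the end points of a range, and it can help to see what lies between them. The sketch below is an illustration only: the function `implied_prevalence` and its parameter `nonresponder_rate` (the assumed proportion of non-responders who have had therapy) are hypothetical names, and the intermediate rates are arbitrary values chosen to show the spread, not estimates of anything.

```python
def implied_prevalence(yes_responders, responders, invited, nonresponder_rate):
    """Prevalence among everyone invited, given an ASSUMED rate among non-responders."""
    non_responders = invited - responders
    implied_yes = yes_responders + nonresponder_rate * non_responders
    return implied_yes / invited

# Survey example: 100 "yes" among 500 responders, 5,000 people invited.
# 0.0 and 1.0 are the two extreme scenarios; 0.2 is "non-responders are
# just like responders", i.e. the random-sample model.
for rate in (0.0, 0.1, 0.2, 0.5, 1.0):
    prevalence = implied_prevalence(100, 500, 5000, rate)
    print(f"assumed rate in non-responders {rate:.0%}: prevalence {prevalence:.1%}")
```

Only the 0.2 row reproduces the 20% headline figure. Every other row is just as compatible with the data we actually have, which is the point of the exercise.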

There’s one excellent paper about using statistical methods:
Wagenmakers et al. (2021). Seven steps toward more transparency in statistical practice. Nature Human Behaviour, 5(11), 1473–1480. The sixth of their seven recommendations is: “interpreting results modestly”. (The others are excellent too, even if some are a bit beyond the resources of practitioners and even of most therapy researchers outside the few “therapy research factories” of the global north.)

In line with this need for us to get more modest and honest about our data and our models, I am dropping the word “sample” and only using “dataset” in my work. I hope this will help chip away at the dangerous association of ideas that leads people to suggest that our datasets can safely be treated as random samples, and that strong generalisations can then be drawn from them using statistical analyses built on the sampling model.

There is another, rather small print but still important, occasion when we shouldn’t use the word “sample”: when what we have is a complete census and we’re not (not if we’re honest) trying to generalise to some population. For example, we may have a complete dataset of CORE score changes across therapy from a pioneering service offering therapy to people with severe facial scarring following acid attacks. One question of interest and importance may be whether being in an intimate relationship that has broken up after the attack is associated with higher scores. Here there is no necessary ambition to extrapolate to all possible clients in this situation. Interestingly, it’s actually a situation in which a different statistical model can be used: permutation analysis, which has no sample/population model. But that’s for another entry and some new shiny apps!
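In the meantime, here is a minimal sketch of the permutation idea, under loud assumptions: the scores are invented purely for illustration (they are not real CORE data), the group labels are hypothetical, and the difference in means is just one possible test statistic. The logic is that, if group membership were unrelated to score, every way of splitting the pooled scores into groups of the observed sizes would be equally likely, so we count how many splits give a difference at least as large as the one observed. No population is invoked anywhere.

```python
from itertools import combinations
from statistics import mean

def exact_permutation_p(group_a, group_b):
    """Two-sided exact permutation test on the difference in means."""
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    observed = mean(group_a) - mean(group_b)
    total = 0
    as_extreme = 0
    # Enumerate every way of choosing which pooled scores go in group A.
    for idx in combinations(range(len(pooled)), n_a):
        chosen = set(idx)
        a = [pooled[i] for i in chosen]
        b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        diff = mean(a) - mean(b)
        total += 1
        # Small tolerance so the observed split itself is always counted.
        if abs(diff) >= abs(observed) - 1e-12:
            as_extreme += 1
    return observed, as_extreme / total, total

# INVENTED scores, for illustration only.
relationship_ended = [12, 15, 9, 14]
relationship_intact = [7, 10, 8, 11, 6]

observed, p_value, n_splits = exact_permutation_p(relationship_ended, relationship_intact)
print(f"observed difference in means: {observed:.2f}")
print(f"{n_splits} possible splits, proportion as extreme: {p_value:.3f}")
```

Full enumeration is only feasible for small datasets. With larger ones the usual approach is to draw a large number of random reshuffles instead.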

Try also #

Inferential testing
Null hypothesis significance tests (NHST)
Permutation tests
Sensitivity analysis

Chapters #


Online resources #

Shiny app to give confidence intervals around a simple proportion (assuming random sampling and independence of observations!)

Dates #

First created 27.viii.23.
