
Fisher’s exact test

Like the chi squared test, this is a paradigmatic statistical test for association in a contingency table. It is widely accepted to have been invented by Ronald Fisher, who is often seen as almost the single-handed creator of modern statistics (but a man with repugnant views on “race”).

Details #

The story behind the test is that Fisher offered Muriel Bristol a cup of tea poured from an urn and she rejected it, saying that she would have to add milk to it and that she preferred the milk to have been put in the cup before the tea. Fisher didn’t believe she would be able to taste the difference and proposed a test in which she was offered eight cups of tea, four of which had had the milk poured before the tea and the other four the tea before the milk. History says that Muriel Bristol correctly identified all eight cups and that Fisher worked out that the probability of getting all eight correct in that situation was one in 70: sufficiently unlikely that he accepted that she could taste the difference.
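The one-in-70 figure is pure counting: with four cups of each kind, there are “8 choose 4” = 70 ways to pick which four of the eight cups had the milk first, and only one of those selections gets every cup right. A quick check in Python (standard library only; the variable names are mine):

```python
from math import comb

# Number of ways to choose which 4 of the 8 cups had milk first
ways = comb(8, 4)           # 70 possible selections
p_all_correct = 1 / ways    # only one selection gets all eight cups right

print(ways, p_all_correct)  # 70 and roughly 0.0143
```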

This was one of the tests that established the idea of null hypothesis significance testing (NHST). I love the back story, which gave a great book about the history of statistics its title: Salsburg, D. (2001). The lady tasting tea: How statistics revolutionized science in the twentieth century. Henry Holt. ISBN: 0-8050-7134-2.

The maths of the test depends on the “marginals” of the contingency table being known by design. For this simple 2×2 table that translates into Muriel Bristol knowing that four of the eight cups had the milk in first and that she would therefore use the same “marginals”, i.e. 4 and 4, in categorising the cups. Here is the contingency table for her answers. You can see why the marginals are so called: they are the totals in the margins of the table.

What MB said:     Milk first   Tea first   Total
Milk was first         4           0          4
Tea was first          0           4          4
Total:                 4           4          8

As the name suggests, this test is an example of an “exact” test: the maths gives an exact answer for the p value for the particular design and total n. That is not true of a lot of statistical tests, which give approximate p values based on the maths of infinitely large distributions. For most such approximate tests the p values are good enough down to quite small real sample sizes.
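As a sketch of what that exact calculation looks like for a 2×2 table with fixed marginals, here is a minimal one-sided version in Python (standard library only; the function name is my own, not from any package):

```python
from math import comb

def fisher_one_sided(a, b, c, d):
    """One-sided exact p value for the 2x2 table [[a, b], [c, d]]:
    the probability of a count in cell a at least as large as the one
    observed, with all marginals treated as fixed (hypergeometric)."""
    row1, col1, n = a + b, a + c, a + b + c + d
    denom = comb(n, row1)
    # Sum P(X = k) for k from the observed count up to its maximum
    return sum(comb(col1, k) * comb(n - col1, row1 - k)
               for k in range(a, min(row1, col1) + 1)) / denom

# The tea-tasting table: all four "milk first" cups identified
print(fisher_one_sided(4, 0, 0, 4))  # 1/70, roughly 0.0143
```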

This issue about “fixed marginals” does matter, and the marginals can be fixed by design for quite a lot of experiments like the tea tasting. However, for general contingency table data the marginals are not fixed and technically Fisher’s test is inappropriate. What does this mean? Say a practitioner works in two services and reviews her year’s clients’ “outcomes” on a simple binary: “reliably improved” vs. “not reliably changed or reliably deteriorated”. She won’t know in advance that the referrers will have made the same number of referrals, nor what the improvement rates will be: here these “marginals” are functions of the data. Her outcome table might look like this.

Service           A      B    Total
Improved         18     36       54
Not improved      6      2        8
Total:           24     38       62

Being empirical aspects of the data, these marginals are said to be “free”. Does this matter? Well, it means that Fisher’s test is likely to be underpowered and conservative compared with tests that assume the marginals are free (typically, in our field, the chi squared test). “Underpowered” means that, for the same non-null model, you will need more data to get a statistically significant finding than you would with the better test; “conservative” means that the test is likely to show a p value greater, i.e. less significant, than it should for any given findings. Having said that, the non-exact nature of the chi squared test does mean that it too can be misleading if cell sizes are small (more likely if the total n is small, of course). Just to make things more complicated, Fisher’s exact test can be hard to compute even for 2×2 tables when the total n is large, and its exact form can be impractical to compute even with modern computer hardware for largeish n and tables larger than 2×2. That brings me to this next bit!
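To make the conservatism point concrete on the service table above, here is a sketch (standard library Python; the function names are my own) comparing a two-sided Fisher p value with the chi squared p value for a 2×2 table. For one degree of freedom the chi squared tail probability reduces to the complementary error function, so no stats library is needed:

```python
from math import comb, erfc, sqrt

def fisher_two_sided(a, b, c, d):
    """Two-sided exact p: sum the probabilities of all tables with the
    same marginals that are no more probable than the observed one."""
    row1, col1, n = a + b, a + c, a + b + c + d
    denom = comb(n, row1)
    lo, hi = max(0, row1 + col1 - n), min(row1, col1)
    probs = [comb(col1, k) * comb(n - col1, row1 - k) / denom
             for k in range(lo, hi + 1)]
    p_obs = comb(col1, a) * comb(n - col1, row1 - a) / denom
    return sum(p for p in probs if p <= p_obs * (1 + 1e-9))

def chisq_two_sided(a, b, c, d):
    """Pearson chi squared p value for a 2x2 table (df = 1), using the
    identity P(chi2 > x) = erfc(sqrt(x / 2))."""
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    stat = sum((obs - exp) ** 2 / exp for obs, exp in [
        (a, row1 * col1 / n), (b, row1 * col2 / n),
        (c, row2 * col1 / n), (d, row2 * col2 / n)])
    return erfc(sqrt(stat / 2))

# The service outcome table: improved 18 vs. 36, not improved 6 vs. 2
print(round(fisher_two_sided(18, 36, 6, 2), 4))  # roughly 0.047: larger, more conservative
print(round(chisq_two_sided(18, 36, 6, 2), 4))   # roughly 0.024: smaller
```

Both p values happen to fall below .05 here, but the Fisher p is nearly twice the chi squared p, which is the conservatism the text describes.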

When is Fisher’s test not exactly an exact test? #

For any practical purposes with typical datasets, this is really just amusement.

Oddly enough, the answer is: quite often! When your contingency table is bigger than 2×2, and particularly when your n is large (which can be not all that large if the table is big), the exact p value is extremely hard to compute, so most software uses a Monte Carlo approximation. You are then back to an approximate test, although approximate because it uses a simulation to get a p value, not because it uses the maths of an infinitely large distribution for finite data.
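To show what such software is doing, here is a sketch of the Monte Carlo version in standard library Python (the function names are mine; this mirrors the general idea behind, e.g., R’s fisher.test with simulate.p.value = TRUE, not any package’s actual internals). It works for any r×c table: simulate tables with the observed marginals and count how often a simulated table is no more probable than the observed one:

```python
import random
from collections import Counter
from math import lgamma

def log_prob(table, row_tot, col_tot, n):
    """Log probability of an r x c table given fixed marginals:
    (prod row_i!)(prod col_j!) / (n! * prod cell!)."""
    lp = (sum(lgamma(r + 1) for r in row_tot)
          + sum(lgamma(c + 1) for c in col_tot) - lgamma(n + 1))
    return lp - sum(lgamma(cell + 1) for row in table for cell in row)

def mc_fisher(table, sims=20000, seed=2026):
    rng = random.Random(seed)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    lp_obs = log_prob(table, row_tot, col_tot, n)
    # One column label per observation; shuffling these while keeping
    # row membership fixed preserves both sets of marginals
    cols = [j for j, c in enumerate(col_tot) for _ in range(c)]
    hits = 0
    for _ in range(sims):
        rng.shuffle(cols)
        sim = [[0] * len(col_tot) for _ in row_tot]
        pos = 0
        for i, r in enumerate(row_tot):
            for j, cnt in Counter(cols[pos:pos + r]).items():
                sim[i][j] = cnt
            pos += r
        if log_prob(sim, row_tot, col_tot, n) <= lp_obs + 1e-9:
            hits += 1
    return (hits + 1) / (sims + 1)

# On the 2x2 service table the simulated p should sit close to the
# exact two-sided value (roughly 0.047)
print(mc_fisher([[18, 36], [6, 2]]))
```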

So what should we do? #

Interestingly, the more I dig into this, the more I feel that the “best” test for contingency table questions is a surprisingly complicated area, given that the chi squared test has been around since 1900 and Fisher’s test since 1935 (or 1933, see the next section!). Mind you, as I think our literature treats all NHST “tests” as far more definite than they are, perhaps this shouldn’t either surprise me or worry me too much. I think my approach should become this.

  • If I am doing a designed experiment where the marginals are fixed then the Fisher test is clearly the correct option.
  • Where the marginals are not fixed, as for pretty much all survey data contingency tables, I suspect I should say a priori that I will report both a Fisher test and a chi squared test and will treat the two p values as indicative, telling us about the uncertainties arising from these issues, not treating either as definitive in a classical NHST way.

Names #

You will see what I am calling “Fisher’s exact test” referred to as “Fisher’s test” or as the “Fisher-Irwin test”. Apparently Irwin had himself worked out the maths of the test in 1933, two years before Fisher described it, and Irwin also clarified the maths in a paper published in 1935.

Try also #

Chapters #

Not covered in the OMbook.

Online resources #

Dates #

First created 13.i.26.
