Confidence intervals for quantiles

Confidence intervals, ECDFs, Quantiles (quartiles, [per]centiles, deciles)

A general coverage of quantiles, the ECDF (Empirical Cumulative Distribution Function) and confidence intervals (CIs) around quantiles

Chris Evans, https://www.psyctc.org/R_blog/ (PSYCTC.org, https://www.psyctc.org/psyctc/)
2023-08-09

Introduction and background

This is one of my “background for practitioners” blog posts, not one of my geeky ones. It will lead into another post about two specific functions I’ve added to the CECPfuns R package. However, this one will cross-link to entries in the free online OMbook glossary.

Quick introduction to quantiles (skim through if you know this)

Quantiles are important and very useful but seriously underused. They are important both because they help describe distributions of data and because they can help us map between an individual’s data and collective data: one of the crucial themes in the mental health and therapy evidence base.

I recommend that you read my post about ECDFs before going further if you don’t feel familiar with quantiles: the ECDF (Empirical Cumulative Distribution Function) is a great introduction to quantiles.

A quick terminological point: quantiles are pretty much the same thing as percentiles, centiles and deciles, and the median and the lower and upper quartiles are specific quantiles. I’ll come back to that.

Quantiles tell us about the location of observed values within a distribution of data. Let’s start with the Gaussian distribution (often called, a bit misleadingly, the “Normal”; note the capital letter: it’s not normal in any everyday sense of that word!). This facetted plot shows three simulated samples from the Gaussian distribution. The sample sizes are 100, 1,000 and 10,000.

Histogram of samples from Gaussian distribution

Show code
set.seed(12345) # set seed to get the same data regardless of platform and occasion
# rnorm(10000) %>%
#   as_tibble() %>%
#   mutate(n = 10000) %>%
#   rename(score = value) -> tibGauss10k
# 
# rnorm(1000) %>%
#   as_tibble() %>%
#   mutate(n = 1000) %>%
#   rename(score = value) -> tibGauss1k
# 
# rnorm(100) %>%
#   as_tibble() %>%
#   mutate(n = 100) %>%
#   rename(score = value) -> tibGauss100
# 
# bind_rows(tibGauss100,
#           tibGauss1k,
#           tibGauss10k) -> tibGaussAll

### much more tidyverse way of doing that
c(100, 1000, 10000) %>% # set your sample sizes
  as_tibble() %>%
  rename(n = value) %>%
  ### now you are going to generate samples per value of n so rowwise()
  rowwise() %>%
  mutate(score = list(rnorm(n))) %>%
  ungroup() %>% # override the grouping from rowwise() and unnest to get individual values
  unnest_longer(score) -> tibGauss

### get sample statistics
tibGauss %>%
  group_by(n) %>%
  summarise(min = min(score),
            median = median(score),
            mean = mean(score),
            sd = sd(score),
            lquart = quantile(score, .25),
            uquart = quantile(score, .75),
            max = max(score),
            ### and bootstrap mean (could have used parametric derivation as this is true Gaussian but I couldn't remember it!)
            CI = list(getBootCImean(score, verbose = FALSE))) %>%
  unnest_wider(CI) -> tibGaussStats

ggplot(data = tibGauss,
       aes(x = score)) +
  facet_wrap(facets = vars(n),
             nrow = 1) +
  geom_histogram(aes(y = after_stat(density))) +
  ylim(c(0, .6)) +
  ylab("Count") +
  ggtitle("Faceted histogram for three samples from Gaussian distribution",
          subtitle = "Sample sizes 100, 1,000 and 10,000")

Boxplots of samples from Gaussian distribution

Here are the same data as boxplots, overlaid with jittered points showing the actual observations. This brings us to the simplest quantiles.

Show code
ggplot(data = tibGauss,
       aes(x = 1, y = score)) +
  facet_wrap(facets = vars(n),
             nrow = 1) +
  geom_boxplot(notch = TRUE,
               varwidth = TRUE,
               fill = "grey80") +
  geom_jitter(width = .35, height = 0,
              alpha = .05,
              colour = "grey40") +
  ### mark the sample mean (from tibGaussStats) in red, as promised in the subtitle
  geom_point(data = tibGaussStats,
             aes(x = 1, y = mean),
             colour = "red",
             size = 2) +
  ylab("Scores") +
  ggtitle("Faceted boxplot with jittered observations for three samples from Gaussian distribution",
          subtitle = "Sample sizes 100, 1,000 and 10,000, sample mean in red")

The boxplot uses three quantiles to describe the box: the median locates the belt and waist across the box, and the quartiles fix the lower and upper limits of the box. What are these quantiles? The median is the score (not necessarily one present in the data) that would bisect the data into two equal-sized halves, so it’s the value such that half the observed values lie below it and half lie above it. The lower quartile is the value such that a quarter of the observed values lie below it and three quarters above it; the upper quartile is the value such that three quarters of the sample lie below it and one quarter above it.

So we can now start to look at these names.

Quantile | Quartile | Percentile (a.k.a. centile)
---------|----------|----------------------------
.25      | lower    | 25%
.50      | median   | 50%
.75      | upper    | 75%
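As a quick base-R sketch of that mapping (not code from the post, just base R’s quantile(), which takes the probabilities directly):

set.seed(12345)
x <- rnorm(1000)                      # simulated scores, as in the samples above
quantile(x, probs = c(.25, .5, .75))  # quartiles: the .5 quantile is the median
quantile(x, probs = seq(.1, .9, .1))  # deciles
quantile(x, probs = c(.05, .95))      # 5th and 95th (per)centiles

The names are just conventional labels for particular probabilities; quantile() doesn’t care which ones you ask for.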

Now we get to the ECDF which, like the histogram, violin plot and boxplot, is another way to describe a distribution of observed data. This takes us into the richness of quantiles.
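If you want to play with the idea directly, base R’s ecdf() returns the ECDF as a function you can evaluate at any score (a minimal sketch, not code from the post):

set.seed(12345)
x <- rnorm(1000)        # simulated scores
Fhat <- ecdf(x)         # ecdf() returns a step function
Fhat(0)                 # proportion of the sample scoring at or below zero (near .5 for a standard Gaussian sample)
Fhat(quantile(x, .25))  # evaluates to (essentially) .25 at the lower quartile, by construction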

ECDF of samples from Gaussian distribution

Show code
tibGaussStats %>%
  select(n, min, lquart, median, uquart, max) %>%
  pivot_longer(cols = min:max) %>%
  rename(Quantile = name) %>%
  mutate(Quantile = ordered(Quantile,
                            levels = c("min", "lquart", "median", "uquart", "max"),
                            labels = c("Min", "Lower quartile", "Median", "Upper quartile", "Max"))) -> tibGaussStatsLong

ggplot(data = tibGauss,
       aes(x = score)) +
  facet_wrap(facets = vars(n),
             ncol = 1) +
  stat_ecdf() +
  geom_vline(data = tibGaussStatsLong,
             aes(xintercept = value, colour = Quantile)) +
  geom_text(data = tibGaussStatsLong,
            aes(label = round(value, 2),
                x = value,
                y = .28),
            nudge_x = -.04,
            hjust = 1) +
  ylab("Proportion of the sample scoring below the value") +
  ggtitle("Faceted ECDF plot for three samples from Gaussian distribution",
          subtitle = "Sample sizes 100, 1,000 and 10,000, quantiles shown as coloured lines with their values.")

I’ve changed to faceting by rows here instead of columns to give a better spread on the plot. The ECDF plots, on the y axis, the proportion of the sample scoring at or below the value on the x axis.

Now a few fairly obvious comments on the impact of sample size, assuming the classical model of random sampling from an infinitely large population. These impacts are visible in all the distribution plots above.

Show code
tibGaussStats %>% 
  select(n, lquart, median, uquart) %>%
  pivot_longer(cols = -n) %>%
  mutate(value = round(value, 4),
         proportion = case_when(
                              name == "lquart" ~ .25,
                              name == "median" ~ .5,
                              name == "uquart" ~ .75),
         popVal = qnorm(proportion),
         popVal = round(popVal, 4)) %>%
  pivot_wider(names_from = n, names_prefix = "n", values_from = value) %>%
  flextable() %>%
  autofit()

name   | proportion | popVal  | n100    | n1000   | n10000
-------|------------|---------|---------|---------|--------
lquart | 0.25       | -0.6745 | -0.5901 | -0.6359 | -0.6652
median | 0.50       |  0.0000 |  0.4837 |  0.0080 | -0.0019
uquart | 0.75       |  0.6745 |  0.9004 |  0.6453 |  0.6576

It can be seen there that the observed quantile values get closer to the population values (popVal) as the sample size increases.

Confidence intervals for quantiles

So, as we can see above, any observed quantile value, like any sample statistic, will take a different value in the next sample under any real sampling process, whether truly random (which, in my view, only happens in simulations) or not. That means that, like any sample statistic, an observed quantile can be given a confidence interval around it, and this CI will be narrower the larger the sample size.
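A quick simulation makes that concrete (a minimal sketch, not code from the post): draw many samples of n = 100 from the standard Gaussian and watch the observed median bounce around from sample to sample.

set.seed(12345)
medians <- replicate(1000, median(rnorm(100))) # observed median from 1,000 samples of n = 100
round(sd(medians), 3)                          # sampling variability of the observed median
round(quantile(medians, c(.025, .975)), 2)     # the spread a 95% CI has to try to capture

Increase the 100 to 1,000 or 10,000 and that spread shrinks: exactly the narrowing of the CIs described above.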

There are various ways of computing such a confidence interval. The R package quantileCI gives three methods, including a bootstrap method. They’re all non-parametric, i.e. they make no assumptions about the shape of the distribution of the scores for which you computed the quantiles. From a bit of reading prompted by Michael Höhle’s R package quantileCI (https://github.com/hoehleatsu/quantileCI) it seems to me that the Nyblom method is probably the best choice (and the differences between the methods are unlikely to cause us any headaches with typical MH/therapy score data). As ever, the confidence interval gives a range around the observed value of a sample statistic that should include the population value in a given proportion of samples; the proportion usually used is 95%, i.e. a 95% confidence interval.
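As a minimal sketch of the underlying call (assuming quantileCI has been installed from that GitHub repository; it is not on CRAN, so e.g. remotes::install_github("hoehleatsu/quantileCI")):

# remotes::install_github("hoehleatsu/quantileCI") # one-off installation from GitHub
set.seed(12345)
x <- rnorm(1000)
quantileCI::quantile_confint_nyblom(x, .5)  # lower and upper confidence limits for the median (package defaults, treated as 95% here)
quantileCI::quantile_confint_nyblom(x, .25) # the same for the lower quartile

With that, here are the CIs for the quartiles and the median in our three samples.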

Show code
tibGauss %>%
  group_by(n) %>%
  summarise(lquartCI = list(quantileCI::quantile_confint_nyblom(score, .25)),
            medianCI = list(quantileCI::quantile_confint_nyblom(score, .5)),
            uquartCI = list(quantileCI::quantile_confint_nyblom(score, .75))) %>%
  unnest_wider(lquartCI, names_sep = ":") %>%
  unnest_wider(medianCI, names_sep = ":") %>%
  unnest_wider(uquartCI, names_sep = ":") %>%
  rename(lquartLCL = `lquartCI:1`,
         lquartUCL = `lquartCI:2`,
         medianLCL = `medianCI:1`,
         medianUCL = `medianCI:2`,
         uquartLCL = `uquartCI:1`,
         uquartUCL = `uquartCI:2`) %>%
  mutate(lquartCI = paste0(round(lquartLCL, 2), " to ", round(lquartUCL, 2)),
         medianCI = paste0(round(medianLCL, 2), " to ", round(medianUCL, 2)),
         uquartCI = paste0(round(uquartLCL, 2), " to ", round(uquartUCL, 2))) %>%
  left_join(tibGaussStats, by = "n") %>%
  mutate(lquart = round(lquart, 2),
         median = round(median, 2),
         uquart = round(uquart, 2)) %>%
  select(-c(min, mean, sd, max:UCLmean)) -> tmpTib

tmpTib %>%
  select(n, lquart, lquartCI, median, medianCI, uquart, uquartCI) %>%
  flextable() %>%
  autofit()

n      | lquart | lquartCI       | median | medianCI      | uquart | uquartCI
-------|--------|----------------|--------|---------------|--------|--------------
100    | -0.59  | -0.88 to -0.31 | 0.48   | -0.01 to 0.61 | 0.90   | 0.71 to 1.44
1,000  | -0.64  | -0.72 to -0.55 | 0.01   | -0.06 to 0.08 | 0.65   | 0.58 to 0.74
10,000 | -0.67  | -0.69 to -0.64 | 0.00   | -0.03 to 0.02 | 0.66   | 0.63 to 0.69

That shows the observed values of the quartiles and the median for the three samples, with their 95% confidence intervals. It’s very clear that the intervals get narrower as the sample size increases.

Plotting the ECDF with quantiles and their CIs

I’ve added the function plotQuantileCIsfromDat to the CECPfuns R package. It creates the plots below, which I like: they show any requested quantiles, here the quartiles and the median, against the ECDF of the data, and they plot the confidence intervals for those quantiles. Here are the plots for those three quantiles for the three simulated samples we’ve been using so far.

Show code
tibGauss %>% 
  filter(n == 100) %>%
  select(score) %>% 
  pull() -> tmpVec

plotQuantileCIsfromDat(tmpVec, vecQuantiles = c(.25, .5, .75), addAnnotation = FALSE, printPlot =  FALSE, returnPlot = TRUE) -> tmpPlot100

tibGauss %>% 
  filter(n == 1000) %>%
  select(score) %>% 
  pull() -> tmpVec

plotQuantileCIsfromDat(tmpVec, vecQuantiles = c(.25, .5, .75), addAnnotation = FALSE, printPlot =  FALSE, returnPlot = TRUE) -> tmpPlot1000

tibGauss %>% 
  filter(n == 10000) %>%
  select(score) %>% 
  pull() -> tmpVec

plotQuantileCIsfromDat(tmpVec, vecQuantiles = c(.25, .5, .75), addAnnotation = FALSE, printPlot =  FALSE, returnPlot = TRUE) -> tmpPlot10000

library(patchwork)
### standardise the x axis ranges
tmpPlot100 + 
  xlim(c(-4, 4)) -> tmpPlot100
tmpPlot1000 + 
  xlim(c(-4, 4)) -> tmpPlot1000
tmpPlot10000 + 
  xlim(c(-4, 4)) -> tmpPlot10000

tmpPlot100 /
  tmpPlot1000 /
  tmpPlot10000

Unsurprisingly, those plots show that the three quantiles are well separated, with non-overlapping confidence intervals even for n = 100, and they show how the confidence intervals tighten as n increases.

I hope this was a fairly clear introduction to putting confidence intervals around observed quantiles.


Last updated

Show code
cat(paste(format(Sys.time(), "%d/%m/%Y"), "at", format(Sys.time(), "%H:%M")))
18/04/2024 at 10:18

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY-SA 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Evans (2023, Aug. 9). Chris (Evans) R SAFAQ: Confidence intervals for quantiles. Retrieved from https://www.psyctc.org/R_blog/posts/2023-08-09-confidence-intervals-for-quantiles/

BibTeX citation

@misc{evans2023confidence,
  author = {Evans, Chris},
  title = {Chris (Evans) R SAFAQ: Confidence intervals for quantiles},
  url = {https://www.psyctc.org/R_blog/posts/2023-08-09-confidence-intervals-for-quantiles/},
  year = {2023}
}