Distribution overlays in ggplot: after_stat(), stat_function() and facets

ggplot Distributions Graphics Plots and graphs

There are many ways of overlaying a curve on a plot, typically a Gaussian on a histogram. This is my lastest best way!

Chris Evans https://www.psyctc.org/R_blog/ (PSYCTC.org)https://www.psyctc.org/psyctc/
2025-05-04
Show code
### this is just the code that creates the "copy to clipboard" function in the code blocks
htmltools::tagList(
  xaringanExtra::use_clipboard(
    button_text = "<i class=\"fa fa-clone fa-2x\" style=\"color: #301e64\"></i>",
    success_text = "<i class=\"fa fa-check fa-2x\" style=\"color: #90BE6D\"></i>",
    error_text = "<i class=\"fa fa-times fa-2x\" style=\"color: #F94144\"></i>"
  ),
  rmarkdown::html_dependency_font_awesome()
)

Introduction

In an earlier post here (Explore distributions with plots) I superimposed a Gaussian distribution curve on a histogram and the method used there, of creating a tibble with a Gaussian distribution using dnorm() and then superimposing a plot of that distribution on the histogram does work but is a bit tedious if you want to facet your plot. This method does adapt to facetting fine using stat_function() and reminds me about using after_stat() rather than the denigrated double dot “..density..” syntax (many internet pages about using ggplot to superimpose distribution lines on plots still show the ..density.. syntax).

It started with this code which I used to create an OMbook glossary entry: “The” seven number summary.

Show code
### set sample size
n <- 5000
### set seed for reproducible results
set.seed(12345)
### create tibble of data
tibble(x = rnorm(n, mean = 10)) %>%
  ### mutate to cube those values
  mutate(xcubed = x^3,
         ### and transform to mean 10 and SD 1
         xcubed = 10 + (xcubed - mean(xcubed)) / sd(xcubed)) -> tmpTib

### pivot longer to get in form that's nice for facetting
tmpTib %>%
  pivot_longer(cols = everything(), names_to = "variable") -> tmpTibLong


### this is a key bit for use of stat_function(): 
### get  the statistics for each variable
tmpTibLong %>%
  group_by(variable) %>%
  summarise(mean = mean(value),
            median = median(value),
            sd = sd(value)) %>%
  ungroup() -> tmpTibStats


### now the plot!
ggplot(data = tmpTibLong,
       aes(x = value)) +
  ### facet by variable of interest
  facet_grid(rows = vars(variable)) +
  ### now use after_stat() to get density not count
  ### that's sensible given the aim to superimpose the
  ### Gaussian distribution curve fitting each variable
  geom_histogram(aes(y = after_stat(density)),
                 bins = 50,
                 fill = "grey") +
  ### now use stat_function() to generate the curves 
  ### using the means and sds per variable 
  ### and fitting with the facetting
  stat_function(data = tmpTibStats,
                inherit.aes = FALSE,
                fun = dnorm,
                ### one thing that surprised me was that I had to use
                ### the explicit "tmpTibStats$" here
                args = list(mean = tmpTibStats$mean,
                            sd = tmpTibStats$sd),
                ### this tells stat_function() the number of points to 
                ### create
                            n = 1500,
                colour = "red") +
  ### very ordinary stuff from here on!
  geom_vline(data = tmpTibStats,
             aes(xintercept = mean),
             colour = "green") +
  geom_vline(data = tmpTibStats,
             aes(xintercept = median),
             colour = "blue") +
  ggtitle("Histograms of Gaussian and standardised cubed Gaussian distributions",
          subtitle = "Blue reference lines mark means, green lines mark medians")

There are a couple of slightly odd things to me in there:

Show code
### find the limits of the values in the datasets to know what range of 
### of Gaussian density values you need to generate
tmpTibLong %>%
  summarise(minX = min(value),
            maxX = max(value)) -> tmpTibXlimits
### how many points do you want
### I have chose this to match what I had in stat_function() above
nPoints <- 1500

### now generate those density values for each variable
tmpTibStats %>%
  ### for each valuable (not really necessary here but would be in general
  ### where we would be comparing different variables not ones standardised
  ### to have essentially the same mean and SD)
  group_by(variable) %>%
  mutate(xVals = list(seq(tmpTibXlimits$minX, 
                          tmpTibXlimits$maxX, 
                          length.out = nPoints))) %>%
  ungroup() %>%
  ### unnest to get in long format
  unnest_longer(xVals) %>%
  ### prune to only the variables we need
  select(variable, mean, sd, xVals) %>%
  ### generate the Gaussian density values
  ### using the observed mean and SD for each variable
  mutate(dGauss = dnorm(xVals, 
                        mean = mean, 
                        sd = sd)) -> tmpTibLineValues


### now do the plot!
ggplot(data = tmpTibLong,
       aes(x = value)) +
  ### facet by variable of interest
  facet_grid(rows = vars(variable)) +
  ### now use after_stat() to get density not count
  ### that's sensible given the aim to superimpose the
  ### Gaussian distribution curve fitting each variable
  geom_histogram(aes(y = after_stat(density)),
                 bins = 50,
                 fill = "grey") +
  ### this replaces the stat_function() call
  ### and uses the Gaussian data generated above
  ### to get the density lines per facet
  geom_line(data = tmpTibLineValues,
            aes(x = xVals, y = dGauss),
            colour = "red") +
  ### very ordinary stuff from here on!
  geom_vline(data = tmpTibStats,
             aes(xintercept = mean),
             colour = "green") +
  geom_vline(data = tmpTibStats,
             aes(xintercept = median),
             colour = "blue") +
  ggtitle("Histograms of Gaussian and standardised cubed Gaussian distributions",
          subtitle = "Blue reference lines mark means, green lines mark medians")

Summary/moral

There are many ways to do most things in R and sometimes I forget the one I used last and keep looking it up and finding outdated information on the internet so here’s this wityh two slightly cosmetically different ways of superimposing a density curve on a histogram.

History

Visit count

web counter

Last updated

Show code
cat(paste(format(Sys.time(), "%d/%m/%Y"), "at", format(Sys.time(), "%H:%M")))
04/05/2025 at 12:37

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY-SA 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Evans (2025, May 4). Chris (Evans) R SAFAQ: Distribution overlays in ggplot: after_stat(), stat_function() and facets. Retrieved from https://www.psyctc.org/R_blog/posts/2025-05-04-ggplot-overlays-afterstat-and-statfunction/

BibTeX citation

@misc{evans2025distribution,
  author = {Evans, Chris},
  title = {Chris (Evans) R SAFAQ: Distribution overlays in ggplot: after_stat(), stat_function() and facets},
  url = {https://www.psyctc.org/R_blog/posts/2025-05-04-ggplot-overlays-afterstat-and-statfunction/},
  year = {2025}
}