Chris (Evans) R SAFAQ: Autocorrelation (2)

Chris Evans

I did this largely to give a worked example to complement a very short entry in the glossary of the OMbook: here. If autocorrelation is completely new to you it’s probably worth looking at that short glossary entry before reading on here.

Show code

### this is just the code that creates the "copy to clipboard" function in the code blocks
htmltools::tagList(
  xaringanExtra::use_clipboard(
    button_text = "<i class=\"fa fa-clone fa-2x\" style=\"color: #301e64\"></i>",
    success_text = "<i class=\"fa fa-check fa-2x\" style=\"color: #90BE6D\"></i>",
    error_text = "<i class=\"fa fa-times fa-2x\" style=\"color: #F94144\"></i>"
  ),
  rmarkdown::html_dependency_font_awesome()
)

Show code

if (interactive()) {
  read_csv("./_posts/2026-01-15-autocorrelation2/autocorrelation2.csv") -> tibDat
} else {
  read_csv("autocorrelation2.csv") -> tibDat
}

### now get the first and last dates with usable pulse rates
tibDat %>%
  mutate(Date = as.Date(Dates, format = "%d/%m/%y")) -> tibDat

tibDat %>%
  filter(!is.na(minPR)) %>%
  summarise(first = min(Date),
            last = max(Date)) -> tmpTib

### and filter to within those dates
tibDat %>%
  filter(Date >= tmpTib$first & Date <= tmpTib$last) %>%
  mutate(dayN = row_number()) -> tibDat

### fill in any dates that were entirely missing in the data
tibble(Date = seq(tmpTib$first, tmpTib$last)) %>%
  left_join(tibDat, by = "Date") -> tibDat

I’m illustrating autocorrelation from a short dataset I created for my own amusement after buying a cheap watch that would tell me my waking and lowest overnight pulse rates. The data I’m using run from 2024-08-07 to 2026-01-15. Here are the simple descriptive statistics.

Show code

### get summary statistics
tibDat %>%
  summarise(first = min(Date),
            last = max(Date),
            n = n(),
            PRnNA = getNNA(PulseRate),
            PRnOK = n - PRnNA,
            PRmin = min(PulseRate, na.rm = TRUE),
            PRmean = mean(PulseRate, na.rm = TRUE),
            PRmedian = median(PulseRate, na.rm = TRUE),
            PRmax = max(PulseRate, na.rm = TRUE),
            PRsd = sd(PulseRate, na.rm = TRUE),
            minPRnNA = getNNA(minPR),
            minPRnOK = n - minPRnNA,
            minPRmin = min(minPR, na.rm = TRUE),
            minPRmean = mean(minPR, na.rm = TRUE),
            minPRmedian = median(minPR, na.rm = TRUE),
            minPRmax = max(minPR, na.rm = TRUE),
            minPRsd = sd(minPR, na.rm = TRUE)) -> tibStats

tibStats %>%
  select(-c(first, last)) %>%
  pivot_longer(cols = everything(),
               names_to = "Statistic") %>%
  ### create grouping
  mutate(whichPR = if_else(str_detect(Statistic, fixed("minPR")),
                           "Min overnight PR",
                           "Waking PR"),
         whichPR = if_else(Statistic == "n", 
                           "Overall",
                           whichPR)) %>%
  as_grouped_data(groups = "whichPR") %>%
  flextable() %>%
  colformat_double(digits = 2) %>%
  autofit()

whichPR	Statistic	value
Overall
	n	527.00
Waking PR
	PRnNA	71.00
	PRnOK	456.00
	PRmin	51.00
	PRmean	64.06
	PRmedian	62.00
	PRmax	95.00
	PRsd	7.37
Min overnight PR
	minPRnNA	78.00
	minPRnOK	449.00
	minPRmin	45.00
	minPRmean	52.42
	minPRmedian	52.00
	minPRmax	62.00
	minPRsd	2.69

This plot shows the waking pulse values.

Show code

ggplot(data = tibDat,
       aes(x = Date, y = PulseRate)) +
  geom_point() +
  geom_line() +
  geom_smooth() +
  geom_hline(yintercept = tibStats$PRmean,
             linetype = 3) +
  geom_linerange(data = filter(tibDat,
                               is.na(PulseRate)),
                 aes(x = Date,
                     ymin = -Inf,
                     ymax = Inf),
                     colour = "red") +
  ylab("Waking pulse rate") +
  scale_x_date(date_breaks = "1 month", date_labels =  "%b %Y") +
  theme(axis.text.x = element_text(angle = 70, hjust = 1)) +
  ggtitle("Waking pulse rates")

The red lines mark days when for some reason I don’t have a value for my waking pulse, the black points are do the values where I have them (doh!) and the blue line is the “loess” smoothed curve fitting the data and the shaded area around that is its 95% confidence interval. The dashed horizontal line is the arithmetic mean of the values.

That shows a rather depressing but realistic decline in my fitness!

Here’s the histogram of the values.

Show code

tibble(x = seq(tibStats$PRmin, tibStats$PRmax, 1)) %>%
  mutate(y = dnorm(x, mean = tibStats$PRmean, sd = tibStats$PRsd)) -> tibNormFitPR

ggplot(data = tibDat,
       aes(x = PulseRate)) +
  geom_histogram(aes(x = PulseRate,
                     after_stat(density)),
                 bins = 20) +
  geom_vline(xintercept = tibStats$PRmean,
             colour = "blue") +
  geom_vline(xintercept = tibStats$PRmedian,
             colour = "green") +
  geom_line(data = tibNormFitPR,
            aes(x = x, y = y),
            colour = "red") +
  xlab("Waking pulse rate") +
  ggtitle("Histogram of waking pulse rates")

The blue line marks the mean value and the green line the median, the red curve is the best fitting Gaussian distribution. It’s easy to see that the data are markedly right skew and not Gaussian!

Show code

ggplot(data = tibDat,
       aes(x = Date, y = minPR)) +
  geom_point() +
  geom_line() +
  geom_smooth() +
  geom_hline(yintercept = tibStats$minPRmean,
             linetype = 3) +
  geom_linerange(data = filter(tibDat,
                               is.na(minPR)),
                 aes(x = Date,
                     ymin = -Inf,
                     ymax = Inf,
                     colour = "red")) +
  ylab("Minimum overnight pulse rate") +
  scale_x_date(date_breaks = "1 month", date_labels =  "%b %Y") +
  theme(axis.text.x = element_text(angle = 70, hjust = 1)) +
  ggtitle("Minimum overnight pulse rates")

Hm. That’s a surprisingly different picture isn’t it?

Here’s the histogram for those values.

Show code

tibble(x = seq(tibStats$minPRmin, tibStats$minPRmax, .1)) %>%
  mutate(y = dnorm(x, mean = tibStats$minPRmean, sd = tibStats$minPRsd)) -> tibNormFitPR

ggplot(data = tibDat,
       aes(x = minPR)) +
  geom_histogram(aes(x = minPR,
                     after_stat(density)),
                 bins = 20) +
  geom_vline(xintercept = tibStats$minPRmean,
             colour = "blue") +
  geom_vline(xintercept = tibStats$minPRmedian,
             colour = "green") +
  geom_line(data = tibNormFitPR,
            aes(x = x, y = y),
            colour = "red") +
  xlab("Minimum overnight pulse rate") +
  ggtitle("Histogram of minimum overnight pulse rates")

Clearly on the data we have these data are not a bad fit to a Gaussian distribution. I wonder if that’s because the minimum is pretty fixed but varying, perhaps fairly randomly, day by day whereas the waking rate is going to be a function of my point in the sleep cycle when the alarm goes off and hence can be quite high but also down near the minimum. Probably showing how little I know about sleep!

Autocorrelation

OK, now we have a sense of the data and we can look at the autocorrelation.

Interpolation

However, as we have missing values we have no choice but to interpolate replacement values and the only sensible way to do this, to my mind, is linear interpolation. Linear interpolation is applied for each run of missing values. In linear interpolation the replacement value is the mean of the values either side of it if there is only one missing value, if there are two the first is replaced with the value that is one third of the sum of the values either side and the second is replaced with the value two thirds of the sum of the values either side. However many values are missing they are being replaced with values lying along a straight line from the non-missing value before the first missing value and and the value after the last missing value in that block of missing values.

This shows the interpolation of the waking pulse rates across the entire dataset.

Show code

### do the interpolation across all data
x <- zoo(tibDat$PulseRate, tibDat$dayN)
x <- as.ts(x, start = 1, frequency = 1)
xInterpolated <- na.interp(x)

### get that to a tibble and create dayN
tibble(interpolatedPR = as.numeric(xInterpolated)) %>%
  mutate(dayN = row_number()) -> tmpTib

tibDat %>%
  select(Date, dayN, PulseRate, PulseNA, minPR) %>%
  mutate(n = n(),
         Missing = if_else(!PulseNA, "Not missing", "Interpolated"),
         ### now identify the pre-missing 
         Missing = if_else((row_number() != n & # not the last observation
                              !is.na(PulseRate) & 
                              is.na(lead(PulseRate))), 
                           "PreMissing", 
                           Missing),
         ### I know that the dataset starts with a non-missing value so this is safe
         Missing = if_else(row_number() > 1 & # not the first observation
                             !is.na(PulseRate) &
                             is.na(lag(PulseRate)), 
                           "PostMissing", 
                           Missing)) -> tibDat

### merge in the interpolated values
tibDat %>%
  left_join(tmpTib, by = "dayN") -> tmpTib

ggplot(data = tmpTib,
       aes(x = Date, y = interpolatedPR)) +
  # geom_point(aes(y = interpolatedPR,
  #                colour = Missing),
  #            shape = "+",
  #            size = 3) +
  geom_point(aes(colour = Missing)) +
  geom_line(alpha = .2) +
  ylab("Waking pulse rate, measured or interpolated") +
  ggtitle("Plot demonstrating interpolation of missing pulse rates",
          subtitle = "Colour shows whether value is interpolated, pre or post missing or observed")

OK, now we have complete, if repaired, data we can look at the ACF.

Waking pulse rate

I have used the ggtsdisplay() function out of the forecast package to get this plot.

Show code

acf(xInterpolated, plot = FALSE, lag.max = 10)$acf -> vecACFwaking # numbers off 1 to 11 with #1 always 1.0
pacf(xInterpolated, plot = FALSE, lag.max = 10)$acf -> vecPACFwaking # numbers off 1 to 10, i.e. starting with lag 2

ggtsdisplay(xInterpolated, lag.max = 10)

The top plot is just the interpolation repaired waking pulse data but the x axis is the day count not the actual date. (I’m being lazy and haven’t explored whether I could get that back to labelling by date, I suspect it would be tricky). The crucial plots are the two below that. Their y axes are correlation values and the blue dashed lines use some statistical theory to show the values above or below which we might regard them as statistically significant with p < .05. The x axes are the lag values so they start at 1 for the simple lag 1. Because there is a multiple tests issues here, oversimplifying a very little, lags above 2 that look to be statistically significant, i.e. they are outside those blue dashed bounds are highly unlikely to be meaningful.

So that’s the ACF, what’s the PACF? This is the partial ACF and it’s the correlation after partialling out all of the the earlier lags.

Minimum overnight pulse rate

Here is the same for the minimum overnight pulse rates.

Show code

### do the interpolation across all data
x <- zoo(tibDat$minPR, tibDat$dayN)
x <- as.ts(x, start = 1, frequency = 1)
xInterpolated <- na.interp(x)

acf(xInterpolated, plot = FALSE, lag.max = 10)$acf -> vecACFminPR # numbers off 1 to 11 with #1 always 1.0
pacf(xInterpolated, plot = FALSE, lag.max = 10)$acf -> vecPACFminPR # numbers off 1 to 10, i.e. starting with lag 2

ggtsdisplay(xInterpolated, lag.max=10)

That shows a rather different picture from that for the waking pulse rates. All the simple ACF correlations up to lag 10 look to be statistically significant and the lag 1 ACF correlation is 0.60 which is pretty high. Partialling lag 1 out still leaves the PACF for lag 2 significant at 0.30. In fact, in marked contrast to my data about a year ago there seems to be just a steady decline in the partial correlations down 0.11 for lag 7.

Summary

Well I guess this shows that more data, and time dependent processes like loss of fitness with age (and with other challenges!) can change things.

Visit count

website counter

Last updated

Show code

cat(paste(format(Sys.time(), "%d/%m/%Y"), "at", format(Sys.time(), "%H:%M")))

22/04/2026 at 16:01

Autocorrelation (2)

Autocorrelation

Interpolation

Waking pulse rate

Minimum overnight pulse rate

Summary

Last updated

Reuse

Citation