# More piping, and rowwise()

This extends https://www.psyctc.org/Rblog/posts/2021-02-07-why-pipe-why-the-tidyverse/ and introduces rowwise()

Chris Evans https://www.psyctc.org/R_blog/ (PSYCTC.org)https://www.psyctc.org/psyctc/
2021-02-10

This is a slight adaptation of a file I did for Emily (https://www.researchgate.net/profile/Emily_Blackshaw2) back in October 2020 when she and wanted to look at whether Cronbach’s alpha for the YP-CORE varied from session to session across help-seeking clients data: a very basic exploration of longitudinal measurement invariance. I realised it was a good chance for me to pull together what I had been learning back then about piping and to share it with her.

As a page here it probably should have come before https://www.psyctc.org/Rblog/posts/2021-02-07-why-pipe-why-the-tidyverse/, or been woven into that, but I had managed to lose the file (worrying). However, I think it complements what I put in there and it does introduce the rowwise() function and c_across().

As is my wont, I prefer to explore methods with simulated data so the first step was to make such data. Here I am simulating 500 clients each having ten sessions and just a five item questionnaire (the YP-CORE has ten items but five is quicker and fits output more easily!)

Show code
``````### make some nonsense data
library(tidyverse)
nParticipants <- 500
nSessions <- 10
session <- rep(1:nSessions, nParticipants) # 1,2,3 ...10, 1,23 ...10 ...
session %>%
as_tibble() %>%  # turn from vector to tibble, that means I have rename it back to the vector name!
rename(session = value) %>%
mutate(baseVar = rnorm(nParticipants*nSessions),  # this creates a new variable in the tibble and sort of reminds me that variables may be vectors
item1 = baseVar + 0.7*rnorm(nParticipants*nSessions), # creates a first item
item2 = baseVar + 0.7*rnorm(nParticipants*nSessions), # and a second
item3 = baseVar + 0.7*rnorm(nParticipants*nSessions), # and a third
item4 = baseVar + 0.7*rnorm(nParticipants*nSessions), # and a 4th ...
item5 = baseVar + 0.7*rnorm(nParticipants*nSessions)) -> tmpDat

### look at it
tmpDat
``````
``````# A tibble: 5,000 x 7
session  baseVar   item1  item2   item3   item4  item5
<int>    <dbl>   <dbl>  <dbl>   <dbl>   <dbl>  <dbl>
1       1  0.00631 -0.194  -0.280  0.0798 -0.331  -0.420
2       2  2.71     2.39    2.94   2.15    3.45    1.64
3       3 -0.868   -0.413  -0.709 -0.938  -1.45   -0.662
4       4 -0.601   -1.51   -1.43  -0.177  -0.541  -0.979
5       5 -0.478   -1.26   -1.22   0.0396  0.667  -0.668
6       6  0.449    0.517   1.06   0.526   0.645  -0.695
7       7 -0.0757   0.413   0.868  0.163  -0.549   0.192
8       8 -0.599    0.0248 -1.97  -0.0925 -1.53    0.944
9       9 -0.0810   0.576   0.830 -0.121  -0.0841 -0.503
10      10  1.79     0.944   1.41   2.31    1.28    1.97
# … with 4,990 more rows``````
Show code
``````### check the simple correlation
cor(tmpDat[, 3:7])
``````
``````          item1     item2     item3     item4     item5
item1 1.0000000 0.6799230 0.6699420 0.6748932 0.6769445
item2 0.6799230 1.0000000 0.6661796 0.6793697 0.6839315
item3 0.6699420 0.6661796 1.0000000 0.6631769 0.6702323
item4 0.6748932 0.6793697 0.6631769 1.0000000 0.6729270
item5 0.6769445 0.6839315 0.6702323 0.6729270 1.0000000``````
Show code
``````### OK, I can play with that, here's the overall alpha (meaningless even for the simulation really but just checking)
psychometric::alpha(tmpDat[, 3:7])
``````
``[1] 0.9117041``

OK. Now I could start playing with the data in the tidyverse/dplyr/piping way. The key thing to remember is that the default behaviour of mutate() or summarise() within group_by() in dplyr is for a function to act on a vertical vector, i.e. on a variable

Show code
``````tmpDat %>%
group_by(session) %>%
summarise(mean1 = mean(item1))
``````
``````# A tibble: 10 x 2
session    mean1
*   <int>    <dbl>
1       1 -0.0335
2       2 -0.00275
3       3  0.0434
4       4 -0.0244
5       5  0.0716
6       6  0.0271
7       7  0.103
8       8 -0.0995
9       9 -0.0311
10      10 -0.0835 ``````

So that simply got us the mean for item1 across all completions but broken down by session. Trivial dplyr/piping but I still find it satisfying in syntax and in its utility.

As introduced in https://www.psyctc.org/Rblog/posts/2021-02-07-why-pipe-why-the-tidyverse/, if I have a function that returns more than one value dplyr handles this nicely but I have to tell it the function is creating a list (even if it’s just a vector), as below. The catch to remember is that you then have to unnest() the list to see its values, usually unnest_wider() is what I want but there is unnest_longer().

Show code
``````tmpDat %>%
group_by(session) %>%
summarise(summary1 = list(summary(item1))) %>%
unnest_wider(summary1)
``````
``````# A tibble: 10 x 7
session  Min. `1st Qu.`   Median     Mean `3rd Qu.`  Max.
<int> <dbl>     <dbl>    <dbl>    <dbl>     <dbl> <dbl>
1       1 -4.20    -0.941 -0.0607  -0.0335      0.847  4.29
2       2 -4.44    -0.853  0.0633  -0.00275     0.816  3.24
3       3 -3.34    -0.791  0.0556   0.0434      0.840  4.36
4       4 -3.74    -0.816 -0.0150  -0.0244      0.744  3.68
5       5 -3.26    -0.787  0.103    0.0716      1.02   3.61
6       6 -3.55    -0.731 -0.00969  0.0271      0.767  3.32
7       7 -3.32    -0.724  0.0880   0.103       0.959  4.52
8       8 -4.14    -0.910 -0.111   -0.0995      0.746  3.13
9       9 -4.62    -0.815 -0.0212  -0.0311      0.810  3.34
10      10 -3.88    -0.908 -0.0993  -0.0835      0.691  3.60``````
Show code
``````###  names are messy but it is easy to solve that ...

tmpDat %>%
group_by(session) %>%
summarise(summary1 = list(summary(item1))) %>%
unnest_wider(summary1) %>%
###  sometimes you have to clean up names that start
###  with numbers or include spaces if you want to avoid backtick quoting
rename(Q1 = `1st Qu.`,
Q3 = `3rd Qu.`)
``````
``````# A tibble: 10 x 7
session  Min.     Q1   Median     Mean    Q3  Max.
<int> <dbl>  <dbl>    <dbl>    <dbl> <dbl> <dbl>
1       1 -4.20 -0.941 -0.0607  -0.0335  0.847  4.29
2       2 -4.44 -0.853  0.0633  -0.00275 0.816  3.24
3       3 -3.34 -0.791  0.0556   0.0434  0.840  4.36
4       4 -3.74 -0.816 -0.0150  -0.0244  0.744  3.68
5       5 -3.26 -0.787  0.103    0.0716  1.02   3.61
6       6 -3.55 -0.731 -0.00969  0.0271  0.767  3.32
7       7 -3.32 -0.724  0.0880   0.103   0.959  4.52
8       8 -4.14 -0.910 -0.111   -0.0995  0.746  3.13
9       9 -4.62 -0.815 -0.0212  -0.0311  0.810  3.34
10      10 -3.88 -0.908 -0.0993  -0.0835  0.691  3.60``````

Again, as I introduced in https://www.psyctc.org/Rblog/posts/2021-02-07-why-pipe-why-the-tidyverse/, I can extend this to handle more than one vector/variable at a time if they’re similar and I’m doing the same to each.

Show code
``````tmpDat %>%
group_by(session) %>%
summarise(across(starts_with("item"), ~mean(.x)))
``````
``````# A tibble: 10 x 6
session    item1    item2   item3    item4    item5
*   <int>    <dbl>    <dbl>   <dbl>    <dbl>    <dbl>
1       1 -0.0335   0.0376   0.0144 -0.0350  -0.00901
2       2 -0.00275  0.0188   0.0404  0.00395  0.0376
3       3  0.0434   0.0526   0.0794 -0.0348   0.00194
4       4 -0.0244  -0.0320  -0.0594 -0.0144  -0.0160
5       5  0.0716   0.0197   0.0151  0.0777   0.0842
6       6  0.0271  -0.00327  0.0127  0.0834   0.00580
7       7  0.103    0.0951   0.0690  0.0499   0.0918
8       8 -0.0995  -0.0370  -0.0687 -0.106   -0.102
9       9 -0.0311  -0.0550   0.0211 -0.0338   0.0361
10      10 -0.0835  -0.0144  -0.0844 -0.0772  -0.0163 ``````

I can also do that with the following syntax. I have not yet really understood why the help for `across()` gives that one with function syntax (“~”) and the explicit call of ".x) rather than this and I really ought to get my head around the pros and cons of each.

Show code
``````tmpDat %>%
group_by(session) %>%
summarise(across(starts_with("item"), mean))
``````
``````# A tibble: 10 x 6
session    item1    item2   item3    item4    item5
*   <int>    <dbl>    <dbl>   <dbl>    <dbl>    <dbl>
1       1 -0.0335   0.0376   0.0144 -0.0350  -0.00901
2       2 -0.00275  0.0188   0.0404  0.00395  0.0376
3       3  0.0434   0.0526   0.0794 -0.0348   0.00194
4       4 -0.0244  -0.0320  -0.0594 -0.0144  -0.0160
5       5  0.0716   0.0197   0.0151  0.0777   0.0842
6       6  0.0271  -0.00327  0.0127  0.0834   0.00580
7       7  0.103    0.0951   0.0690  0.0499   0.0918
8       8 -0.0995  -0.0370  -0.0687 -0.106   -0.102
9       9 -0.0311  -0.0550   0.0211 -0.0338   0.0361
10      10 -0.0835  -0.0144  -0.0844 -0.0772  -0.0163 ``````

Again, as I introduced in https://www.psyctc.org/Rblog/posts/2021-02-07-why-pipe-why-the-tidyverse/, I can do multiple functions of the same items

Show code
``````tmpDat %>%
group_by(session) %>%
summarise(across(starts_with("item"), list(mean = mean, sd = sd)))
``````
``````# A tibble: 10 x 11
session item1_mean item1_sd item2_mean item2_sd item3_mean item3_sd
*   <int>      <dbl>    <dbl>      <dbl>    <dbl>      <dbl>    <dbl>
1       1   -0.0335      1.25    0.0376      1.25     0.0144     1.22
2       2   -0.00275     1.28    0.0188      1.21     0.0404     1.21
3       3    0.0434      1.26    0.0526      1.25     0.0794     1.26
4       4   -0.0244      1.20   -0.0320      1.20    -0.0594     1.18
5       5    0.0716      1.25    0.0197      1.25     0.0151     1.21
6       6    0.0271      1.12   -0.00327     1.18     0.0127     1.14
7       7    0.103       1.25    0.0951      1.22     0.0690     1.20
8       8   -0.0995      1.25   -0.0370      1.25    -0.0687     1.23
9       9   -0.0311      1.18   -0.0550      1.23     0.0211     1.21
10      10   -0.0835      1.18   -0.0144      1.24    -0.0844     1.19
# … with 4 more variables: item4_mean <dbl>, item4_sd <dbl>,
#   item5_mean <dbl>, item5_sd <dbl>``````

I like that that names things sensibly

I said the default behaviour of mutate() and summarise() is to work on variables, i.e. vectors, whether that is to work on all the values of the variable if there is no group_by(), or within the groups if there is a grouping. If I want to do something on individual values, i.e. by rows, “rowwise”, then I have to use `rowwise()` which basically treats each row as a group.

If, as you often will in that situation, you want to use a function of more than one value, i.e. values from more than one variable, then you have to remember to use `c_across()` now, not `across()`: “c_” as it’s by column.

You also have to remember to `ungroup()` after any `mutate()` as you probably don’t want future functions to handle things one row at a time.

Show code
``````tmpDat %>%
filter(row_number() < 6) %>% # just for this example
rowwise() %>%
mutate(mean = mean(c_across(starts_with("item")))) %>%
ungroup() # see above about ungrouping after rowwise() and mutate()
``````
``````# A tibble: 5 x 8
session  baseVar  item1  item2   item3  item4  item5   mean
<int>    <dbl>  <dbl>  <dbl>   <dbl>  <dbl>  <dbl>  <dbl>
1       1  0.00631 -0.194 -0.280  0.0798 -0.331 -0.420 -0.229
2       2  2.71     2.39   2.94   2.15    3.45   1.64   2.51
3       3 -0.868   -0.413 -0.709 -0.938  -1.45  -0.662 -0.835
4       4 -0.601   -1.51  -1.43  -0.177  -0.541 -0.979 -0.926
5       5 -0.478   -1.26  -1.22   0.0396  0.667 -0.668 -0.488``````

OK, so that’s recapped these things, now what about if I want to look at multiple columns and multiple rows? the trick seems to be `cur_data()`.

That gives me a sensible digression from Cronbach’s alpha here as I often find I’m wanting to get correlation matrices when I’m wanting to get alpha (and its CI) and I think getting correlation matrices from grouped data ought to be much easier than it is!

Show code
``````tmpDat %>%
select(item1:item5) %>%
summarise(cor = list(cor(cur_data()))) %>%
unnest_wider(cor)
``````
``````# A tibble: 1 x 25
...1  ...2  ...3  ...4  ...5  ...6  ...7  ...8  ...9 ...10 ...11
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1     1 0.680 0.670 0.675 0.677 0.680     1 0.666 0.679 0.684 0.670
# … with 14 more variables: ...12 <dbl>, ...13 <dbl>, ...14 <dbl>,
#   ...15 <dbl>, ...16 <dbl>, ...17 <dbl>, ...18 <dbl>, ...19 <dbl>,
#   ...20 <dbl>, ...21 <dbl>, ...22 <dbl>, ...23 <dbl>, ...24 <dbl>,
#   ...25 <dbl>``````

That, as you can see, is a right old mess!

but we can use `correlate()` from the corrr package:

Show code
``````tmpDat %>%
select(item1:item5) %>%
corrr::correlate()
``````
``````# A tibble: 5 x 6
term   item1  item2  item3  item4  item5
<chr>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1 item1 NA      0.680  0.670  0.675  0.677
2 item2  0.680 NA      0.666  0.679  0.684
3 item3  0.670  0.666 NA      0.663  0.670
4 item4  0.675  0.679  0.663 NA      0.673
5 item5  0.677  0.684  0.670  0.673 NA    ``````

As you see, `corrr::correlate()` puts NA in the leading diagonal not 1.0. That does make finding the maximum off diagonal correlations easy but I confess it seems wrong to me!

What about using that and `group_by()`?

Show code
``````tmpDat %>%
select(-baseVar) %>%
group_by(session) %>%
corrr::correlate()
``````
``````# A tibble: 6 x 7
term     session   item1   item2   item3   item4    item5
<chr>      <dbl>   <dbl>   <dbl>   <dbl>   <dbl>    <dbl>
1 session NA       -0.0146 -0.0152 -0.0200 -0.0114 -0.00498
2 item1   -0.0146  NA       0.680   0.670   0.675   0.677
3 item2   -0.0152   0.680  NA       0.666   0.679   0.684
4 item3   -0.0200   0.670   0.666  NA       0.663   0.670
5 item4   -0.0114   0.675   0.679   0.663  NA       0.673
6 item5   -0.00498  0.677   0.684   0.670   0.673  NA      ``````

Hm, that completely ignores the group_by() and includes session variable. That seems plain wrong to me. I feel sure this is something the package will eventually change but for now I need another way to get what I want.

Show code
``````tmpDat %>%
select(-baseVar) %>%
group_by(session) %>%
corrr::correlate(cur_data())
``````

I have not evaluated that as it stops with the moderately cryptic error message which I’m putting in here as I quite often forget the `summarise(x = )` bit

``````# Error: `cur_data()` must only be used inside dplyr verbs.
# Run `rlang::last_error()` to see where the error occurred.``````

So let’s fix that.

Show code
``````tmpDat %>%
select(-baseVar) %>%
group_by(session) %>%
summarise(cor = corrr::correlate(cur_data()))
``````
``````# A tibble: 50 x 2
# Groups:   session [10]
session cor\$term \$item1 \$item2 \$item3 \$item4 \$item5
<int> <chr>     <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1       1 item1    NA      0.711  0.711  0.715  0.678
2       1 item2     0.711 NA      0.704  0.699  0.714
3       1 item3     0.711  0.704 NA      0.700  0.677
4       1 item4     0.715  0.699  0.700 NA      0.693
5       1 item5     0.678  0.714  0.677  0.693 NA
6       2 item1    NA      0.669  0.670  0.690  0.679
7       2 item2     0.669 NA      0.644  0.679  0.680
8       2 item3     0.670  0.644 NA      0.664  0.704
9       2 item4     0.690  0.679  0.664 NA      0.679
10       2 item5     0.679  0.680  0.704  0.679 NA
# … with 40 more rows``````

Hm. That does get me the analyses I want but in what is, to my mind, a very odd structure.

OK, after that digression into the corrr package, let’s get to what Emily actually wanted: Cronbach’s alpha across the items but per session.

Show code
``````tmpDat %>%
select(-baseVar) %>%
group_by(session) %>%
summarise(alpha = psychometric::alpha(cur_data()))
``````
``````# A tibble: 10 x 2
session alpha
*   <int> <dbl>
1       1 0.921
2       2 0.912
3       3 0.919
4       4 0.903
5       5 0.908
6       6 0.901
7       7 0.911
8       8 0.912
9       9 0.910
10      10 0.915``````

I get my CI around alpha using the following code.

Show code
``````psychometric::alpha(tmpDat[, 3:7])
``````
``[1] 0.9117041``
Show code
``````getAlphaForBoot <- function(dat, i) {
# a little function that happens to use psych::alpha to get alpha
# but indexes it with i as boot() will require
psychometric::alpha(na.omit(dat[i,]))
}
getAlphaForBoot(tmpDat[, 3:7], 1:nrow(tmpDat)) # just checking that it works
``````
``[1] 0.9117041``
Show code
``````bootReps <- 1000
getCIAlphaDF3 <- function(dat, ciInt = .95, bootReps = 1000) {
tmpRes <- boot::boot(na.omit(dat), getAlphaForBoot, R = bootReps)
tmpCI <- boot::boot.ci(tmpRes, conf = ciInt, type = "perc")\$percent[4:5]
return(data.frame(alpha = tmpRes\$t0,
LCL = tmpCI[1],
UCL = tmpCI[2]))
}
``````
``````      alpha       LCL      UCL
1 0.9117041 0.9079198 0.915301``````

Actually, now I have my CECPfuns package I create a better, more robust function for this, but later!

So that’s the overall Cronbach alpha with bootstrap confidence interval.

Can also do that within a `group_by()` grouping.

Show code
``````tmpDat %>%
select(-baseVar) %>%
group_by(session) %>%
unnest_wider(alpha)
``````
``````# A tibble: 10 x 4
session alpha   LCL   UCL
<int> <dbl> <dbl> <dbl>
1       1 0.921 0.910 0.931
2       2 0.912 0.900 0.923
3       3 0.919 0.908 0.929
4       4 0.903 0.889 0.916
5       5 0.908 0.894 0.919
6       6 0.901 0.885 0.915
7       7 0.911 0.895 0.923
8       8 0.912 0.899 0.923
9       9 0.910 0.895 0.921
10      10 0.915 0.903 0.926``````

And that was nice and easy to feed into a forest style plot, as follows.

Show code
``````tmpDat %>%
select(-baseVar) %>%
group_by(session) %>%
unnest_wider(alpha) -> tmpTib

psychometric::alpha(tmpDat[, 3:7]) -> tmpAlphaAll

ggplot(data = tmpTib,
aes(x = session, y = alpha)) +
geom_point() + # get the observed alphas in as points
geom_linerange(aes(ymin = LCL, ymax = UCL)) + # add the CIs as lines
geom_hline(yintercept = tmpAlphaAll) + # not really very meaningful to have an overall alpha but
# perhaps better than not having a reference line
xlab("Session") +
ylab("Cronbach alpha") +
ggtitle("Forest plot of observed Cronbach alpha per session",
subtitle = paste0("Vertical lines are 95% CIs, ",
bootReps,
" bootstrap replications, percentile method.")) +
theme_bw() + # nice clean theme
theme(plot.title = element_text(hjust = .5), # centre the title
plot.subtitle = element_text(hjust = .5)) # and subtitle
``````

Well, as you’d expect from the simulation method, no evidence of heterogeneity of Cronbach’s alpha across sessions!

I hope this is a useful further introduction to piping, dplyr and some of the tidyverse approach. I guess it introduced the corrr package, `cur_data()` and `rowwise()` … and it finished with a, for me, typical use of `ggplot()` (from the ggplot2 package.)

Do contact me if you have any comments, suggestions, corrections, improvements … anything!

### Reuse

Text and figures are licensed under Creative Commons Attribution CC BY-SA 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

### Citation

`Evans (2021, Feb. 10). Chris (Evans) R SAFAQ: More piping, and rowwise(). Retrieved from https://www.psyctc.org/R_blog/posts/2021-02-10-more-piping-introducing-rowwise/`
```@misc{evans2021more,