A correlation coefficient indexes whether increases and decreases in the values of one variable are associated with increases and decreases in the values of another variable. It is therefore a bivariate measure, i.e. it is about the relationship between two variables, and the observations must be paired. One example might be whether clients’ ages are related to how many sessions of therapy they attend in a service offering fairly open lengths of therapy; another might be whether scores on one assessment measure are systematically related to scores on a different measure (an index of “convergent validity” if the two measures are expected to show a positive relationship). The Pearson coefficient is one of a number of correlation coefficients, see details.

#### Details #

The Pearson is pretty much the paradigmatic correlation coefficient. It is an index of the arithmetic, i.e. linear, relationship between the pairs of scores. The first step in computing it is to replace each score on the first variable with its difference from the mean of that variable, and then to do the same for the other variable. Here’s an illustration for this very small dataset.

## Raw data for Pearson correlation #

ID | x | y |
---|---|---|
1 | -0.45 | -0.15 |
2 | -0.11 | -1.02 |
3 | 0.59 | 0.90 |
4 | 0.71 | 0.57 |

To see how to get from there to the Pearson correlation between those two variables, which is .74, see here in my Rblog post about correlation coefficients.

That blog post shows how the calculation works. The product of two negative values is positive, as, obviously, is the product of two positives, while the product of a negative and a positive is negative. That means that pairs in which a value *above* the mean of one variable is matched with a value *above* the mean of the other variable push the coefficient up, while pairs in which a value *below* the mean of one variable is matched with a value *above* the mean of the other push it down, so we end up with a nice index of association. The standardisation, the division by the product of the two SDs, means that the scale of the variables doesn’t affect the value of the coefficient and ensures that the coefficient must fall between -1 and +1. A coefficient of +1 indicates perfect positive correlation, i.e. all values of one variable are above the mean of that variable when the values of the other variable are above its mean, *and the deviations from the means are strictly proportional to each other*. A coefficient of -1 indicates perfect negative correlation, i.e. all values of one variable are above the mean of that variable when the values of the other variable are *below* its mean, *and the deviations from the means are strictly proportional to each other*.

To the extent that this interests us, that’s all there is to it and it’s good. However, it makes most sense if you can argue that the values of both variables have “interval scaling”: that equal differences in scores, regardless of the actual scores, have the same meaning. It doesn’t have to be the same scaling on each variable, so the correlation between Centigrade/Celsius temperatures and Fahrenheit temperatures will always be perfect, Pearson correlation +1.0, as the difference between 20 and 30 degrees on the Celsius scale is the same as that between 50 and 60 degrees Celsius, just as the difference between 20 and 30 degrees Fahrenheit is the same as that between 40 and 50 degrees Fahrenheit (even though 20 degrees Celsius is rather different from 20 degrees Fahrenheit, the one above the melting point of water at atmospheric pressure, the other well below it!) Though formally that interval scaling is a requirement if one is to treat two Pearson coefficients with the same value as indicating exactly the same relationship (whatever the data that created the two correlations), it is rarely satisfied for psychological variables.
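The temperature example is easy to verify. This little Python sketch (the Celsius values are arbitrary, chosen just for illustration) applies the linear conversion and uses `statistics.correlation` from the standard library (Python 3.10+):

```python
import statistics

celsius = [0, 10, 20, 25, 30, 37, 100]          # arbitrary example temperatures
fahrenheit = [c * 9 / 5 + 32 for c in celsius]  # the exact linear conversion

# Any perfect linear (positive-slope) relationship gives Pearson r = +1,
# whatever the units of the two scales
r = statistics.correlation(celsius, fahrenheit)
print(round(r, 10))  # 1.0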

This issue of scaling and the meaning of a Pearson correlation often gets muddled up with issues about “significance testing” an observed correlation and, just as important though less often mentioned, with issues about putting a “confidence interval” around an observed value. This is all unpacked more in my Rblog posts:

- Correlation coefficients (1) which goes into some of the alternative coefficients to the Pearson, and
- Correlation coefficients (2) which goes into significance testing of correlation coefficients but, vitally, loops us back to what this is all about: what are we doing when we reach for a correlation coefficient, and what is someone trying to do when they report one? These issues are far too often overlooked, and overlooking them can make use of a correlation coefficient very misleading.
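As a small taster of the confidence interval side of this, here is a sketch in Python of the conventional approach via Fisher’s z transformation. The function name is my own invention, and the inputs (r = .74 and n = 4, matching the tiny dataset above) are purely illustrative; with n that small the interval is, unsurprisingly, enormous, which is itself a useful reminder of why intervals matter.

```python
import math

def pearson_ci(r, n, z_crit=1.959964):
    """Approximate 95% CI for a Pearson correlation via Fisher's z.

    Transform r to z = atanh(r), whose sampling distribution is roughly
    Gaussian with standard error 1/sqrt(n - 3); build the interval on
    that scale, then back-transform the limits with tanh. Needs n > 3.
    """
    z = math.atanh(r)
    se = 1 / math.sqrt(n - 3)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

# Illustrative only: the r of .74 from the four-row dataset above
lo, hi = pearson_ci(0.74, 4)
print(f"95% CI: {lo:.2f} to {hi:.2f}")  # a huge interval spanning zero
```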

#### Try also #

Bootstrapping

Confidence intervals

Gaussian (“Normal”) distribution

Independence of observations

Interval scaling

Kendall’s correlation coefficients

Null hypothesis significance testing (NHST) paradigm

Scaling and Stevens’ levels of measurement

Spearman correlation coefficient

#### Chapters #

Mentioned in chapters 3 and 7.

#### Online resources #

- My Shiny apps:
- Those Rblog posts:

## Dates #

First created 28.vii.24, further improvements 5.viii.24 to 12.viii.24.