# Correlation

**Correlation**

A correlation between two random variables X and Y is a **standardized** measure of how much the two variables change together in a **linear** way. A correlation is usually denoted by 'r'. Its values range from -1 to +1. A strong positive correlation indicates that greater values in one variable correspond to greater values in the other variable. A strong negative correlation indicates that greater values in one variable correspond to smaller values in the other variable. A correlation of 0 indicates that there is no linear relationship between X and Y.

**Pearson's product moment correlation**

**Requirements:**

Both random variables must be at least interval scaled, and their joint distribution must be bivariate normal.

Illustration of a bivariate normal distribution

**Calculation:**

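Suppose we have two normally distributed random variables x and y observed on n cases.

x_{i} and y_{i} denote the values of x and y for case i, and \bar{x} and \bar{y} denote the sample means.

Then the correlation is defined as:

$$r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^{2}}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^{2}}}$$

If the standard deviations s_{x} and s_{y} as well as the covariance cov(x, y) are known, the correlation can be written as:

$$r = \frac{\operatorname{cov}(x,y)}{s_x\,s_y}$$

A correlation can also always be written as the averaged cross-product of the standardized values z_{x_i} and z_{y_i} of x and y (here standardized with the sample standard deviations, which use n-1 in the denominator):

$$r = \frac{1}{n-1}\sum_{i=1}^{n} z_{x_i}\,z_{y_i}$$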
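As a minimal sketch, the definition and the standardized cross-product form can be compared in Python. The function names `pearson_r` and `pearson_r_from_z` are hypothetical, and the second function assumes sample standard deviations with n-1 in the denominator.

```python
import math


def pearson_r(x, y):
    """Pearson's r from the deviations around the two means."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares of the deviations.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


def pearson_r_from_z(x, y):
    """Same value via the cross-product of standardized scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Sample standard deviations (assumption: n - 1 in the denominator).
    sd_x = math.sqrt(sum((xi - mean_x) ** 2 for xi in x) / (n - 1))
    sd_y = math.sqrt(sum((yi - mean_y) ** 2 for yi in y) / (n - 1))
    zx = [(xi - mean_x) / sd_x for xi in x]
    zy = [(yi - mean_y) / sd_y for yi in y]
    return sum(a * b for a, b in zip(zx, zy)) / (n - 1)
```

Both functions should return the same value up to floating-point rounding, which illustrates that the formulations are equivalent.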

**Spearman's rank correlation**

Spearman's rank correlation is applied if the random variables X and Y are ordinal scaled.

Spearman's rank correlation is identical to Pearson's product moment correlation computed after the values of both X and Y have been transformed into ranks (values range from 1 to N).

If there are no tied ranks, it can be rewritten as:

$$r_s = 1 - \frac{6\sum_{i=1}^{N} d_i^{2}}{N\,(N^{2}-1)}$$

and standard error:

where d_{i} denotes the rank difference of observation i.
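The rank transformation and the rank-difference shortcut can be sketched as follows. This is an illustration, not a reference implementation: the helper names `ranks` and `spearman_rho_no_ties` are hypothetical, and the shortcut 1 - 6Σd² / (N(N² - 1)) is assumed here, which is exact only when there are no tied values.

```python
def ranks(values):
    """Ranks 1..N; tied values receive the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the group while the next value is tied with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(pos, end + 1):
            result[order[k]] = mean_rank
        pos = end + 1
    return result


def spearman_rho_no_ties(x, y):
    """Shortcut 1 - 6*sum(d^2) / (N*(N^2 - 1)); exact only without ties."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    rank_x, rank_y = ranks(x), ranks(y)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))
```

With ties present, computing Pearson's correlation on the mean ranks returned by `ranks` is the safer route.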

**Interpretation**

Correlation does not imply causation.

In principle there are four different ways to interpret a correlation between two variables X and Y, assuming the correlation is not a coincidence:

X causes Y

Y causes X

X causes Y and Y causes X (bidirectional causation)

There is a third variable Z that causes both X and Y

No conclusion about the existence or the direction of a cause-and-effect relationship can be drawn from the mere fact that X and Y are correlated.

**Fisher's Z-transformation**

Fisher's Z-transformation is an approximate variance-stabilizing transformation of r when the two random variables X and Y follow a bivariate normal distribution. It is used, for example, when correlation coefficients are averaged and when testing certain hypotheses about correlations.

**Calculation:**

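$$Z = \frac{1}{2}\ln\left(\frac{1+r}{1-r}\right) = \operatorname{arctanh}(r)$$

where 'ln' is the natural logarithm function and 'arctanh' is the inverse hyperbolic tangent function.

The standard error of the Z-transformed correlation is

$$SE_Z = \frac{1}{\sqrt{n-3}}$$

So Fisher's Z-transformation and its inverse

$$r = \frac{e^{2Z}-1}{e^{2Z}+1} = \tanh(Z)$$

can be used to calculate confidence intervals for correlation coefficients.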
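A confidence interval built this way can be sketched in Python. The sketch assumes the usual forms Z = arctanh(r) and a standard error of 1/sqrt(n - 3), takes the critical value from the standard normal distribution, and uses hypothetical function names. `statistics.NormalDist` requires Python 3.8 or later.

```python
import math
from statistics import NormalDist


def fisher_z(r):
    """Fisher's Z-transformation, Z = arctanh(r)."""
    return math.atanh(r)


def inverse_fisher_z(z):
    """Back-transformation from Z to r, r = tanh(Z)."""
    return math.tanh(z)


def correlation_confidence_interval(r, n, confidence=0.95):
    """Approximate confidence interval for a correlation via Fisher's Z."""
    if n <= 3:
        raise ValueError("n must be greater than 3")
    z = fisher_z(r)
    standard_error = 1 / math.sqrt(n - 3)
    # Two-sided critical value of the standard normal distribution.
    critical = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    lower = inverse_fisher_z(z - critical * standard_error)
    upper = inverse_fisher_z(z + critical * standard_error)
    return lower, upper
```

The interval is symmetric on the Z scale but generally asymmetric around r after back-transformation.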

**Averaging correlations**

If the correlations originate from equal-sized samples, you can simply average the Z-transformed correlation coefficients and apply the inverse transformation to that mean.

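If sample sizes are not equal, the following formula applies:

$$\bar{Z} = \frac{\sum_{j=1}^{k}(n_j-3)\,Z_j}{\sum_{j=1}^{k}(n_j-3)}$$

where Z_{j} are the Z-transformed correlation coefficients and n_{j} are the corresponding sample sizes. The averaged correlation is the inverse Z-transformation of this weighted mean.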
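The averaging procedure might look like this in Python. The weights (n_j - 3) are an assumption based on the inverse variance of Fisher's Z, and the function name `average_correlation` is hypothetical.

```python
import math


def average_correlation(correlations, sample_sizes):
    """Average correlations on the Fisher Z scale, weighted by (n_j - 3)."""
    if len(correlations) != len(sample_sizes) or not correlations:
        raise ValueError("need one sample size per correlation")
    weights = [n - 3 for n in sample_sizes]
    if any(w <= 0 for w in weights):
        raise ValueError("every sample size must be greater than 3")
    z_values = [math.atanh(r) for r in correlations]
    mean_z = sum(w * z for w, z in zip(weights, z_values)) / sum(weights)
    # Transform the weighted mean back to the correlation scale.
    return math.tanh(mean_z)
```

With equal sample sizes the weights cancel, so the function reduces to the simple average of the Z-values followed by the inverse transformation.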

**Testing correlation hypotheses**

#### Case A) testing H_{0}: ρ=0

This is by far the most common case. Normally one wants to know whether an observed sample correlation r differs significantly from a hypothesized zero correlation in the population (ρ = 0).

In such a case the following t-test applies:

$$t = \frac{r\,\sqrt{n-2}}{\sqrt{1-r^{2}}}$$

The t-value has (n-2) degrees of freedom.
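As a sketch, the test statistic can be computed as below, assuming the standard form t = r·sqrt(n - 2) / sqrt(1 - r²). The function name is hypothetical, and the critical t-value still has to be looked up in a table or taken from a statistics library.

```python
import math


def correlation_t_value(r, n):
    """t statistic and degrees of freedom for testing H0: rho = 0."""
    if n <= 2:
        raise ValueError("n must be greater than 2")
    if abs(r) >= 1:
        raise ValueError("r must lie strictly between -1 and 1")
    t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
    return t, n - 2
```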

#### Case B) testing H_{0}: ρ=ρ_{0}≠0

Sometimes you want to test whether an observed correlation r differs from a well-known population correlation ρ_{0} that is itself different from zero.

In that case you can calculate the following z-value, which follows the standard normal distribution (CAUTION: do not confuse the z-value of the standard normal distribution with Fisher's Z-values):

$$z = \frac{Z - Z_0}{\sqrt{\dfrac{1}{n-3}}}$$

Z = Fisher's Z-transformation of the given correlation

Z_{0} = Fisher's Z-transformation of the well-known population correlation

n = sample size of the given correlation
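A minimal Python sketch of this test, assuming z = (Z - Z_0)·sqrt(n - 3) and a two-sided p-value from the standard normal distribution. The function name is hypothetical, and `statistics.NormalDist` needs Python 3.8 or later.

```python
import math
from statistics import NormalDist


def z_test_against_rho0(r, rho0, n):
    """z-value and two-sided p-value for H0: rho = rho0 (rho0 != 0)."""
    if n <= 3:
        raise ValueError("n must be greater than 3")
    # Difference of the Fisher Z-values divided by the standard error.
    z = (math.atanh(r) - math.atanh(rho0)) * math.sqrt(n - 3)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value
```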

#### Case C) testing H_{0}: ρ_{1}=ρ_{2}

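If you want to test whether two correlation coefficients from two independent samples differ significantly, the following z-value is applicable:

$$z = \frac{Z_1 - Z_2}{\sqrt{\dfrac{1}{n_1-3}+\dfrac{1}{n_2-3}}}$$

where Z_{1} and Z_{2} are the Fisher Z-transformed correlations and n_{1} and n_{2} are the two sample sizes.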
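The comparison of two independent correlations can be sketched as follows. The sketch assumes that the difference of the two Fisher Z-values is divided by sqrt(1/(n_1 - 3) + 1/(n_2 - 3)); the function name is hypothetical.

```python
import math
from statistics import NormalDist


def z_test_two_correlations(r1, n1, r2, n2):
    """z-value and two-sided p-value for H0: rho1 = rho2 (independent samples)."""
    if n1 <= 3 or n2 <= 3:
        raise ValueError("both sample sizes must be greater than 3")
    standard_error = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    z = (math.atanh(r1) - math.atanh(r2)) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value
```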

#### Case D) testing H_{0}: ρ_{1}=ρ_{2}=...=ρ_{k}

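If you want to test whether k correlation coefficients from k independent samples differ significantly, the following χ^{2}-distributed value is applicable as the test value:

$$\chi^{2} = \sum_{j=1}^{k}(n_j-3)\,(Z_j-U)^{2}, \qquad U = \frac{\sum_{j=1}^{k}(n_j-3)\,Z_j}{\sum_{j=1}^{k}(n_j-3)}$$

where Z_{j} are the Fisher Z-transformed correlations and n_{j} are the corresponding sample sizes.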

The χ^{2}-value has k-1 degrees of freedom.
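A sketch of the test value in Python, assuming the weighted sum of squared deviations of the Fisher Z-values from their (n_j - 3)-weighted mean. The function name is hypothetical, and the p-value is left out because the Python standard library has no chi-square distribution.

```python
import math


def chi_square_homogeneity(correlations, sample_sizes):
    """Chi-square test value and degrees of freedom for H0: all rho_j equal."""
    if len(correlations) != len(sample_sizes) or len(correlations) < 2:
        raise ValueError("need at least two correlations with sample sizes")
    weights = [n - 3 for n in sample_sizes]
    if any(w <= 0 for w in weights):
        raise ValueError("every sample size must be greater than 3")
    z_values = [math.atanh(r) for r in correlations]
    # Weighted mean of the Fisher Z-values.
    u = sum(w * z for w, z in zip(weights, z_values)) / sum(weights)
    chi_square = sum(w * (z - u) ** 2 for w, z in zip(weights, z_values))
    return chi_square, len(correlations) - 1
```

For k = 2 the test value equals the square of the z-value obtained when comparing two independent correlations.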

**Example of a Correlation**

|   | x | y | x^{2} | y^{2} | x*y |
|---|---|---|---|---|---|
|   | 2 | 1 | 4 | 1 | 2 |
|   | 1 | 2 | 1 | 4 | 2 |
|   | 9 | 6 | 81 | 36 | 54 |
|   | 5 | 4 | 25 | 16 | 20 |
|   | 3 | 2 | 9 | 4 | 6 |
| Σ | 20 | 15 | 120 | 61 | 84 |

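With n = 5 cases, the correlation computed from the column sums is:

$$r = \frac{n\sum xy - \sum x\sum y}{\sqrt{\left(n\sum x^{2}-(\sum x)^{2}\right)\left(n\sum y^{2}-(\sum y)^{2}\right)}} = \frac{5\cdot 84 - 20\cdot 15}{\sqrt{(5\cdot 120-20^{2})(5\cdot 61-15^{2})}} = \frac{120}{\sqrt{200\cdot 80}} \approx 0.949$$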

We get a positive correlation, so greater values in x correspond to greater values in y. The following figure illustrates this:

The t-value for testing whether the correlation differs significantly from zero is:

$$t = \frac{r\,\sqrt{n-2}}{\sqrt{1-r^{2}}} = \frac{0.9487\cdot\sqrt{3}}{\sqrt{1-0.9}} \approx 5.196$$

Because this t-value is greater than the critical t-value for a two-sided (non-directional) test (t_{(df=3, alpha=0.05)}=3.182), we can say that the obtained correlation coefficient differs significantly from zero.
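The example can be reproduced in a few lines of Python. This sketch assumes the computational form of Pearson's r based on the column sums of the table, and the usual t statistic with n - 2 degrees of freedom. The variable names are illustrative only.

```python
import math

x = [2, 1, 9, 5, 3]
y = [1, 2, 6, 4, 2]
n = len(x)

# Column sums, as in the last row of the table.
sum_x, sum_y = sum(x), sum(y)
sum_x2 = sum(v ** 2 for v in x)
sum_y2 = sum(v ** 2 for v in y)
sum_xy = sum(a * b for a, b in zip(x, y))

# Pearson's r from the column sums.
r = (n * sum_xy - sum_x * sum_y) / math.sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)

# t statistic for H0: rho = 0 with n - 2 degrees of freedom.
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

print(f"r = {r:.4f}, t = {t:.3f}, df = {n - 2}")
# prints: r = 0.9487, t = 5.196, df = 3
```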

BrightStat output of the correlation example

**References**

Galton, F. (1886). Regression towards mediocrity in hereditary stature. *Journal of the Anthropological Institute of Great Britain and Ireland, 15,* 246–263.

Pearson, K. (1895). Notes on regression and inheritance in the case of two parents. *Proceedings of the Royal Society of London, 58*, 240–242.

Fisher, R.A. (1915). Frequency distribution of the values of the correlation coefficient in samples from an indefinitely large population. *Biometrika, 10*(4), 507–521.

Bortz, J. (2005). *Statistik für Human- und Sozialwissenschaftler* (6th ed.). Heidelberg: Springer Medizin Verlag.