Correlation
A correlation is a standardized measure of how strongly two random variables X and Y change together in a linear way. It is usually denoted by 'r', and its values range from -1 to +1. A strong positive correlation indicates that greater values in one variable correspond to greater values in the other variable. A strong negative correlation indicates that greater values in one variable correspond to smaller values in the other variable. A correlation of 0 indicates that there is no linear relationship between the two variables.
Pearson's product moment correlation
Requirements:
Both random variables must be at least interval scaled, and their joint distribution must be bivariate normal.
Illustration of a bivariate normal distribution
Calculation:
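Suppose we have two normally distributed random variables x and y with n observed cases.
x_i and y_i denote the values of x and y for case i, and \bar{x} and \bar{y} denote the sample means.
Then the correlation is defined as:
$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}$$
If the standard deviations of x and y (s_x and s_y) as well as the covariance between x and y (s_xy) are known, the correlation can be written as:
$$r = \frac{s_{xy}}{s_x \, s_y}$$
A correlation can also always be written as the mean cross-product of the standardized values of x and y:
$$r = \frac{1}{n - 1} \sum_{i=1}^{n} z_{x_i} z_{y_i}, \qquad z_{x_i} = \frac{x_i - \bar{x}}{s_x}, \quad z_{y_i} = \frac{y_i - \bar{y}}{s_y}$$
Here s_x and s_y are the sample standard deviations (computed with n - 1 in the denominator).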
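As a rough check that these forms agree, the following sketch computes r once from the deviations about the means and once as the mean cross-product of standardized values. The function names are illustrative, not part of any library, and the data are the five cases from the example table further down this page.

```python
import math

def pearson_r(x, y):
    """Pearson correlation from deviations about the means."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def pearson_r_from_z(x, y):
    """Same coefficient as the mean cross-product of standardized values."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Sample standard deviations (n - 1 in the denominator).
    s_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / (n - 1))
    s_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / (n - 1))
    z_x = [(a - mean_x) / s_x for a in x]
    z_y = [(b - mean_y) / s_y for b in y]
    return sum(a * b for a, b in zip(z_x, z_y)) / (n - 1)

x = [2, 1, 9, 5, 3]
y = [1, 2, 6, 4, 2]
print(round(pearson_r(x, y), 4), round(pearson_r_from_z(x, y), 4))  # 0.9487 0.9487
```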
Spearman's rank correlation
Spearman's rank correlation is applied if the random variables X and Y are ordinal scaled.
Spearman's rank correlation is identical to Pearson's product moment correlation computed after the values of both X and Y have been transformed into ranks (values range from 1 to N).
If there are no tied ranks, it can be rewritten as:
$$r_s = 1 - \frac{6 \sum_{i=1}^{N} d_i^2}{N (N^2 - 1)}$$
and standard error:
where d_i denotes the rank difference of observation i.
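The equivalence between Pearson's correlation on ranks and the rank-difference form can be sketched as follows. This is a minimal illustration that assumes there are no tied values; the helper names and the data are hypothetical and not taken from this page.

```python
import math

def ranks(values):
    """Return ranks 1..N for the values; this sketch assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def pearson_r(x, y):
    """Pearson correlation from deviations about the means."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman_from_differences(x, y):
    """Spearman correlation from squared rank differences (no ties)."""
    n = len(x)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# Hypothetical, tie-free data.
x = [10, 20, 30, 40, 50]
y = [1, 3, 2, 5, 4]
r_ranks = pearson_r(ranks(x), ranks(y))
r_d = spearman_from_differences(x, y)
print(round(r_ranks, 4), round(r_d, 4))  # both are 0.8
```

With tied values the two computations generally differ, and ranks would need to be averaged within ties before applying Pearson's formula.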
Interpretation
Correlation does not imply causation.
In principle, there are four different ways to interpret a correlation between two variables X and Y, provided the correlation is not a coincidence:
X causes Y
Y causes X
X causes Y and Y causes X (bidirectional causation)
There is a third variable Z that causes both X and Y
From the mere fact that X and Y are correlated, no conclusion can be drawn about the existence or the direction of a cause-and-effect relationship.
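The third-variable case can be illustrated with a small simulation. In this sketch, which uses made-up parameters (sample size 2000, noise standard deviation 0.5, a fixed seed), Z drives both X and Y, while X and Y have no direct influence on each other. They still come out strongly correlated.

```python
import math
import random

def pearson_r(x, y):
    """Pearson correlation from deviations about the means."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(42)  # fixed seed so the run is repeatable

# Z is the common cause; X and Y each equal Z plus independent noise.
z = [rng.gauss(0, 1) for _ in range(2000)]
x = [zi + rng.gauss(0, 0.5) for zi in z]
y = [zi + rng.gauss(0, 0.5) for zi in z]

r_xy = pearson_r(x, y)
print(round(r_xy, 2))  # clearly positive, although X does not cause Y
```

Under these assumptions the expected correlation is 1 / (1 + 0.25) = 0.8, since X and Y share the variance of Z and each adds noise variance 0.25.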
Fisher's Z-transformation
Fisher's Z-transformation is an approximate variance-stabilizing transformation of r when the two random variables X and Y follow a bivariate normal distribution. It is used, for example, when correlation coefficients are averaged and when testing certain hypotheses about correlations.
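Calculation:
$$Z = \frac{1}{2} \ln\left(\frac{1 + r}{1 - r}\right) = \operatorname{arctanh}(r)$$
where 'ln' is the natural logarithm function and 'arctanh' is the inverse hyperbolic tangent function.
The standard error of the Z-transformed correlation is
$$\sigma_Z = \frac{1}{\sqrt{n - 3}}$$
So Fisher's Z-transformation and its inverse
$$r = \frac{e^{2Z} - 1}{e^{2Z} + 1} = \tanh(Z)$$
can be used to calculate confidence intervals for correlation coefficients.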
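A confidence interval along these lines could be computed as in the sketch below. It assumes the usual standard error 1/sqrt(n - 3) for the transformed value and a normal quantile of about 1.96 for a 95% interval. The function names and the input values (r = 0.5, n = 103) are hypothetical.

```python
import math

def fisher_z(r):
    """Fisher's Z-transformation of a correlation coefficient."""
    return math.atanh(r)

def inverse_fisher_z(z):
    """Back-transformation from Fisher's Z to a correlation."""
    return math.tanh(z)

def correlation_ci(r, n, z_crit=1.959964):
    """Approximate confidence interval for a correlation.

    z_crit is the standard normal quantile; 1.959964 gives about 95%.
    """
    z = fisher_z(r)
    se = 1 / math.sqrt(n - 3)
    return inverse_fisher_z(z - z_crit * se), inverse_fisher_z(z + z_crit * se)

low, high = correlation_ci(0.5, 103)
print(round(low, 3), round(high, 3))
```

Because the interval is built on the Z scale and then transformed back, it is not symmetric around r.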
Averaging correlations
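If the correlations originate from equal-sized samples, you can simply average the Z-transformed correlation coefficients and apply the inverse transformation to the mean.
If the sample sizes are not equal, the following formula applies:
$$\bar{Z} = \frac{\sum_{j=1}^{k} (n_j - 3) Z_j}{\sum_{j=1}^{k} (n_j - 3)}$$
where Z_j are the Z-transformed correlation coefficients and n_j are the corresponding sample sizes. The inverse transformation of \bar{Z} gives the averaged correlation.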
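A weighted average of this kind might look as follows. The sketch assumes weights of n_j - 3, which correspond to the inverse variance 1/(n_j - 3) of each Z-transformed coefficient. The function name and the input values are hypothetical.

```python
import math

def average_correlation(rs, ns):
    """Average correlations on the Fisher Z scale, weighted by n_j - 3."""
    weights = [n - 3 for n in ns]
    z_values = [math.atanh(r) for r in rs]
    z_bar = sum(w * z for w, z in zip(weights, z_values)) / sum(weights)
    return math.tanh(z_bar)  # transform the mean Z back to a correlation

# Hypothetical correlations from three samples of different size.
r_bar = average_correlation([0.30, 0.50, 0.70], [20, 50, 100])
print(round(r_bar, 3))
```

With equal sample sizes the weights cancel, and the result reduces to the back-transformed plain mean of the Z values.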
Testing correlation hypotheses
Case A) testing H0: ρ=0
This is by far the most common case. Normally one is interested in whether a sample correlation r differs significantly from a hypothesized zero correlation in the population (ρ = 0).
In such a case the following t-test applies:
$$t = \frac{r \sqrt{n - 2}}{\sqrt{1 - r^2}}$$
The t-value has (n-2) degrees of freedom.
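The t-value for this test can be computed in a few lines. The sketch below uses hypothetical inputs (r = 0.6, n = 27) and compares the result with the two-sided critical value for 25 degrees of freedom at alpha = 0.05, which is about 2.060 in standard t tables.

```python
import math

def t_for_zero_correlation(r, n):
    """t statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

r, n = 0.6, 27          # hypothetical sample correlation and sample size
t = t_for_zero_correlation(r, n)
df = n - 2
t_critical = 2.060      # two-sided, alpha = 0.05, df = 25 (from a t table)

print(round(t, 2), df, abs(t) > t_critical)
```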
Case B) testing H0: ρ = ρ0 (with ρ0 ≠ 0)
Sometimes you want to test whether a sample correlation r differs from a well-known population correlation ρ0 that is itself different from zero.
In that case you can calculate the following z-value of the standard normal distribution (CAUTION: do not confuse the z-value of the standard normal distribution with Fisher's Z-values):
$$z = \frac{Z - Z_0}{\sqrt{1 / (n - 3)}} = (Z - Z_0) \sqrt{n - 3}$$
Z = Fisher's Z-transformation of the given correlation
Z0 = Fisher's Z-transformation of the well-known population correlation
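A sketch of this test is shown below. It assumes the standard error 1/sqrt(n - 3) for the Fisher Z-value and uses `statistics.NormalDist` from the Python standard library for the two-sided p-value. The function name and the inputs are hypothetical.

```python
import math
from statistics import NormalDist

def z_test_against_rho0(r, rho0, n):
    """z statistic and two-sided p-value for H0: rho = rho0."""
    big_z = math.atanh(r)        # Fisher's Z of the sample correlation
    big_z0 = math.atanh(rho0)    # Fisher's Z of the population correlation
    z = (big_z - big_z0) * math.sqrt(n - 3)
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p

# Hypothetical values: sample r = 0.5 from n = 103 cases, rho0 = 0.3.
z, p = z_test_against_rho0(0.5, 0.3, 103)
print(round(z, 3), round(p, 4))
```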
Case C) testing H0: ρ1=ρ2
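If you want to test whether two correlation coefficients from two independent samples differ significantly, the following z-value is applicable:
$$z = \frac{Z_1 - Z_2}{\sqrt{\frac{1}{n_1 - 3} + \frac{1}{n_2 - 3}}}$$
where Z1 and Z2 are the Fisher's Z-transformed correlations and n1 and n2 are the corresponding sample sizes.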
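The comparison of two independent correlations could be sketched like this. It assumes that the difference of the two Fisher Z-values has the standard error sqrt(1/(n1 - 3) + 1/(n2 - 3)). The function name and the inputs are hypothetical.

```python
import math
from statistics import NormalDist

def z_test_two_correlations(r1, n1, r2, n2):
    """z statistic and two-sided p-value for H0: rho1 = rho2."""
    se = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    z = (math.atanh(r1) - math.atanh(r2)) / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p

# Hypothetical values: r1 = 0.5 and r2 = 0.3, each from 103 cases.
z, p = z_test_two_correlations(0.5, 103, 0.3, 103)
print(round(z, 3), round(p, 4))
```

With these inputs the difference does not reach significance at alpha = 0.05, even though the same r = 0.5 would differ significantly from a fixed population value of 0.3. The second sample correlation carries its own sampling error.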
Case D) testing H0: ρ1=ρ2=...=ρk
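If you want to test whether k correlation coefficients from k independent samples differ significantly, the following χ2-distributed test value is applicable:
$$\chi^2 = \sum_{j=1}^{k} (n_j - 3)(Z_j - U)^2, \qquad U = \frac{\sum_{j=1}^{k} (n_j - 3) Z_j}{\sum_{j=1}^{k} (n_j - 3)}$$
where Z_j are the Fisher's Z-transformed correlations and n_j are the corresponding sample sizes.
The χ2-value has k-1 degrees of freedom.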
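A homogeneity test for k correlations might be sketched as follows. It assumes the weighted form in which each squared deviation of a Fisher Z-value from the weighted mean is multiplied by n_j - 3. The function name and the inputs are hypothetical, and the critical value 5.991 (df = 2, alpha = 0.05) is taken from a standard chi-square table.

```python
import math

def chi2_homogeneity(rs, ns):
    """Chi-square statistic and degrees of freedom for H0: all rho_j equal."""
    weights = [n - 3 for n in ns]
    z_values = [math.atanh(r) for r in rs]
    # Weighted mean of the Fisher Z-values.
    u = sum(w * z for w, z in zip(weights, z_values)) / sum(weights)
    chi2 = sum(w * (z - u) ** 2 for w, z in zip(weights, z_values))
    return chi2, len(rs) - 1

# Hypothetical correlations from three independent samples.
chi2, df = chi2_homogeneity([0.2, 0.5, 0.7], [60, 80, 100])
chi2_critical = 5.991  # df = 2, alpha = 0.05 (from a chi-square table)
print(round(chi2, 3), df, chi2 > chi2_critical)
```

For k = 2 the statistic equals the square of the z-value used for comparing two independent correlations.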
Example of a Correlation
i | x | y | x² | y² | x*y |
1 | 2 | 1 | 4 | 1 | 2 |
2 | 1 | 2 | 1 | 4 | 2 |
3 | 9 | 6 | 81 | 36 | 54 |
4 | 5 | 4 | 25 | 16 | 20 |
5 | 3 | 2 | 9 | 4 | 6 |
Σ | 20 | 15 | 120 | 61 | 84 |
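Then, with n = 5 cases, the correlation is:
$$r = \frac{n \sum xy - \sum x \sum y}{\sqrt{\left(n \sum x^2 - (\sum x)^2\right)\left(n \sum y^2 - (\sum y)^2\right)}} = \frac{5 \cdot 84 - 20 \cdot 15}{\sqrt{(5 \cdot 120 - 20^2)(5 \cdot 61 - 15^2)}} = \frac{120}{\sqrt{200 \cdot 80}} \approx 0.949$$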
We get a positive correlation, so greater values in x correspond to greater values in y. The following figure illustrates this:
The t-value for testing whether the correlation is significantly different from a zero correlation is:
$$t = \frac{r \sqrt{n - 2}}{\sqrt{1 - r^2}} = \frac{0.949 \cdot \sqrt{3}}{\sqrt{1 - 0.9}} \approx 5.196$$
Because this t-value is greater than the critical t-value for a two-sided test (t(df=3, alpha=0.05)=3.182), we can say that the obtained correlation coefficient differs significantly from zero.
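The example can be re-checked directly from the column sums of the table. The sketch below uses the computational form of Pearson's formula based on raw sums. The variable names are illustrative.

```python
import math

# Column sums from the example table (n = 5 cases).
n = 5
sum_x, sum_y = 20, 15
sum_x2, sum_y2, sum_xy = 120, 61, 84

# Pearson correlation from raw sums.
numerator = n * sum_xy - sum_x * sum_y
denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r = numerator / denominator

# t statistic for H0: rho = 0 with n - 2 = 3 degrees of freedom.
t = r * math.sqrt(n - 2) / math.sqrt(1 - r * r)
t_critical = 3.182  # two-sided, alpha = 0.05, df = 3

print(round(r, 4), round(t, 3), t > t_critical)  # 0.9487 5.196 True
```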
BrightStat output of the correlation example
References
Galton, F. (1886). Regression towards mediocrity in hereditary stature. Journal of the Anthropological Institute of Great Britain and Ireland, 15, 246–263.
Pearson, K. (1895). Notes on regression and inheritance in the case of two parents. Proceedings of the Royal Society of London, 58, 240–242.
Fisher, R.A. (1915). Frequency distribution of the values of the correlation coefficient in samples from an indefinitely large population. Biometrika, 10(4), 507–521.
Bortz, J. (2005). Statistik für Human- und Sozialwissenschaftler (6th Edition). Heidelberg: Springer Medizin Verlag.