Correlation

A correlation between two random variables X and Y is a standardized measure of how strongly the two variables change together in a linear way. A correlation is usually denoted by 'r', and its values range from -1 to +1. A strong positive correlation indicates that greater values in one variable correspond to greater values in the other variable; a strong negative correlation indicates that greater values in one variable correspond to smaller values in the other. A correlation of 0 indicates that there is no linear relationship between the two variables.

Pearson's product-moment correlation

Requirements:

Both random variables must be at least interval scaled, and they must follow a bivariate normal distribution.

[Figure: illustration of a bivariate normal distribution]

Calculation:

Suppose we have two normally distributed random variables x and y.

x_i and y_i denote the values of x and y for case i. Then the correlation is defined as:

r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}

If the standard deviations of x and y as well as the covariance between x and y are known, the correlation can equivalently be written as:

r = \frac{\operatorname{cov}(x, y)}{s_x \, s_y}

A correlation can also always be written as the mean cross-product of the standardized values of x and y:

r = \frac{1}{n} \sum_{i=1}^{n} z_{x_i} \, z_{y_i}

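As a minimal sketch, the definition above translates directly into Python (plain standard library; the function name is my own choice, not from any particular package):

    import math

    def pearson_r(x, y):
        """Pearson's product-moment correlation of two equal-length samples."""
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
        sxx = sum((xi - mx) ** 2 for xi in x)
        syy = sum((yi - my) ** 2 for yi in y)
        return sxy / math.sqrt(sxx * syy)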

Spearman's rank correlation

Spearman's rank correlation is applied if the random variables X and Y are ordinal scaled.

Spearman's rank correlation is identical to Pearson's product-moment correlation computed after the values of both X and Y have been transformed into ranks (ranks ranging from 1 to N).

It can be rewritten as:

r_s = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n (n^2 - 1)}

and its standard error is:

\sigma_{r_s} = \frac{1}{\sqrt{n - 1}}

where d_i denotes the rank difference of observation i.
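A short Python sketch of this equivalence (the midrank handling of ties is my own illustrative choice; pearson_r is the helper sketched above):

    def ranks(values):
        """Ranks 1..N; tied values receive the average of their rank positions."""
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0.0] * len(values)
        i = 0
        while i < len(order):
            j = i
            # extend j over the run of tied values
            while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
                j += 1
            mid = (i + j) / 2 + 1  # average 1-based rank of the tied run
            for k in range(i, j + 1):
                r[order[k]] = mid
            i = j + 1
        return r

    def spearman_rs(x, y):
        """Spearman's rank correlation: Pearson's r computed on the ranks."""
        return pearson_r(ranks(x), ranks(y))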

 


Interpretation

Correlation does not imply causation.

In principle, there are four different ways to interpret a correlation between two variables X and Y, provided the correlation is not a coincidence:

 

X causes Y

Y causes X

X causes Y and Y causes X (bidirectional causation)

There is a third variable Z that causes both X and Y

 

[Figure: correlation and causation]

From the fact alone that X and Y are correlated, no conclusion can be drawn about the existence or the direction of a cause-and-effect relationship.

 


Fisher's Z-transformation

Fisher's Z-transformation is an approximate variance-stabilizing transformation of r when the two random variables X and Y follow a bivariate normal distribution. It is used, for example, when correlation coefficients are averaged and when testing certain hypotheses about correlations.

Calculation:

Z = \frac{1}{2} \ln\!\left(\frac{1 + r}{1 - r}\right) = \operatorname{arctanh}(r)

where 'ln' is the natural logarithm and 'arctanh' is the inverse hyperbolic tangent.

The standard error of the Z-transformed correlation is:

\sigma_Z = \frac{1}{\sqrt{n - 3}}

So Fisher's Z-transformation and its inverse

r = \frac{e^{2Z} - 1}{e^{2Z} + 1} = \tanh(Z)

can be used to calculate confidence intervals for correlation coefficients.
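A short sketch of the transformation and a confidence interval built from it (Python; the z_crit default of 1.96 for an approximate 95% interval is my own choice):

    import math

    def fisher_z(r):
        """Fisher's Z-transformation: Z = arctanh(r)."""
        return math.atanh(r)

    def inverse_fisher_z(z):
        """Inverse transformation: r = tanh(Z)."""
        return math.tanh(z)

    def r_confidence_interval(r, n, z_crit=1.96):
        """Approximate CI for a correlation via Fisher's Z (assumes n > 3)."""
        z, se = fisher_z(r), 1.0 / math.sqrt(n - 3)
        return inverse_fisher_z(z - z_crit * se), inverse_fisher_z(z + z_crit * se)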

 


Averaging correlations

If the correlations originate from equal-sized samples, you can simply back-transform the mean of the Z-transformed correlation coefficients.

If the sample sizes are not equal, the following formula applies:

\bar{Z} = \frac{\sum_{j=1}^{k} (n_j - 3) \, Z_j}{\sum_{j=1}^{k} (n_j - 3)}

where Z_j are the Z-transformed correlation coefficients and n_j are the corresponding sample sizes.
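As a sketch, the weighted average in Python (reusing the fisher_z and inverse_fisher_z helpers sketched above):

    def average_correlations(rs, ns):
        """Average correlations via Fisher's Z, weighting each Z_j by (n_j - 3)."""
        zs = [fisher_z(r) for r in rs]
        weights = [n - 3 for n in ns]
        z_bar = sum(w * z for w, z in zip(weights, zs)) / sum(weights)
        return inverse_fisher_z(z_bar)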

 


Testing correlation hypotheses

 

Case A) testing H0: ρ = 0

This is by far the most common case. Usually one is interested in whether a given sample correlation r differs significantly from a hypothesized population correlation of zero.

In such a case the following t-test applies:

t = \frac{r \sqrt{n - 2}}{\sqrt{1 - r^2}}

The t-value has (n-2) degrees of freedom.
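A minimal Python sketch of this statistic (computing a p-value would additionally need a t-distribution, e.g. from scipy.stats, which I leave out here):

    import math

    def t_test_zero_correlation(r, n):
        """t-statistic for H0: rho = 0; compare against t with n - 2 df."""
        return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)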

 

Case B) testing H0: ρ = ρ0 ≠ 0

Sometimes you want to test whether a given correlation differs from a well-known population correlation ρ0 that is itself different from zero.

In that case you can calculate the following z-value of the standard normal distribution (CAUTION: do not confuse the z-value of the standard normal distribution with Fisher's Z-values):

 

z = \frac{Z - Z_0}{\sigma_Z} = (Z - Z_0) \sqrt{n - 3}

 

Z = Fisher's Z-transformation of the given correlation

Z_0 = Fisher's Z-transformation of the well-known population correlation
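A corresponding sketch (reusing fisher_z from above; the result is compared against standard normal quantiles):

    import math

    def z_test_known_rho(r, rho0, n):
        """z-statistic for H0: rho = rho0 != 0; standard normal under H0."""
        return (fisher_z(r) - fisher_z(rho0)) * math.sqrt(n - 3)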

 

Case C) testing H0: ρ1 = ρ2

If you want to test whether two correlation coefficients from two independent samples differ significantly, the following z-value is applicable:

 

z = \frac{Z_1 - Z_2}{\sqrt{\frac{1}{n_1 - 3} + \frac{1}{n_2 - 3}}}
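A sketch in Python (again reusing fisher_z from above):

    import math

    def z_test_two_correlations(r1, n1, r2, n2):
        """z-statistic for H0: rho1 = rho2, two independent samples."""
        se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
        return (fisher_z(r1) - fisher_z(r2)) / se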

Case D) testing H0: ρ1 = ρ2 = ... = ρk

If you want to test whether k correlation coefficients from k independent samples differ significantly, the following χ²-distributed statistic is applicable:

 

\chi^2 = \sum_{j=1}^{k} (n_j - 3) \left( Z_j - \bar{Z} \right)^2

where \bar{Z} is the weighted mean of the Z-transformed correlation coefficients from the section on averaging correlations.

The χ2-value has k-1 degrees of freedom.
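A sketch in Python (reusing fisher_z; compare the result against the chi-square distribution with k - 1 degrees of freedom):

    def chi2_test_k_correlations(rs, ns):
        """Chi-square statistic for H0: rho_1 = ... = rho_k."""
        zs = [fisher_z(r) for r in rs]
        weights = [n - 3 for n in ns]
        z_bar = sum(w * z for w, z in zip(weights, zs)) / sum(weights)
        return sum(w * (z - z_bar) ** 2 for w, z in zip(weights, zs))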


Example of a Correlation

 

      x    y    x²   y²   x·y
      2    1     4    1     2
      1    2     1    4     2
      9    6    81   36    54
      5    4    25   16    20
      3    2     9    4     6
Σ    20   15   120   61    84

 

Then the correlation (with n = 5) is:

r = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{\sqrt{\left(n \sum x_i^2 - (\sum x_i)^2\right)\left(n \sum y_i^2 - (\sum y_i)^2\right)}} = \frac{5 \cdot 84 - 20 \cdot 15}{\sqrt{(5 \cdot 120 - 20^2)(5 \cdot 61 - 15^2)}} = \frac{120}{\sqrt{200 \cdot 80}} \approx 0.949

We get a positive correlation, so greater values in x correspond to greater values in y. The following figure illustrates this:

 

[Figure: scatterplot of x and y]

 

The t-value for testing whether the correlation differs significantly from zero is:

 

t = \frac{r \sqrt{n - 2}}{\sqrt{1 - r^2}} = \frac{0.949 \cdot \sqrt{3}}{\sqrt{1 - 0.9}} \approx 5.196

 

Because this t-value is greater than the critical t-value for a two-tailed test (t(df = 3, α = 0.05) = 3.182), we can say that the obtained correlation coefficient differs significantly from zero.
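As a check, a few lines of Python (reusing the pearson_r and t_test_zero_correlation sketches above) reproduce these numbers:

    x = [2, 1, 9, 5, 3]
    y = [1, 2, 6, 4, 2]

    r = pearson_r(x, y)
    t = t_test_zero_correlation(r, len(x))
    print(f"r = {r:.3f}, t = {t:.3f}")  # r = 0.949, t = 5.196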

 

[Figure: BrightStat output of the correlation example]

 


Wikipedia: Correlation

Wikipedia: Correlation and causation

Wikipedia: Fisher's Z-transformation


References

Galton, F. (1886). Regression towards mediocrity in hereditary stature. Journal of the Anthropological Institute of Great Britain and Ireland, 15, 246–263.

Pearson, K. (1895). Notes on regression and inheritance in the case of two parents. Proceedings of the Royal Society of London, 58, 240–242.

Fisher, R.A. (1915). Frequency distribution of the values of the correlation coefficient in samples from an indefinitely large population. Biometrika, 10(4), 507–521.

Bortz, J. (2005). Statistik für Human- und Sozialwissenschaftler (6th Edition). Heidelberg: Springer Medizin Verlag.




 
