sampling distribution of correlation r

two independent variables

If r is the correlation coefficient of two independent variables x1 and x2, then r should be 0. However the sample estimation of r is not exactly 0. There is a sampling distribution around 0.

rep = 1000; n = 25
set.seed(1)
cors = c()
for(i in 1:rep){
  x1 = rnorm(n)
  x2 = rnorm(n)
  r = cor(x1,x2)
  cors = c(cors, r)
}
layout(matrix(1:2,1,2))
par(cex.main=0.75)
plot(x1, x2, main = paste("Scatterplot of last simulation\nn=",n,", estimated r = ", round(r,4), sep=""))
hist(cors, main = paste("Distribution of estimated r\nmean=",round(mean(cors),4)," SD=",round(sd(cors),4),"\n(",rep," simulations of sample size = ",n,")", sep=""), xlab=expression(hat(r)), freq=F)

two correlated variables

However, if the population correlation is not equal to 0, the distribution of estimated r is skewed.

rep = 1000; n = 25
set.seed(1)
cors = c()
for(i in 1:rep){
  x1 = rnorm(n)
  x2 = rnorm(n)
  r = cor(x1,x1+2*x2)
  cors = c(cors, r)
}
layout(matrix(1:2,1,2))
par(cex.main=0.75)
plot(x1, x2, main = paste("Scatterplot of last simulation\nn=",n,", estimated r = ", round(r,4), sep=""))
hist(cors, main = paste("Distribution of estimated r\nmean=",round(mean(cors),4)," SD=",round(sd(cors),4),"\n(",rep," simulations of sample size = ",n,")", sep=""), xlab=expression(hat(r)), freq=F)

Note that if x1 and x2 are two independent standard normal variables, the correlation between x1 and x1+2*x2 should be:

\[cor(x_1,x_1+2x_2) = \frac{cov(x_1, x_1+2x_2)}{\sqrt{var(x_1) \times var(x_1+2x_2)}}\] \[= \frac{cov(x_1,x_1)+2cov(x_1,x_2)}{\sqrt{var(x1)\times (var(x_1)+4var(x_2)+2\times 2 \times cov(x_1,x_2))}}\] \[= \frac{1+0}{\sqrt{1 \times 5}} = 0.4472136\]

Fisher’s z transformtion

Fisher developed a way to r to z’:

\[z' = \frac{1}{2} \times ln\frac{1+r}{1-r} \sim N( \mu, \frac{1}{\sqrt{N-3}} )\]

rep = 1000; n = 25
set.seed(1)
zs = c()
for(i in 1:rep){
  x1 = rnorm(n)
  x2 = rnorm(n)
  r = cor(x1,x1+2*x2)
  z = log((1+r)/(1-r))/2
  zs = c(zs, z)
}
layout(matrix(1:2,1,2))
par(cex.main=0.75)
plot(x1, x2, main = paste("Scatterplot of last simulation\nn=",n,", estimated r = ", round(r,4),", z' = ", round(z,4), sep=""))
hist(zs, main = paste("Distribution of z'\nmean=",round(mean(zs),4)," SD=",round(sd(zs),4),"\n(",rep," simulations of sample size = ",n,")", sep=""), xlab="z'", freq=F)

### The mean of z'
0.5 * log((1+0.4472136)/(1-0.4472136))

## [1] 0.4812118

### The standard error of z'
1/sqrt(n-3)

## [1] 0.2132007

Reference

Sampling Distribution of Pearson’s r