3 Inferences in Regression and Correlation Analysis
3.1 Inferences Concerning \(\beta_{1}\)
- Sampling Distribution of \(b_{1}\)
The sampling distribution of \(b_{1}\) refers to the different values of \(b_{1}\) that would be obtained with repeated sampling when the levels of the predictor variable X are held constant from sample to sample.
For the normal error regression model, the sampling distribution of \(b_{1}\) is normal, with mean and variance:
\[E(b_1) = \beta_{1}\]
\[\sigma^{2}(b_1) = \frac{\sigma^{2}}{\sum(X_{i} - \bar{X})^{2}}\]
- Proof
\(b_{1}\) as a linear combination of the \(Y_{i}\)
\[b_{1} = \sum k_{i}Y_{i}\text{ where }k_{i} = \frac{X_{i} - \bar{X}}{\sum(X_{i} - \bar{X})^{2}}\]
- Normality
The \(Y_{i}\) are independent, normally distributed random variables, so \(b_{1}\), a linear combination of the \(Y_{i}\), is normally distributed.
- Mean
\[E(b_{1}) = E(\sum k_{i}Y_{i}) = \sum k_{i}E(Y_{i}) = \sum k_{i}(\beta_{0} + \beta_{1}X_{i}) = \beta_{1}\]
hint:
\[\sum k_{i} = 0\]
\[\sum k_{i}X_{i} = 1\]
- Variance
\[\sigma^{2}(b_{1}) = \sigma^{2}(\sum k_{i}Y_{i}) = \sum k_{i}^{2}\sigma^{2}(Y_{i}) = \sum k_{i}^{2}\sigma^{2} = \sigma^{2}\frac{1}{\sum (X_{i} - \bar{X})^{2}}\]
- Estimated Variance
Replace the parameter \(\sigma^{2}\) with MSE:
\[s^{2}(b_{1}) = \frac{MSE}{\sum(X_{i} - \bar{X})^{2}}\]
- Sampling Distribution of \((b_{1} - \beta_{1})/s(b_{1})\)
\[(b_{1} - \beta_{1})/\sigma(b_{1}) \sim N(0,1)\]
\[(b_{1} - \beta_{1})/s(b_{1}) \sim t(n-2)\]
When a statistic is standardized but the denominator is an estimated standard deviation rather than the true standard deviation, it is called a studentized statistic.
- Comment
\[SSE/\sigma^{2} \sim \chi^{2}(n - 2)\]
\[\frac{b_{1} - \beta_{1}}{s(b_{1})} = \frac{z}{\sqrt{\frac{\chi^{2}(n-2)}{n-2}}} \sim t(n-2)\]
where the standard normal \(z\) and the \(\chi^{2}(n-2)\) variable are independent.
- Confidence Interval for \(\beta_{1}\)
\[b_{1} \pm t(1-\alpha/2;\, n-2)\,s(b_{1})\]
where \(1 - \alpha\) is the confidence coefficient.
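As a minimal sketch (using the built-in trees data of Section 3.12, with Girth regressed on Volume), the interval can be computed with qt(); confint() should agree:
fit <- lm(Girth ~ Volume, data = trees)
n <- nrow(trees)
b1 <- coef(fit)["Volume"]
s.b1 <- summary(fit)$coefficients["Volume", "Std. Error"]
b1 + c(-1, 1) * qt(1 - 0.05/2, n - 2) * s.b1  # manual 95% CI
confint(fit, "Volume", level = 0.95)          # built-in check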
- Tests concerning \(\beta_{1}\)
Since \((b_{1} - \beta_{1})/s(b_{1})\) is distributed as t with \(n - 2\) degrees of freedom, tests concerning \(\beta_{1}\) can be set up in the ordinary fashion using the t distribution.
3.2 Inferences Concerning \(\beta_{0}\)
For the normal error regression model, the sampling distribution of \(b_{0}\) is normal, with mean and variance:
\[E(b_{0}) = \beta_{0}\]
\[\sigma^{2}(b_{0}) = \sigma^{2}[\frac{1}{n} + \frac{\bar{X}^{2}}{\sum (X_{i} - \bar{X})^{2}}]\]
\[s^{2}(b_{0}) = MSE[\frac{1}{n} + \frac{\bar{X}^{2}}{\sum (X_{i} - \bar{X})^{2}}]\]
\[\frac{b_{0} - \beta_{0}}{s(b_{0})} \sim t(n-2)\]
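A minimal sketch of these quantities for \(b_{0}\), computed directly from the formulas above on the trees data of Section 3.12:
X <- trees$Volume; Y <- trees$Girth; n <- length(Y)
fit <- lm(Y ~ X)
MSE <- sum(residuals(fit)^2) / (n - 2)
s.b0 <- sqrt(MSE * (1/n + mean(X)^2 / sum((X - mean(X))^2)))
t0 <- coef(fit)[1] / s.b0        # t* for H0: beta0 = 0
2 * pt(-abs(t0), n - 2)          # two-sided p-value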
3.3 Some Considerations on Making Inferences Concerning \(\beta_{0}\) and \(\beta_{1}\)
- Effects of Departures from Normality
- Interpretation of Confidence Coefficient and Risks of Errors
- Spacing of the X levels
- Power of Tests
The power of this test is the probability that the decision rule will lead to conclusion \(H_{a}\) when \(H_{a}\) in fact holds. Specifically, the power is given by
\[Power = P\{|t^{*}| > t(1-\alpha/2; n-2) \mid \delta\}\]
where,
- \(H_{0}: \beta_{1} = \beta_{10}\); \(H_{a}: \beta_{1} \neq \beta_{10}\)
- \(t^{*} = \frac{b_{1} - \beta_{10}}{s(b_{1})}\)
- \(\delta\) is the noncentrality measure, a measure of how far the true value of \(\beta_{1}\) is from \(\beta_{10}\). \(\delta = \frac{\mid\beta_{1} - \beta_{10}\mid}{\sigma(b_{1})}\)
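When \(H_{a}\) holds, \(t^{*}\) follows a noncentral t distribution with noncentrality parameter \(\delta\), so the power can be evaluated with pt() and its ncp argument. A sketch, with \(\delta = 2\) and \(n = 31\) chosen purely for illustration:
n <- 31; alpha <- 0.05
delta <- 2                           # illustrative noncentrality, not from the text
tcrit <- qt(1 - alpha/2, n - 2)
pt(-tcrit, n - 2, ncp = delta) + 1 - pt(tcrit, n - 2, ncp = delta)  # power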
3.4 Interval Estimation of \(E(Y_{h})\)
The mean response when \(X = X_{h}\) is denoted by \(E(Y_{h})\). Its point estimator \(\hat{Y}_{h}\) is:
\[\hat{Y}_{h} = b_{0} + b_{1}X_{h}\]
- Sampling Distribution of \(\hat{Y}_{h}\)
For the normal error regression model, the sampling distribution of \(\hat{Y}_{h}\) is normal, with mean and variance:
\[E(\hat{Y}_{h}) = E(Y_{h})\]
\[\sigma^{2}(\hat{Y}_{h}) = \sigma^{2}[\frac{1}{n} + \frac{(X_{h} - \bar{X})^2}{\sum(X_{i} - \bar{X})^{2}}]\]
\[s^{2}(\hat{Y}_{h}) = MSE[\frac{1}{n} + \frac{(X_{h} - \bar{X})^{2}}{\sum (X_{i} - \bar{X})^{2}}]\]
\[\frac{\hat{Y}_{h} - E(Y_{h})}{s(\hat{Y}_{h})} \sim t(n-2)\]
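A sketch of the resulting \(1-\alpha\) confidence interval for \(E(Y_{h})\) on the trees data, computed both from the formula and with predict(); the level \(X_{h} = 20\) is an arbitrary illustrative choice:
X <- trees$Volume; Y <- trees$Girth; n <- length(Y)
fit <- lm(Y ~ X)
Xh <- 20                             # illustrative level of X
Yh.hat <- sum(coef(fit) * c(1, Xh))
MSE <- sum(residuals(fit)^2) / (n - 2)
s.Yh <- sqrt(MSE * (1/n + (Xh - mean(X))^2 / sum((X - mean(X))^2)))
Yh.hat + c(-1, 1) * qt(0.975, n - 2) * s.Yh
predict(fit, data.frame(X = Xh), interval = "confidence")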
3.5 Prediction of New Observation
We denote the level of X for the new trial as \(X_{h}\) and the new observation on Y as \(Y_{h(new)}\).
In Section 3.4 we estimated \(E(Y_{h})\), the mean of the distribution of Y at \(X = X_{h}\); here we predict an individual outcome drawn from that distribution.
Hence \(\sigma^{2}(pred)\) has two components:
- The variance of the distribution of Y at \(X = X_{h}\), namely \(\sigma^{2}\)
- The variance of the sampling distribution of \(\hat{Y}_h\), namely \(\sigma^{2}(\hat{Y}_h)\)
\[\sigma^{2}(pred) = \sigma^{2}(Y_{h(new)} - \hat{Y}_{h}) = \sigma^{2} + \sigma^{2}(\hat{Y}_{h})\]
\[s^{2}(pred) = MSE[1 + \frac{1}{n} + \frac{(X_{h} - \bar{X})^{2}}{\sum (X_{i} - \bar{X})^{2}}]\]
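A sketch of the corresponding prediction interval, again with an illustrative \(X_{h} = 20\); predict() with interval = "prediction" should give the same limits:
X <- trees$Volume; Y <- trees$Girth; n <- length(Y)
fit <- lm(Y ~ X)
Xh <- 20                             # illustrative level of X
MSE <- sum(residuals(fit)^2) / (n - 2)
s.pred <- sqrt(MSE * (1 + 1/n + (Xh - mean(X))^2 / sum((X - mean(X))^2)))
sum(coef(fit) * c(1, Xh)) + c(-1, 1) * qt(0.975, n - 2) * s.pred
predict(fit, data.frame(X = Xh), interval = "prediction")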
3.6 Confidence Band for Regression Line
We wish to obtain a confidence band for the entire regression line \(E(Y) = \beta_{0} + \beta_{1}X\).
The Working-Hotelling \(1 - \alpha\) confidence band is:
\[\hat{Y}_{h} \pm Ws(\hat{Y}_{h})\]
where,
\[W^{2} = 2F(1-\alpha; 2, n-2)\]
Since the band must hold simultaneously for all values of \(X_{h}\), it is wider at each \(X_{h}\) than the confidence interval for a single \(X_{h}\).
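A sketch evaluating the band boundary at one illustrative \(X_{h}\) on the trees data:
X <- trees$Volume; Y <- trees$Girth; n <- length(Y)
fit <- lm(Y ~ X)
W <- sqrt(2 * qf(1 - 0.05, 2, n - 2))
Xh <- 20                             # boundary at one illustrative Xh
MSE <- sum(residuals(fit)^2) / (n - 2)
s.Yh <- sqrt(MSE * (1/n + (Xh - mean(X))^2 / sum((X - mean(X))^2)))
sum(coef(fit) * c(1, Xh)) + c(-1, 1) * W * s.Yh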
3.7 Analysis of Variance Approach
- Partitioning of Total Sum of Squares
\[Y_{i} - \bar{Y} = (\hat{Y}_{i} - \bar{Y}) + (Y_{i} - \hat{Y}_{i})\]
\[\sum (Y_{i} - \bar{Y})^{2} = \sum (\hat{Y}_{i} - \bar{Y})^{2} + \sum (Y_{i} - \hat{Y}_{i})^{2}\]
\[SSTO = SSR + SSE\]
SSTO stands for total sum of squares, SSE stands for error sum of squares and SSR stands for regression sum of squares.
- Breakdown of Degrees of Freedom
\[n - 1 = 1 + (n - 2)\]
We have n-1 degrees of freedom associated with SSTO. SSE has n-2 degrees of freedom and SSR has 1 degree of freedom.
- Mean Squares
A sum of squares divided by its associated degrees of freedom is called a mean square (MS).
The mean squares are not additive:
\(\frac{SSTO}{n-1} \neq \frac{SSR}{1} + \frac{SSE}{n-2} = MSR + MSE\)
- ANalysis Of VAriance Table (ANOVA table)
The breakdowns of the total sum of squares and associated degrees of freedom are displayed in the form of an ANOVA table.
SSTOU: the total uncorrected sum of squares, \(\sum Y_i^2\)
SS (correction for mean): \(n\bar{Y}^2\)
\[SSTO = \sum (Y_i - \bar{Y})^2 = \sum Y_i^2 - n\bar{Y}^2 = SSTOU - SS\]
| Source of Variation | SS | df | MS |
|---|---|---|---|
| Regression | \(SSR = \sum(\hat{Y}_i - \bar{Y})^2\) | 1 | \(MSR = \frac{SSR}{1}\) |
| Error | \(SSE = \sum(Y_i - \hat{Y}_i)^2\) | \(n-2\) | \(MSE = \frac{SSE}{n-2}\) |
| Total | \(SSTO = \sum(Y_i - \bar{Y})^2\) | \(n-1\) | |
| Correction for mean | \(SS = n\bar{Y}^2\) | 1 | |
| Total, uncorrected | \(SSTOU = \sum Y_i^2\) | \(n\) | |
- Expected Mean Squares
\[E(MSE) = \sigma^2\]
\[E(MSR) = \sigma^2 + \beta_1^2 \sum (X_i - \bar{X})^2\]
- F test for \(H_0: \beta_1 = 0\) versus \(H_a: \beta_1 \neq 0\)
Test statistic: \(F^* = \frac{MSR}{MSE}\), which follows \(F(1, n-2)\) when \(H_0\) holds.
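A minimal check on the trees data of Section 3.12; the p-value from pf() should match the anova() output shown there:
fit <- lm(Girth ~ Volume, data = trees)
Fstar <- anova(fit)["Volume", "F value"]
Fstar
pf(Fstar, 1, df.residual(fit), lower.tail = FALSE)  # p-value under H0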
3.8 General Linear Test Approach
- Two models:
- \(Y_i = \beta_0 + \beta_1X_i + \varepsilon_i\) (full model)
- \(Y_i = \beta_0 + \varepsilon_i\) (reduced model under \(H_0: \beta_1 = 0\))
- F-statistic:
\[F = \frac{(SSE(R) - SSE(F))/(df_R - df_F)}{SSE(F)/df_F}\]
The general linear test approach can be used for highly complex tests of linear statistical models, as well as for simple tests. The basic steps in summary form are:
- Fit the full model and obtain the error sum of squares SSE(F)
- Fit the reduced model under H0 and obtain the error sum of squares SSE(R)
- Use the test statistic and decision rule
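A sketch of these steps for the simple test of \(H_0: \beta_1 = 0\) on the trees data; anova(reduced, full) computes the same \(F^{*}\):
full    <- lm(Girth ~ Volume, data = trees)
reduced <- lm(Girth ~ 1, data = trees)   # model under H0: beta1 = 0
SSE.F <- sum(residuals(full)^2);    df.F <- df.residual(full)
SSE.R <- sum(residuals(reduced)^2); df.R <- df.residual(reduced)
((SSE.R - SSE.F) / (df.R - df.F)) / (SSE.F / df.F)  # F*
anova(reduced, full)                     # same F statistic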
3.9 Descriptive Measures of Linear Association between X and Y
- Coefficient of Determination
\[R^2 = \frac{SSR}{SSTO} = 1 - \frac{SSE}{SSTO}, 0 \leq R^2 \leq 1\]
- Limitations of \(R^2\)
Three common misunderstandings about \(R^2\):
- A high coefficient of determination indicates that useful predictions can be made.
- A high coefficient of determination indicates that the estimated regression line is a good fit.
- A coefficient of determination near zero indicates that X and Y are not related.
- Coefficient of Correlation
\[r = \pm \sqrt{R^2}, \qquad -1 \leq r \leq 1\]
where the sign of r is that of \(b_1\).
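In R, cor() gives r directly and cor.test() performs the t test of \(H_0: \rho = 0\); a sketch on the trees data:
r <- cor(trees$Volume, trees$Girth)
r^2                                  # equals R^2 from the regression of Girth on Volume
cor.test(trees$Volume, trees$Girth)  # t test of H0: rho = 0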
3.10 Considerations in Applying Regression Analysis
- Inferences are appropriate only if the conditions under which the sample data were obtained continue to hold.
- In predictions, the predictor variable itself often has to be predicted.
- Avoid extrapolating to levels of the predictor variable that fall outside the range of observations.
- A conclusion that \(\beta_1 \neq 0\) does not establish a cause-and-effect relation between X and Y.
- When many tests or intervals are computed, the multiple-testing problem affects the overall error rate.
- The standard analysis assumes X is fixed; observations on the predictor variable X that are subject to measurement errors require special treatment.
3.11 Normal Correlation Models
- Distinction between Regression and Correlation Model
- Bivariate Normal Distribution
- Conditional Inferences
- Inferences on Correlation Coefficients
- Spearman Rank Correlation Coefficient
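As a sketch of the last item, cor() and cor.test() with method = "spearman" compute the Spearman rank correlation and the associated test (ties in the trees data make the p-value approximate):
cor(trees$Volume, trees$Girth, method = "spearman")       # rank correlation
cor.test(trees$Volume, trees$Girth, method = "spearman")  # test of association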
3.12 R code
3.12.1 Example data
head(trees)
## Girth Height Volume
## 1 8.3 70 10.3
## 2 8.6 65 10.3
## 3 8.8 63 10.2
## 4 10.5 72 16.4
## 5 10.7 81 18.8
## 6 10.8 83 19.7
X = trees$Volume ## volume
Y = trees$Girth ## girth (diameter)
3.12.2 built-in functions
fit <- lm(Y~X)
summary(fit)
##
## Call:
## lm(formula = Y ~ X)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.2945 -0.5742 -0.1520 0.7131 1.5248
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.677857 0.308628 24.88 <2e-16 ***
## X 0.184632 0.009016 20.48 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8117 on 29 degrees of freedom
## Multiple R-squared: 0.9353, Adjusted R-squared: 0.9331
## F-statistic: 419.4 on 1 and 29 DF, p-value: < 2.2e-16
coefficients(fit) # model coefficients
## (Intercept) X
## 7.6778570 0.1846321
confint(fit, level=0.95) # CIs for model parameters
## 2.5 % 97.5 %
## (Intercept) 7.0466415 8.3090724
## X 0.1661924 0.2030719
fitted(fit) # predicted values
## 1 2 3 4 5 6 7
## 9.579568 9.579568 9.561105 10.705824 11.148941 11.315110 10.558118
## 8 9 10 11 12 13 14
## 11.038162 11.850543 11.352036 12.145955 11.555132 11.628985 11.610521
## 15 16 17 18 19 20 21
## 11.204331 11.776690 13.918423 12.736777 12.422903 12.275197 14.047666
## 22 23 24 25 26 27 28
## 13.530696 14.380003 14.749268 15.543186 17.906477 17.961867 18.441910
## 29 30 31
## 17.186412 17.094096 21.894531
residuals(fit) # residuals
## 1 2 3 4 5 6
## -1.27956795 -0.97956795 -0.76110474 -0.20582396 -0.44894108 -0.51511000
## 7 8 9 10 11 12
## 0.44188174 -0.03816180 -0.75054318 -0.15203642 -0.84595459 -0.15513177
## 13 14 15 16 17 18
## -0.22898462 0.08947859 0.79566928 1.12330967 -1.01842306 0.56322259
## 19 20 21 22 23 24
## 1.27709721 1.52480292 -0.04766555 0.66930442 0.11999661 1.25073235
## 25 26 27 28 29 30
## 0.75681418 -0.60647711 -0.46186675 -0.54191030 0.81358820 0.90590427
## 31
## -1.29453117
anova(fit) # anova table
## Analysis of Variance Table
##
## Response: Y
## Df Sum Sq Mean Sq F value Pr(>F)
## X 1 276.328 276.328 419.36 < 2.2e-16 ***
## Residuals 29 19.109 0.659
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
3.12.3 point estimators \(b_0\) and \(b_1\)
n = nrow(trees)
Xbar = mean(X)
Ybar = mean(Y)
b1 = sum((X - Xbar)*(Y - Ybar))/sum((X-Xbar)^2)
b0 = Ybar - b1*Xbar
b1;b0
## [1] 0.1846321
## [1] 7.677857
3.12.4 Residuals, SSE and MSE
residual = Y - b1*X - b0
SSE = sum(residual^2)
MSE = SSE/(n-2)
SSE; MSE; sqrt(MSE)
## [1] 19.10893
## [1] 0.6589286
## [1] 0.8117442
3.12.5 sampling distribution of \(b_1\) and \((b_1−\beta_1)/s(b_1)\)
s = sqrt( MSE/sum((X - Xbar)^2))
t = b1 / s                      # t* for H0: beta1 = 0
p = 2 * (1 - pt(abs(t), n - 2)) # two-sided p-value
s; t; p
## [1] 0.009015995
## [1] 20.47829
## [1] 0
3.12.6 F test
SSTO = var(Y) * (n-1)
F = (SSTO - SSE)/((n-1) - (n-2)) / (SSE/(n-2))
F
## [1] 419.3603
3.12.7 \(R^2\) and r
Rsq = 1 - SSE/SSTO
r = b1/abs(b1) * sqrt(Rsq)
Rsq; r; cor(X,Y)
## [1] 0.9353199
## [1] 0.9671194
## [1] 0.9671194
3.12.8 plot
plot(X,Y)
abline(b0,b1, col="red")