6 Model Selection and Interpretation
6.1 Covariate Adjustment
Back to example 4.3 to illustrate the importance of covariate adjustment
### won't run. the data is simulated. see chapter 4
### beta ia positive, increasing the risk
coxph(Surv(ttAll, status)~trt)
### beta changed to negative, decreaing the risk
coxph(Surv(ttAll, status)~trt+strata(genotype))
coxph(Surv(ttAll, status)~trt+genotype)
6.2 Categorical and Continuous Covariates
indicator or dummy variable: values 0 or 1
categorical variables with three or more values, we need multiple dummy variables. one level is selected as a reference level, arbitrarily or driven by the goals of the research project.
Let us say we have a set of k covariates, we can write the model
\[log(\psi_i)=z_{1i}\beta_1+z_{2i}\beta_2+...+z_{ki}\beta_k\]
the parameter \(\beta_j\) is the log hazard ratio
for continuous covariates, it represents the effect of a unit change in the covariate
for dummy variables, it represents the effect of the corresponding level as compared to the reference covariate
non-linear: consider fit a transformation of a covariate, e.g. logarithmic or square root; or “discretize”, such as “Age<50”
interaction
Comparing with linear/logistic regression,
- time-related variables: here we only considering fixed at the begining of the trial (baseline)
- no intercept: “absorbed into the baseline hazard”
Example 6.1
<- factor(c("black", "black", "white", "white", "other",
race "other"))
<- c(48, 52, 87, 82, 67, 53)
age
model.matrix(~ race + age)[,-1]
## raceother racewhite age
## 1 0 0 48
## 2 0 0 52
## 3 0 1 87
## 4 0 1 82
## 5 1 0 67
## 6 1 0 53
### taking white as the reference level
<- relevel(race, ref="white")
race model.matrix(~ race + age)[,-1]
## raceblack raceother age
## 1 1 0 48
## 2 1 0 52
## 3 0 0 87
## 4 0 0 82
## 5 0 1 67
## 6 0 1 53
## include an interaction:
## the product between race (first two columns) and age (third column)
model.matrix(~ race + age + race:age)[,-1]
## raceblack raceother age raceblack:age raceother:age
## 1 1 0 48 48 0
## 2 1 0 52 52 0
## 3 0 0 87 0 0
## 4 0 0 82 0 0
## 5 0 1 67 0 67
## 6 0 1 53 0 53
Example 6.2
### simulate the data
## age from 40 to 80
<- runif(n=60, min=40, max=80)
age ## race with three levels
<- factor(c(rep("white", 20), rep("black", 20), rep("other", 20)))
race <- relevel(race, ref="white")
race
### simulate the survival by a exponential distribution
### with the rate parameter
= -4.5+c(rep(0,20),rep(1,20),rep(2,20))+age*0.05
log.rate.vec ### and no censoring for the survival status
= rexp(n=60, rate=exp(log.rate.vec))
tt = rep(1,60)
status
### now fit a cox proportional hazards model
library(survival)
= coxph(Surv(tt, status)~race+age)
result.cox summary(result.cox)
## Call:
## coxph(formula = Surv(tt, status) ~ race + age)
##
## n= 60, number of events= 60
##
## coef exp(coef) se(coef) z Pr(>|z|)
## raceblack 1.05400 2.86909 0.35139 3.000 0.002704 **
## raceother 2.34552 10.43875 0.43276 5.420 5.96e-08 ***
## age 0.05201 1.05339 0.01359 3.827 0.000129 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## raceblack 2.869 0.3485 1.441 5.713
## raceother 10.439 0.0958 4.470 24.379
## age 1.053 0.9493 1.026 1.082
##
## Concordance= 0.732 (se = 0.035 )
## Likelihood ratio test= 37.89 on 3 df, p=3e-08
## Wald test = 33.76 on 3 df, p=2e-07
## Score (logrank) test = 37.83 on 3 df, p=3e-08
6.3 Hypothesis Testing for Nested Models
Nested models: the covariates of one model is a subset of the covariates in the other
We can compare nested models using a (partial) likelihood ratio test
library(asaur)
attach(pharmacoSmoking)
## The following objects are masked _by_ .GlobalEnv:
##
## age, race
## model A: ageGroup4
= coxph(Surv(ttr, relapse)~ageGroup4)
modelA.coxph modelA.coxph
## Call:
## coxph(formula = Surv(ttr, relapse) ~ ageGroup4)
##
## coef exp(coef) se(coef) z p
## ageGroup435-49 0.0293 1.0297 0.3093 0.095 0.9245
## ageGroup450-64 -0.7914 0.4532 0.3361 -2.355 0.0185
## ageGroup465+ -0.3173 0.7281 0.4435 -0.715 0.4744
##
## Likelihood ratio test=12.22 on 3 df, p=0.006664
## n= 125, number of events= 89
## model B: employment
= coxph(Surv(ttr, relapse)~employment)
modelB.coxph modelB.coxph
## Call:
## coxph(formula = Surv(ttr, relapse) ~ employment)
##
## coef exp(coef) se(coef) z p
## employmentother 0.1982 1.2192 0.2371 0.836 0.403
## employmentpt 0.4500 1.5683 0.3229 1.394 0.163
##
## Likelihood ratio test=2.06 on 2 df, p=0.357
## n= 125, number of events= 89
## model C: ageGroup4+employment
= coxph(Surv(ttr, relapse)~ageGroup4+employment)
modelC.coxph modelC.coxph
## Call:
## coxph(formula = Surv(ttr, relapse) ~ ageGroup4 + employment)
##
## coef exp(coef) se(coef) z p
## ageGroup435-49 -0.1299 0.8782 0.3213 -0.404 0.68594
## ageGroup450-64 -1.0239 0.3592 0.3585 -2.856 0.00429
## ageGroup465+ -0.7825 0.4573 0.5046 -1.551 0.12102
## employmentother 0.5257 1.6917 0.2748 1.913 0.05577
## employmentpt 0.5001 1.6489 0.3315 1.508 0.13143
##
## Likelihood ratio test=16.79 on 5 df, p=0.004922
## n= 125, number of events= 89
### log-likelihood
logLik(modelA.coxph)
## 'log Lik.' -380.043 (df=3)
logLik(modelB.coxph)
## 'log Lik.' -385.1232 (df=2)
logLik(modelC.coxph)
## 'log Lik.' -377.7597 (df=5)
### model A vs model C
pchisq(2*(as.numeric(logLik(modelC.coxph))-as.numeric(logLik(modelA.coxph))), df=5-3, lower.tail=F)
## [1] 0.1019462
### model B vs model C
pchisq(2*(as.numeric(logLik(modelC.coxph))-as.numeric(logLik(modelB.coxph))), df=5-2, lower.tail=F)
## [1] 0.002065407
### model A vs null
= coxph(Surv(ttr, relapse)~1)
model.null.coxph logLik(model.null.coxph)
## 'log Lik.' -386.1533 (df=0)
pchisq(2*(as.numeric(logLik(modelA.coxph))-as.numeric(logLik(model.null.coxph))), df=3-0, lower.tail=F)
## [1] 0.006664492
### the anova() function is a more direct way
anova(modelA.coxph, modelC.coxph)
## Analysis of Deviance Table
## Cox model: response is Surv(ttr, relapse)
## Model 1: ~ ageGroup4
## Model 2: ~ ageGroup4 + employment
## loglik Chisq Df Pr(>|Chi|)
## 1 -380.04
## 2 -377.76 4.5666 2 0.1019