6 Model Selection and Interpretation

6.1 Covariate Adjustment

Back to example 4.3 to illustrate the importance of covariate adjustment

### won't run. the data is simulated. see chapter 4
### beta ia positive, increasing the risk
coxph(Surv(ttAll, status)~trt)

### beta changed to negative, decreaing the risk
coxph(Surv(ttAll, status)~trt+strata(genotype))
coxph(Surv(ttAll, status)~trt+genotype)

6.2 Categorical and Continuous Covariates

indicator or dummy variable: values 0 or 1

categorical variables with three or more values, we need multiple dummy variables. one level is selected as a reference level, arbitrarily or driven by the goals of the research project.

Let us say we have a set of k covariates, we can write the model

\[log(\psi_i)=z_{1i}\beta_1+z_{2i}\beta_2+...+z_{ki}\beta_k\]

the parameter \(\beta_j\) is the log hazard ratio
for continuous covariates, it represents the effect of a unit change in the covariate
for dummy variables, it represents the effect of the corresponding level as compared to the reference covariate
non-linear: consider fit a transformation of a covariate, e.g. logarithmic or square root; or “discretize”, such as “Age<50”
interaction

Comparing with linear/logistic regression,

time-related variables: here we only considering fixed at the begining of the trial (baseline)
no intercept: “absorbed into the baseline hazard”

Example 6.1

race <- factor(c("black", "black", "white", "white", "other",
"other"))
age <- c(48, 52, 87, 82, 67, 53)

model.matrix(~ race + age)[,-1]

##   raceother racewhite age
## 1         0         0  48
## 2         0         0  52
## 3         0         1  87
## 4         0         1  82
## 5         1         0  67
## 6         1         0  53

### taking white as the reference level
race <- relevel(race, ref="white")
model.matrix(~ race + age)[,-1]

##   raceblack raceother age
## 1         1         0  48
## 2         1         0  52
## 3         0         0  87
## 4         0         0  82
## 5         0         1  67
## 6         0         1  53

## include an interaction: 
## the product between race (first two columns) and age (third column)
model.matrix(~ race + age + race:age)[,-1]

##   raceblack raceother age raceblack:age raceother:age
## 1         1         0  48            48             0
## 2         1         0  52            52             0
## 3         0         0  87             0             0
## 4         0         0  82             0             0
## 5         0         1  67             0            67
## 6         0         1  53             0            53

Example 6.2

### simulate the data
## age from 40 to 80
age <- runif(n=60, min=40, max=80)
## race with three levels
race <- factor(c(rep("white", 20), rep("black", 20), rep("other", 20)))
race <- relevel(race, ref="white")

### simulate the survival by a exponential distribution
### with the rate parameter
log.rate.vec = -4.5+c(rep(0,20),rep(1,20),rep(2,20))+age*0.05
### and no censoring for the survival status
tt = rexp(n=60, rate=exp(log.rate.vec))
status = rep(1,60)

### now fit a cox proportional hazards model
library(survival)
result.cox = coxph(Surv(tt, status)~race+age)
summary(result.cox)

## Call:
## coxph(formula = Surv(tt, status) ~ race + age)
## 
##   n= 60, number of events= 60 
## 
##               coef exp(coef) se(coef)     z Pr(>|z|)    
## raceblack  1.05400   2.86909  0.35139 3.000 0.002704 ** 
## raceother  2.34552  10.43875  0.43276 5.420 5.96e-08 ***
## age        0.05201   1.05339  0.01359 3.827 0.000129 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##           exp(coef) exp(-coef) lower .95 upper .95
## raceblack     2.869     0.3485     1.441     5.713
## raceother    10.439     0.0958     4.470    24.379
## age           1.053     0.9493     1.026     1.082
## 
## Concordance= 0.732  (se = 0.035 )
## Likelihood ratio test= 37.89  on 3 df,   p=3e-08
## Wald test            = 33.76  on 3 df,   p=2e-07
## Score (logrank) test = 37.83  on 3 df,   p=3e-08

6.3 Hypothesis Testing for Nested Models

Nested models: the covariates of one model is a subset of the covariates in the other

We can compare nested models using a (partial) likelihood ratio test

library(asaur)
attach(pharmacoSmoking)

## The following objects are masked _by_ .GlobalEnv:
## 
##     age, race

## model A: ageGroup4
modelA.coxph = coxph(Surv(ttr, relapse)~ageGroup4)
modelA.coxph

## Call:
## coxph(formula = Surv(ttr, relapse) ~ ageGroup4)
## 
##                   coef exp(coef) se(coef)      z      p
## ageGroup435-49  0.0293    1.0297   0.3093  0.095 0.9245
## ageGroup450-64 -0.7914    0.4532   0.3361 -2.355 0.0185
## ageGroup465+   -0.3173    0.7281   0.4435 -0.715 0.4744
## 
## Likelihood ratio test=12.22  on 3 df, p=0.006664
## n= 125, number of events= 89

## model B: employment
modelB.coxph = coxph(Surv(ttr, relapse)~employment)
modelB.coxph

## Call:
## coxph(formula = Surv(ttr, relapse) ~ employment)
## 
##                   coef exp(coef) se(coef)     z     p
## employmentother 0.1982    1.2192   0.2371 0.836 0.403
## employmentpt    0.4500    1.5683   0.3229 1.394 0.163
## 
## Likelihood ratio test=2.06  on 2 df, p=0.357
## n= 125, number of events= 89

## model C: ageGroup4+employment
modelC.coxph = coxph(Surv(ttr, relapse)~ageGroup4+employment)
modelC.coxph

## Call:
## coxph(formula = Surv(ttr, relapse) ~ ageGroup4 + employment)
## 
##                    coef exp(coef) se(coef)      z       p
## ageGroup435-49  -0.1299    0.8782   0.3213 -0.404 0.68594
## ageGroup450-64  -1.0239    0.3592   0.3585 -2.856 0.00429
## ageGroup465+    -0.7825    0.4573   0.5046 -1.551 0.12102
## employmentother  0.5257    1.6917   0.2748  1.913 0.05577
## employmentpt     0.5001    1.6489   0.3315  1.508 0.13143
## 
## Likelihood ratio test=16.79  on 5 df, p=0.004922
## n= 125, number of events= 89

### log-likelihood
logLik(modelA.coxph)

## 'log Lik.' -380.043 (df=3)

logLik(modelB.coxph)

## 'log Lik.' -385.1232 (df=2)

logLik(modelC.coxph)

## 'log Lik.' -377.7597 (df=5)

### model A vs model C
pchisq(2*(as.numeric(logLik(modelC.coxph))-as.numeric(logLik(modelA.coxph))), df=5-3, lower.tail=F)

## [1] 0.1019462

### model B vs model C
pchisq(2*(as.numeric(logLik(modelC.coxph))-as.numeric(logLik(modelB.coxph))), df=5-2, lower.tail=F)

## [1] 0.002065407

### model A vs null
model.null.coxph = coxph(Surv(ttr, relapse)~1)
logLik(model.null.coxph)

## 'log Lik.' -386.1533 (df=0)

pchisq(2*(as.numeric(logLik(modelA.coxph))-as.numeric(logLik(model.null.coxph))), df=3-0, lower.tail=F)

## [1] 0.006664492

### the anova() function is a more direct way
anova(modelA.coxph, modelC.coxph)

## Analysis of Deviance Table
##  Cox model: response is  Surv(ttr, relapse)
##  Model 1: ~ ageGroup4
##  Model 2: ~ ageGroup4 + employment
##    loglik  Chisq Df Pr(>|Chi|)
## 1 -380.04                     
## 2 -377.76 4.5666  2     0.1019

6 Model Selection and Interpretation

6.1 Covariate Adjustment

6.2 Categorical and Continuous Covariates

6.3 Hypothesis Testing for Nested Models

6.4 The Akaike Information Criterion for Comparing Non-nested Models

6.5 Including Smooth Estimates of Continuous Covariates in a Survival Model