Chapter 12 Time-splitting, time-scales and SMR

This exercise is about mortaity among Danish Diabetes patients. It is based on the dataset DMlate, a random sample of 10,000 patients from the Danish Diabetes Register (scrambeled dates), all with date of diagnosis after 1994.

Start by loading the relevant packages:

library(Epi)
library(popEpi)
library(mgcv)
library(tidyverse)

Then load the data and take a look at the data:

data(DMlate)
str(DMlate)

'data.frame':   10000 obs. of  7 variables:
 $ sex  : Factor w/ 2 levels "M","F": 2 1 2 2 1 2 1 1 2 1 ...
 $ dobth: num  1940 1939 1918 1965 1933 ...
 $ dodm : num  1999 2003 2005 2009 2009 ...
 $ dodth: num  NA NA NA NA NA ...
 $ dooad: num  NA 2007 NA NA NA ...
 $ doins: num  NA NA NA NA NA NA NA NA NA NA ...
 $ dox  : num  2010 2010 2010 2010 2010 ...

You can get a more detailed explanation of the data by referring to the help page:

?DMlate

Set up the dataset as a Lexis object with age, calendar time and duration of diabetes as timescales, and date of death as event. Make sure that you know what each of the arguments to Lexis mean:

LL <- Lexis(entry = list(A = dodm - dobth, 
                         P = dodm, 
                       dur = 0),
             exit = list(P = dox),
      exit.status = factor(!is.na(dodth), 
                           labels = c("Alive", "Dead")),
             data = DMlate)

NOTE: entry.status has been set to "Alive" for all.
NOTE: Dropping  4  rows with duration of follow up < tol

Take a look at the first few lines of the resulting dataset, for example using head().

Get an overview of the mortality by using stat.table to tabulate no. deaths, person-years (lex.dur) and the crude mortality rate by sex. Try:

stat.table(sex,
           list(D = sum(lex.Xst == "Dead"),
                Y = sum(lex.dur),
             rate = ratio(lex.Xst == "Dead", 
                          lex.dur, 
                          1000)),
          margins = TRUE,
             data = LL)

 --------------------------------- 
 sex           D        Y    rate  
 --------------------------------- 
 M       1343.00 27614.21   48.63  
 F       1156.00 26659.05   43.36  

 Total   2499.00 54273.27   46.04  
 ---------------------------------

# stat.table is more versatile than xtabs:
xtabs(cbind(D = lex.Xst == "Dead",
            Y = lex.dur) 
      ~ sex, 
      data = LL)


sex     D     Y
  M  1343 27614
  F  1156 26659

If we want to assess how mortality depends on age, calendar time and duration or how it relates to population mortality, we should in principle split the follow-up along all three time scales. In practice it is sufficient to split it along one of the time-scales and then use the value of each of the time-scales at the left endpoint of the intervals.

Use splitLexis (or splitMulti from the popEpi package) to split the follow-up along the age-axis in sutiable intervals (here set to 1/2 year, but really immaterial as long as it is small):
```
SL <- splitLexis(LL, 
                 breaks = seq(0, 125, 1 / 2), 
             time.scale = "A")
summary(SL)
```
```
Transitions:
     To
From     Alive Dead  Records:  Events: Risk time:  Persons:
  Alive 115974 2499    118473     2499      54273      9996
```
How many records are now in the dataset? How many person-years? Compare to the original Lexis-dataset.

12.1 Age-specific mortality

Now estimate age-specific mortality curves for men and women separately, using splines as implemented in gam. We use k = 20 to be sure to catch any irregularities by age.

r.m <- mgcv::gam(cbind(lex.Xst == "Dead", lex.dur) ~ s(A, k = 20),
                 family = poisreg,
                   data = subset(SL, sex == "M"))

Make sure you understand all the components on this modeling statement. Fit the same model for women.

There is a convenient wrapper for this, exploiting the Lexis structure of data, but which does not have an update

r.m <- gam.Lexis(subset(SL, sex == "M"), ~ s(A, k = 20))

mgcv::gam Poisson analysis of Lexis object subset(SL, sex == "M") with log link:
Rates for the transition:
Alive->Dead

r.f <- gam.Lexis(subset(SL, sex == "F"), ~ s(A, k = 20))

mgcv::gam Poisson analysis of Lexis object subset(SL, sex == "F") with log link:
Rates for the transition:
Alive->Dead

Now, extract the estimated rates by using the wrapper function ci.pred that computes predicted rates and confidence limits for these.

glm.Lexis and gam.Lexis use the poisreg family that will return the rates in the (inverse) units in which the person-years were given; that is the units in which lex.dur is recorded.

nd <- data.frame(A = seq(20, 90, 0.5))
p.m <- ci.pred(r.m, newdata = nd)
p.f <- ci.pred(r.f, newdata = nd)
str(p.m)

 num [1:141, 1:3] 0.00132 0.00137 0.00142 0.00147 0.00152 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:141] "1" "2" "3" "4" ...
  ..$ : chr [1:3] "Estimate" "2.5%" "97.5%"

Plot the predicted rates for men and women together - using for example matplot or matshade.

    par(mar = c(3.5,3.5,1,1),
        mgp = c(3,1,0) / 1.6,
        las = 1,
        bty = "n", 
       lend = "butt")
matplot(nd$A, cbind(p.m, p.f) * 1000,
        type = "l",
         col = rep(c("blue", "red"), each = 3),
         lwd = c(3, 1, 1),
         lty = 1,
         log = "y", yaxt = "n",
        xlab = "Age", 
        ylab = "Mortality per 1000 PY")
axis(side = 2, 
     at = ll <- outer( c(5, 10, 20), -1:1, function(x,y) x * 10^y),
     labels = ll)

Further time scales: period and duration

We now want to model the mortality rates among diabetes patients also including current date and duration of diabetes, using penalized splines. Use the argument bs = "cr" to s() to get cubic splines instead of thin plate ("tp") splines which is the default.

As before specify the model exploiting the Lexis class of the dataset, try:

Mcr <- gam.Lexis(subset(SL, sex == "M"),
                 ~ s(A, bs = "cr", k = 10) +
                   s(P, bs = "cr", k = 10) +
                 s(dur, bs = "cr", k = 10))

mgcv::gam Poisson analysis of Lexis object subset(SL, sex == "M") with log link:
Rates for the transition:
Alive->Dead

summary(Mcr)


Family: poisson 
Link function: log 

Formula:
cbind(trt(Lx$lex.Cst, Lx$lex.Xst) %in% trnam, Lx$lex.dur) ~ s(A, 
    bs = "cr", k = 10) + s(P, bs = "cr", k = 10) + s(dur, bs = "cr", 
    k = 10)

Parametric coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -3.5407     0.0494   -71.7   <2e-16

Approximate significance of smooth terms:
        edf Ref.df Chi.sq p-value
s(A)   3.65   4.52 1013.2 < 2e-16
s(P)   1.02   1.05   17.6 3.5e-05
s(dur) 7.59   8.38   74.5 < 2e-16

R-sq.(adj) =  0.00333   Deviance explained = 9.87%
UBRE = -0.8054  Scale est. = 1         n = 60347

Fit the same model for women as well. Are the models reasonably fitting?

Fcr <- gam.Lexis(subset(SL, sex == "F"),
                 ~ s(A, bs = "cr", k = 10) +
                   s(P, bs = "cr", k = 10) +
                 s(dur, bs = "cr", k = 10))

mgcv::gam Poisson analysis of Lexis object subset(SL, sex == "F") with log link:
Rates for the transition:
Alive->Dead

summary(Fcr)


Family: poisson 
Link function: log 

Formula:
cbind(trt(Lx$lex.Cst, Lx$lex.Xst) %in% trnam, Lx$lex.dur) ~ s(A, 
    bs = "cr", k = 10) + s(P, bs = "cr", k = 10) + s(dur, bs = "cr", 
    k = 10)

Parametric coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -3.7848     0.0581   -65.2   <2e-16

Approximate significance of smooth terms:
        edf Ref.df Chi.sq p-value
s(A)   2.67   3.37  988.5 < 2e-16
s(P)   1.90   2.39   20.1 0.00014
s(dur) 5.97   6.97   39.0 < 2e-16

R-sq.(adj) =  0.00417   Deviance explained = 11.1%
UBRE = -0.82405  Scale est. = 1         n = 58126

Plot the estimated effects, using the default plot method for gam objects. Remember that there are three effects estimated, so it is useful set up a multi-panel display, and for the sake of comparability to set ylim to the same for men and women:
```
par(mfrow = c(2, 3))
plot(Mcr, ylim = c(-3, 3))
plot(Fcr, ylim = c(-3, 3))
```
```
par(mfcol = c(3, 2))
plot(Mcr, ylim = c(-3, 3))
plot(Fcr, ylim = c(-3, 3))
```
What is the absolute scale for these effects?

Compare the fit of the naive model with just age and the three-factor models, using anova, e.g.:

anova(Mcr, r.m, test = "Chisq")

Analysis of Deviance Table

Model 1: cbind(trt(Lx$lex.Cst, Lx$lex.Xst) %in% trnam, Lx$lex.dur) ~ s(A, 
    bs = "cr", k = 10) + s(P, bs = "cr", k = 10) + s(dur, bs = "cr", 
    k = 10)
Model 2: cbind(trt(Lx$lex.Cst, Lx$lex.Xst) %in% trnam, Lx$lex.dur) ~ s(A, 
    k = 20)
  Resid. Df Resid. Dev    Df Deviance Pr(>Chi)
1     60332      11717                        
2     60340      11812 -7.95    -95.1   <2e-16

What do you conclude?

The model we fitted has three time-scales: current age, current date and current duration of diabetes, so the effects that we report are not immediately interpretable, as they are (as in any kind of multiple regressions) to be interpreted as all else equal which they are not, as the three time scales advance simultaneously at the same pace.

The reporting would therefore more naturally be on the mortality scale as a function of age, but showing the mortality for persons diagnosed in different ages, using separate displays for separate years of diagnosis. This is most easily done using the ci.pred function with the newdata = argument. So a person diagnosed in age 50 in 1995 will have a mortality measured in cases per 1000 PY as:
```
pts <- seq(0, 12, 1/4)
nd <- data.frame(A = 50   + pts, 
                 P = 1995 + pts, 
               dur =        pts)
m.pr <- ci.pred(Mcr, newdata = nd)
```
Note that because we used gam.Lexis which uses the poisreg family we need not specify lex.dur as a variable in the prediction data frame nd. Predictions will be rates in the same units as lex.dur (well, the inverse).

Now take a look at the result from the ci.pred statement and construct prediction of mortality for men and women diagnosed in a range of ages, say 50, 60, 70, and plot these together in the same graph:
```
cbind(nd, ci.pred(Mcr, newdata = nd))[1:10,]
```

From figure it seems that the duration effect is over-modeled, so refit constraining the d.f. to 5:

Mcr <- gam.Lexis(subset(SL, sex == "M"),
                 ~ s(A, bs = "cr", k = 10) +
                   s(P, bs = "cr", k = 10) +
                 s(dur, bs = "cr", k = 5))

mgcv::gam Poisson analysis of Lexis object subset(SL, sex == "M") with log link:
Rates for the transition:
Alive->Dead

Fcr <- gam.Lexis(subset(SL, sex == "F"),
                 ~ s(A, bs = "cr", k = 10) +
                   s(P, bs = "cr", k = 10) +
                 s(dur, bs = "cr", k = 5))

mgcv::gam Poisson analysis of Lexis object subset(SL, sex == "F") with log link:
Rates for the transition:
Alive->Dead

Plot the estimated rates from the revised models. What do you conclude from the plots?

12.2 SMR

The SMR is the Standardized Mortality Ratio, which is the mortality rate-ratio between the diabetes patients and the general population. In real studies we would subtract the deaths and the person-years among the diabetes patients from those of the general population, but since we do not have access to these, we make the comparison to the general population at large, i.e. also including the diabetes patients.

So we now want to include the population mortality rates as a fixed variable in the split dataset; for each record in the split dataset we attach the value of the population mortality for the relevant sex, and and calendar time.

This can be achieved in two ways: Either we just use the current split of follow-up time and allocate the population mortality rates for some suitably chosen (mid-)point of the follow-up in each, or we make a second split by date, so that follow-up in the diabetes patients is in the same classification of age and data as the population mortality table.

We will use the former approach, using the dataset split in 6 month intervals, and then include as an extra variable the population mortality as available from the data set M.dk.

First create the variables in the diabetes dataset that we need for matching with the population mortality data, that is sex and age and date at the midpoint of each of the intervals (or rather at a point 3 months after the left endpoint of the interval; recall we split the follow-up in 6 month intervals).

We need to have variables of the same type when we merge, so we must transform the sex variable in M.dk to a factor, and must for each follow-up interval in the SL data have an age and a period variable that can be used in merging with the population data.

str(SL)

Classes 'Lexis' and 'data.frame':   118473 obs. of  14 variables:
 $ lex.id : int  1 1 1 1 1 1 1 1 1 1 ...
 $ A      : num  58.7 59 59.5 60 60.5 ...
 $ P      : num  1999 1999 2000 2000 2001 ...
 $ dur    : num  0 0.339 0.839 1.339 1.839 ...
 $ lex.dur: num  0.339 0.5 0.5 0.5 0.5 ...
 $ lex.Cst: Factor w/ 2 levels "Alive","Dead": 1 1 1 1 1 1 1 1 1 1 ...
 $ lex.Xst: Factor w/ 2 levels "Alive","Dead": 1 1 1 1 1 1 1 1 1 1 ...
 $ sex    : Factor w/ 2 levels "M","F": 2 2 2 2 2 2 2 2 2 2 ...
 $ dobth  : num  1940 1940 1940 1940 1940 ...
 $ dodm   : num  1999 1999 1999 1999 1999 ...
 $ dodth  : num  NA NA NA NA NA NA NA NA NA NA ...
 $ dooad  : num  NA NA NA NA NA NA NA NA NA NA ...
 $ doins  : num  NA NA NA NA NA NA NA NA NA NA ...
 $ dox    : num  2010 2010 2010 2010 2010 ...
 - attr(*, "breaks")=List of 3
  ..$ A  : num [1:251] 0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 ...
  ..$ P  : NULL
  ..$ dur: NULL
 - attr(*, "time.scales")= chr [1:3] "A" "P" "dur"
 - attr(*, "time.since")= chr [1:3] "" "" ""

SL$Am <- floor(SL$A + 0.25)
SL$Pm <- floor(SL$P + 0.25)
data(M.dk)
str(M.dk)

'data.frame':   7800 obs. of  6 variables:
 $ A   : num  0 0 0 0 0 0 0 0 0 0 ...
 $ sex : num  1 2 1 2 1 2 1 2 1 2 ...
 $ P   : num  1974 1974 1975 1975 1976 ...
 $ D   : num  459 303 435 311 405 258 332 205 312 233 ...
 $ Y   : num  35963 34383 36099 34652 34965 ...
 $ rate: num  12.76 8.81 12.05 8.97 11.58 ...
 - attr(*, "Contents")= chr "Number of deaths and risk time in Denmark"

M.dk <- transform(M.dk,
                  Am = A,
                  Pm = P,
                 sex = factor(sex, labels = c("M", "F")))
str(M.dk)

'data.frame':   7800 obs. of  8 variables:
 $ A   : num  0 0 0 0 0 0 0 0 0 0 ...
 $ sex : Factor w/ 2 levels "M","F": 1 2 1 2 1 2 1 2 1 2 ...
 $ P   : num  1974 1974 1975 1975 1976 ...
 $ D   : num  459 303 435 311 405 258 332 205 312 233 ...
 $ Y   : num  35963 34383 36099 34652 34965 ...
 $ rate: num  12.76 8.81 12.05 8.97 11.58 ...
 $ Am  : num  0 0 0 0 0 0 0 0 0 0 ...
 $ Pm  : num  1974 1974 1975 1975 1976 ...

Then match the rates from M.dk into SL; sex, Am and Pm are the common variables, and therefore the match is on these variables:

SLr <- merge(SL, 
             M.dk[, c("sex", "Am", "Pm", "rate")])
dim(SL)

[1] 118473     16

dim(SLr)

[1] 118454     17

This merge (remember to ?merge!) only takes rows that have information from both datasets, hence the slightly fewer rows in SLr than in SL.

Compute the expected number of deaths as the person-time multiplied (lex.dur) by the corresponding population rate, and put it in a new variable, E, say (Expected). Use stat.table to make a table of observed, expected and the ratio (SMR) by age (suitably grouped, look for cut) and sex.

Fit a poisson model with sex as the explanatory variable and log-expected as offset to derive the SMR (and c.i.). Some of the population mortality rates are 0, so you need to exclude those records from the analysis.

msmr <- glm((lex.Xst == "Dead") ~ sex - 1,
            offset = log(E),
            family = poisson,
              data = subset(SLr, E > 0))
ci.exp(msmr)

     exp(Est.) 2.5% 97.5%
sexM      1.69 1.60  1.78
sexF      1.54 1.46  1.63

Do you recognize the numbers?

The same model can be fitted a bit simpler by the poisreg family, try:

msmr <- glm(cbind(lex.Xst == "Dead", E) ~ sex - 1, 
            family = poisreg,
              data = subset(SLr, E > 0))
ci.exp(msmr)

     exp(Est.) 2.5% 97.5%
sexM      1.69 1.60  1.78
sexF      1.54 1.46  1.63

We can assess the ratios of SMRs between men and women by using the ctr.mat argument which should be a matrix:

(CM <- rbind(M = c(1, 0),
             W = c(0, 1),
         "M/F" = c(1, -1)))

    [,1] [,2]
M      1    0
W      0    1
M/F    1   -1

round(ci.exp(msmr, ctr.mat = CM), 2)

    exp(Est.) 2.5% 97.5%
M        1.69 1.60  1.78
W        1.54 1.46  1.63
M/F      1.09 1.01  1.18

What do you conclude about the mortality rates among men and women?

12.3 SMR modeling

Now model the SMR using age and date of diagnosis and diabetes duration as explanatory variables, including the expected-number instead of the person-years, using separate models for men and women.

Note that you cannot use gam.Lexis from the code you used for fitting models for the rates, you need to use gam with the poisreg family. And remember to exclude those units where no deaths in the population occur (that is where the population mortality rate is 0).

Msmr <- gam(cbind(lex.Xst == "Dead", E) 
            ~ s(  A, bs = "cr", k = 5) +
              s(  P, bs = "cr", k = 5) +
              s(dur, bs = "cr", k = 5),
            family = poisreg,
              data = subset(SLr, E > 0 & sex == "M"))
ci.exp(Msmr)

            exp(Est.)  2.5% 97.5%
(Intercept)     2.167 2.000 2.347
s(A).1          0.596 0.518 0.686
s(A).2          0.893 0.865 0.921
s(A).3          0.366 0.282 0.476
s(A).4          0.459 0.376 0.559
s(P).1          0.989 0.973 1.006
s(P).2          0.961 0.909 1.016
s(P).3          0.903 0.783 1.041
s(P).4          0.917 0.813 1.035
s(dur).1        0.658 0.577 0.751
s(dur).2        0.845 0.761 0.937
s(dur).3        0.783 0.692 0.886
s(dur).4        1.818 1.104 2.992

Fsmr <- update(Msmr, data = subset(SLr, E > 0 & sex == "F"))

Plot the estimated smooth effects for both men and women using e.g. plot.gam. What do you see?

Plot the predicted SMRs from the models for men and women diagnosed in ages 50, 60 and 70 as you did for the rates. What do you see?

par(mfrow = c(1,1))
n50 <- nd
n60 <- mutate(n50, A = A + 10)
n70 <- mutate(n50, A = A + 20)
head(n70)

     A    P  dur
1 70.0 1995 0.00
2 70.2 1995 0.25
3 70.5 1996 0.50
4 70.8 1996 0.75
5 71.0 1996 1.00
6 71.2 1996 1.25

matshade(n50$A, cbind(ci.pred(Msmr, n50),
                      ci.pred(Fsmr, n50)), plot = TRUE,
         col = c("blue", "red"), lwd = 3,
         ylim = c(0.5, 5), log  = "y", xlim = c(50, 80))
matshade(n60$A, cbind(ci.pred(Msmr, n60),
                      ci.pred(Fsmr, n60)),
         col = c("blue", "red"), lwd = 3)
matshade(n70$A, cbind(ci.pred(Msmr, n70),
                      ci.pred(Fsmr, n70)),
         col = c("blue", "red"), lwd = 3)
abline(h = 1)
abline(v = 50 + 0:5, lty = 3, col = "gray")

Describe the shapes of the curves. What aspects of the shapes are induced by the model ?

Try to simplify the model to one with a simple sex effect, separate linear effects of age and date of follow-up for each sex, and a smooth effect of duration common for both sexes, giving an estimate of the change in SMR by age and calendar time.

Bsmr <- gam(cbind(lex.Xst == "Dead", E) 
            ~ sex / A +
              sex / P +
              s(dur, bs = "cr", k = 5),
            family = poisreg,
              data = subset(SLr, E > 0))
round(ci.exp(Bsmr)[-1,], 3)

         exp(Est.)  2.5%    97.5%
sexF      5.21e+04 0.000 2.38e+23
sexM:A    9.81e-01 0.977 9.86e-01
sexF:A    9.80e-01 0.975 9.86e-01
sexM:P    9.87e-01 0.972 1.00e+00
sexF:P    9.81e-01 0.965 9.98e-01
s(dur).1  6.63e-01 0.601 7.32e-01
s(dur).2  8.61e-01 0.795 9.31e-01
s(dur).3  8.85e-01 0.805 9.74e-01
s(dur).4  1.41e+00 0.945 2.11e+00

How much does SMR change by each year of age? And by each calendar year?

What is the meaning of the sexF parameter?

Use your previous code to plot the predicted mortality from this model too. Are the predicted SMR curves credible?

m50 <- mutate(n50, sex = "M")
f50 <- mutate(n50, sex = "F")
m60 <- mutate(m50, A = A + 10)
f60 <- mutate(f50, A = A + 10)
m70 <- mutate(m50, A = A + 20)
f70 <- mutate(f50, A = A + 20)
matshade(n50$A, cbind(ci.pred(Bsmr, m50),
                      ci.pred(Bsmr, f50)), plot = TRUE,
         col = c("blue", "red"), lwd = 3,
         ylim = c(0.5, 5), log  = "y", xlim = c(50, 80))
matshade(n60$A, cbind(ci.pred(Bsmr, m60),
                      ci.pred(Bsmr, f60)),
         col = c("blue", "red"), lwd = 3)
matshade(n70$A, cbind(ci.pred(Bsmr, m70),
                      ci.pred(Bsmr, f70)),
         col = c("blue", "red"), lwd = 3)
abline(h = 1)
abline(h = 1:5, lty = 3, col = "gray")

What is your conclusion about SMR for diabetes patients relative to the general popuation?