Chapter 8 Poisson regression & analysis of curved effects

This exercise deals with modelling incidence rates using Poisson regression. Our special interest is in estimating and reporting curved effects of continuous explanatory variables on the hazard rate, i.e. the theoretical incidence rate

We analyse the testisDK data found in the Epi package. It contains the numbers of cases of testis cancer and mid-year populations (person-years) in 1-year age groups in Denmark during 1943-96. In this analysis age and calendar time are first treated as categorical but finally, a penalized spline model is fitted.

8.1 Testis cancer: Data input and housekeeping

Load the packages and the data set, and inspect its structure:

library(Epi)
library(mgcv)

## Loading required package: nlme

## This is mgcv 1.9-1. For overview type 'help("mgcv-package")'.

data(testisDK)
str(testisDK)

## 'data.frame':    4860 obs. of  4 variables:
##  $ A: num  0 1 2 3 4 5 6 7 8 9 ...
##  $ P: num  1943 1943 1943 1943 1943 ...
##  $ D: num  1 1 0 1 0 0 0 0 0 0 ...
##  $ Y: num  39650 36943 34588 33267 32614 ...

summary(testisDK)

##        A              P              D                Y          
##  Min.   : 0.0   Min.   :1943   Min.   : 0.000   Min.   :  471.7  
##  1st Qu.:22.0   1st Qu.:1956   1st Qu.: 0.000   1st Qu.:18482.2  
##  Median :44.5   Median :1970   Median : 1.000   Median :28636.0  
##  Mean   :44.5   Mean   :1970   Mean   : 1.812   Mean   :26239.8  
##  3rd Qu.:67.0   3rd Qu.:1983   3rd Qu.: 2.000   3rd Qu.:36785.5  
##  Max.   :89.0   Max.   :1996   Max.   :17.000   Max.   :47226.8

head(testisDK)

##   A    P D        Y
## 1 0 1943 1 39649.50
## 2 1 1943 1 36942.83
## 3 2 1943 0 34588.33
## 4 3 1943 1 33267.00
## 5 4 1943 0 32614.00
## 6 5 1943 0 32020.33

There are nearly 5000 observations from 90 one-year age groups and 54 calendar years. To get a clearer picture of what’s going on, we do some housekeeping. The age range will be limited to 15-79 years, and age and period are both categorized into 5-year intervals – following to the time-honoured practice in epidemiology.

tdk <- subset(testisDK, A > 14 & A < 80)
tdk$Age <- cut(tdk$A, br = 5 * (3:16), include.lowest = TRUE, right = FALSE)
nAge <- length(levels(tdk$Age))
tdk$Per <- cut(tdk$P,
  br = seq(1943, 1998, by = 5),
  include.lowest = TRUE, right = FALSE)
nPer <- length(levels(tdk$Per))

8.2 Some descriptive analysis

Computation and tabulation of incidence rates

Tabulate numbers of cases and person-years, and compute the incidence rates (per 100,000 y) in each 5 y \(\times\) 5 y cell using stat.table(). Take a look at the structure of the thus created object

tab <- stat.table(
  index = list(Age, Per),
  contents = list(
    D = sum(D),
    Y = sum(Y / 1000),
    rate = ratio(D, Y, 10^5)
  ),
  margins = TRUE,
  data = tdk
)
str(tab)

##  'stat.table' num [1:3, 1:14, 1:12] 10 773.81 1.29 30 813.02 ...
##  - attr(*, "dimnames")=List of 3
##   ..$ contents: Named chr [1:3] "D" "Y" "rate"
##   .. ..- attr(*, "names")= chr [1:3] "D" "Y" "rate"
##   ..$ Age     : chr [1:14] "[15,20)" "[20,25)" "[25,30)" "[30,35)" ...
##   ..$ Per     : chr [1:12] "[1943,1948)" "[1948,1953)" "[1953,1958)" "[1958,1963)" ...
##  - attr(*, "table.fun")= chr [1:3] "sum" "sum" "ratio"

The table is too wide to be readable as such. A graphical presentation is more informative.

From the saved table object tab you can plot an age-incidence curve for each period separately, after you have checked the structure of the table, so that you know the relevant dimensions in it. There is a function rateplot() in Epi that does default plotting of tables of rates (see the help page of rateplot)

str(tab)

##  'stat.table' num [1:3, 1:14, 1:12] 10 773.81 1.29 30 813.02 ...
##  - attr(*, "dimnames")=List of 3
##   ..$ contents: Named chr [1:3] "D" "Y" "rate"
##   .. ..- attr(*, "names")= chr [1:3] "D" "Y" "rate"
##   ..$ Age     : chr [1:14] "[15,20)" "[20,25)" "[25,30)" "[30,35)" ...
##   ..$ Per     : chr [1:12] "[1943,1948)" "[1948,1953)" "[1953,1958)" "[1958,1963)" ...
##  - attr(*, "table.fun")= chr [1:3] "sum" "sum" "ratio"

par(mfrow = c(1, 1))
rateplot(
  rates = tab[3, 1:nAge, 1:nPer], which = "ap", ylim = c(1, 30),
  age = seq(15, 75, 5), per = seq(1943, 1993, 5),
  col = heat.colors(16), ann = TRUE
)

What can you conclude about the trend in age-specific incidence rates over calendar time? What about the effect of age; is there any common pattern in the age-incidence curves across the periods?

8.3 Age and period as categorical factors

We shall first fit a Poisson regression model with log link on age and period model in the traditional way, in which both factors are treated as categorical. The model is additive on the log-rate scale. It is useful to scale the person-years to be expressed in \(10^5\) y.

In fitting the model we utilize the poisreg family object found in package Epi.

tdk$Y <- tdk$Y / 100000
mCat <- glm(cbind(D, Y) ~ Age + Per,
  family = poisreg(link = log), data = tdk )
round(ci.exp(mCat), 2)

##                exp(Est.) 2.5% 97.5%
## (Intercept)         1.47 1.26  1.72
## Age[20,25)          3.13 2.75  3.56
## Age[25,30)          4.90 4.33  5.54
## Age[30,35)          5.50 4.87  6.22
## Age[35,40)          4.78 4.22  5.42
## Age[40,45)          3.66 3.22  4.16
## Age[45,50)          2.60 2.27  2.97
## Age[50,55)          1.94 1.68  2.25
## Age[55,60)          1.47 1.25  1.72
## Age[60,65)          0.98 0.82  1.18
## Age[65,70)          0.92 0.76  1.12
## Age[70,75)          0.90 0.73  1.12
## Age[75,80]          0.86 0.67  1.11
## Per[1948,1953)      1.12 0.96  1.30
## Per[1953,1958)      1.30 1.13  1.50
## Per[1958,1963)      1.53 1.33  1.76
## Per[1963,1968)      1.68 1.47  1.92
## Per[1968,1973)      1.98 1.74  2.25
## Per[1973,1978)      2.33 2.05  2.64
## Per[1978,1983)      2.66 2.35  3.01
## Per[1983,1988)      2.83 2.50  3.20
## Per[1988,1993)      3.08 2.73  3.47
## Per[1993,1998]      3.31 2.93  3.74

What do the estimated rate ratios tell about the age and period effects?

A graphical inspection of point estimates and confidence intervals can be obtained as follows. In the beginning it is useful to define shorthands for the pertinent mid-age and mid-period values of the different intervals

aMid <- seq(17.5, 77.5, by = 5)
pMid <- seq(1945, 1995, by = 5)
par(mfrow = c(1, 2))
matplot(aMid, rbind(c(1,1,1), ci.exp(mCat)[2:13, ]), type = "o", pch = 16,     
   log = "y", cex.lab = 1.5, cex.axis = 1.5, col= c("black", "blue", "blue"),
  xlab = "Age (years)", ylab = "Rate ratio" )
matplot(pMid, rbind(c(1,1,1), ci.exp(mCat)[14:23, ]), type = "o", pch = 16,
  log = "y", cex.lab = 1.5, cex.axis = 1.5, col=c("black", "blue", "blue"),
  xlab = "Calendar year - 1900", ylab = "Rate ratio" )

In the fitted model the reference category for each factor was the first one. As age is the dominating factor, it may be more informative to remove the intercept from the model. As a consequence the age effects describe fitted rates at the reference level of the period factor. For the latter one could choose the middle period 1968-72 using Relevel().

tdk$Per70 <- Relevel(tdk$Per, ref = 6)
mCat2 <- glm(cbind(D, Y) ~ -1 + Age + Per70,
  family = poisreg(link = log), data = tdk )
round(ci.exp(mCat2), 2)

##                  exp(Est.)  2.5% 97.5%
## Age[15,20)            2.91  2.55  3.33
## Age[20,25)            9.12  8.31 10.01
## Age[25,30)           14.28 13.11 15.55
## Age[30,35)           16.03 14.72 17.46
## Age[35,40)           13.94 12.76 15.23
## Age[40,45)           10.66  9.71 11.71
## Age[45,50)            7.57  6.83  8.39
## Age[50,55)            5.67  5.05  6.36
## Age[55,60)            4.28  3.75  4.88
## Age[60,65)            2.85  2.43  3.35
## Age[65,70)            2.68  2.25  3.19
## Age[70,75)            2.63  2.16  3.20
## Age[75,80]            2.51  1.98  3.18
## Per70[1943,1948)      0.51  0.44  0.58
## Per70[1948,1953)      0.57  0.50  0.64
## Per70[1953,1958)      0.66  0.58  0.74
## Per70[1958,1963)      0.77  0.69  0.87
## Per70[1963,1968)      0.85  0.76  0.95
## Per70[1973,1978)      1.18  1.07  1.30
## Per70[1978,1983)      1.35  1.22  1.48
## Per70[1983,1988)      1.43  1.30  1.57
## Per70[1988,1993)      1.56  1.42  1.70
## Per70[1993,1998]      1.67  1.53  1.84

Let us also plot estimates from the latter model, too.

par(mfrow = c(1, 2))
matplot(aMid, rbind(c(1,1,1), ci.exp(mCat2)[2:13, ]), type = "o", pch = 16,     
   log = "y", cex.lab = 1.5, cex.axis = 1.5, col=c("black", "blue", "blue"),
  xlab = "Age (years)", ylab = "Rate" )
matplot(pMid, rbind(ci.exp(mCat2)[14:18, ], c(1,1,1), ci.exp(mCat2)[19:23, ]),
        type = "o", pch = 16, log = "y", cex.lab = 1.5, cex.axis = 1.5,
        col=c("black", "blue", "blue"),
  xlab = "Calendar year - 1900", ylab = "Rate ratio" )
abline(h = 1, col = "gray")

8.4 Generalized additive model with penalized splines

It is obvious that the age effect on the log-rate scale is highly non-linear. Yet, it is less clear whether the true period effect deviates from linearity. Nevertheless, there are good reasons to try fitting smooth continuous functions for both time scales.

As the next task we fit a generalized additive model for the log-rate on continuous age and period applying penalized splines with default settings of function gam() in package mgcv. In this fitting an optimal value for the penalty parameter is chosen based on an AIC-like criterion known as UBRE (‘Un-Biased Risk Estimator’)

library(mgcv)
mPen <- mgcv::gam(cbind(D, Y) ~ s(A) + s(P),
  family = poisreg(link = log), data = tdk
)
summary(mPen)

## 
## Family: poisson 
## Link function: log 
## 
## Formula:
## cbind(D, Y) ~ s(A) + s(P)
## 
## Parametric coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.70960    0.01793   95.33   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##        edf Ref.df Chi.sq p-value    
## s(A) 8.143  8.765   2560  <2e-16 ***
## s(P) 3.046  3.790   1054  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.598   Deviance explained = 53.6%
## UBRE = 0.082051  Scale est. = 1         n = 3510

The summary is quite brief, and the only estimated coefficient is the intercept, which sets the baseline level for the log-rates, against which the relative age effects and period effects will be contrasted. On the rate scale the baseline level 5.53 per 100000 y is obtained by exp(1.7096).

See also the default plot for the fitted curves (solid lines) describing the age and the period effects which are interpreted as contrasts to the baseline level on the log-rate scale.

par(mfrow = c(1, 2))
plot(mPen, se=2, seWithMean = TRUE)

The dashed lines describe the approximate 95% confidence band for the pertinent curve. One could get the impression that year 1968 would be some kind of reference value for the period effect, like period 1968-72 chosen as the reference in the categorical model previously fitted. This is not the case, however, because gam() by default parametrizes the spline effects such that the reference level, at which the spline effect is nominally zero, is the overall grand mean value of the log-rate in the data. This corresponds to the principle of sum contrasts (contr.sum) for categorical explanatory factors.

From the summary you will also find that the degrees of freedom value required for the age effect is nearly the same as the default dimension \(k-1 = 9\) of the part of the model matrix (or basis) initially allocated for each smooth function. (Here \(k\) refers to the relevant argument that determines the basis dimension when specifying a smooth term by s() in the model formula). On the other hand the period effect takes just about 3 df.

It is a good idea to do some diagnostic checking of the fitted model

par(mfrow = c(2, 2))
gam.check(mPen)

## 
## Method: UBRE   Optimizer: outer newton
## full convergence after 7 iterations.
## Gradient range [-9.390731e-10,1.362468e-06]
## (score 0.0820511 & scale 1).
## Hessian positive definite, eigenvalue range [0.0002209238,0.0003824007].
## Model rank =  19 / 19 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##        k'  edf k-index p-value    
## s(A) 9.00 8.14    0.93  <2e-16 ***
## s(P) 9.00 3.05    0.95   0.095 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The four diagnostic plots are analogous to some of those used in the context of linear models for Gaussian responses, but not all of them may be as easy to interpret. – Pay attention to the note given in the printed output about the value of k.

Let us refit the model but now with an increased k for age:

mPen2 <- mgcv::gam(cbind(D, Y) ~ s(A, k = 20) + s(P),
  family = poisreg(link = log), data = tdk
)
summary(mPen2)

## 
## Family: poisson 
## Link function: log 
## 
## Formula:
## cbind(D, Y) ~ s(A, k = 20) + s(P)
## 
## Parametric coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.70863    0.01795   95.17   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##         edf Ref.df Chi.sq p-value    
## s(A) 11.132 13.406   2553  <2e-16 ***
## s(P)  3.045  3.788   1054  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.599   Deviance explained = 53.7%
## UBRE = 0.081809  Scale est. = 1         n = 3510

par(mfrow = c(2, 2))
gam.check(mPen2)

## 
## Method: UBRE   Optimizer: outer newton
## full convergence after 6 iterations.
## Gradient range [-2.397829e-12,2.90395e-09]
## (score 0.08180917 & scale 1).
## Hessian positive definite, eigenvalue range [0.00022158,0.0009322215].
## Model rank =  29 / 29 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##         k'   edf k-index p-value    
## s(A) 19.00 11.13    0.93  <2e-16 ***
## s(P)  9.00  3.05    0.95    0.14    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

With this choice of k the df value for age became about 11, which is well below \(k-1 = 19\). Let us plot the fitted curves from this fitting, too

par(mfrow = c(1, 2))
plot(mPen2, seWithMean = TRUE)
abline(v = 1968, h = 0, lty = 3)

There does not seem to have happened any essential changes from the previously fitted curves, so maybe 8 df could, after all, be quite enough for the age effect.

Graphical presentation of the effects using plot.gam() can be improved. For instance, we may present the age effect to describe the mean incidence rates by age, averaged over the whole time span of 54 years. This is obtained by adding the estimated intercept to the estimated smooth curve for the age effect and showing the antilogarithms of the ordinates of the curve. For that purpose we need to extract the intercept and modify the labels of the \(y\)-axis accordingly. The estimated period curve can also be expressed in terms of relative indidence rates in relation to the fitted baseline rate, as determined by the model intercept.

par(mfrow = c(1, 2))
icpt <- coef(mPen2)[1] #  estimated intecept
plot(mPen2,
  seWithMean = TRUE, select = 1, rug = FALSE,
  yaxt = "n", ylim = c(log(1), log(20)) - icpt,
  xlab = "Age (y)", ylab = "Mean rate (/100000 y)"
)
axis(2, at = log(c(1, 2, 5, 10, 20)) - icpt, labels = c(1, 2, 5, 10, 20))
plot(mPen2,
  seWithMean = TRUE, select = 2, rug = FALSE,
  yaxt = "n", ylim = c(log(0.4), log(2)),
  xlab = "Calendat year", ylab = "Relative rate"
)
axis(2, at = log(c(0.5, 0.75, 1, 1.5, 2)), labels = c(0.5, 0.75, 1, 1.5, 2))
abline(v = 1968, h = 0, lty = 3)

Homework You could continue the analysis of these data by fitting an age-cohort model as an alternative to the age-period model, as well as an age-cohort-period model utilizing function apc.fit() in Epi. See (http://bendixcarstensen.com/APC/) for details.