Chapter 6 Analysis of hazard rates, their ratios and differences and binary regression

This exercise is very prescriptive, so you should make an effort to really understand everything you type into R. Consult the relevant slides of the lecture on Poisson and Binary regression …

6.1 Hand calculations for a single rate

Let \(\lambda\) be the true hazard rate or theoretical incidence rate of a given outcome event. Its estimator is the empirical incidence rate \(\widehat\lambda = D/Y\) = no. cases/person-years. Recall that the standard error of the empirical rate is SE\((\widehat\lambda) = \widehat\lambda/\sqrt{D}\).

The simplest approximate 95% confidence interval (CI) for \(\lambda\) is given by \[ \widehat\lambda \pm 1.96 \times SE(\widehat\lambda) \]

Suppose \(15\) outcome events are observed during \(5532\) person-years in a given study cohort. Let’s use R as a simple desk calculator to estimate the underlying hazard rate \(\lambda\) (in 1000 person-years; therefore 5.532) and to get the first version of an approximate confidence interval:

library(Epi)
options(digits = 4) #  to cut down decimal points in the output

D <- 15
Y <- 5.532 # thousands of years!
rate <- D / Y
SE.rate <- rate / sqrt(D)
c(rate, SE.rate, rate + c(-1.96, 1.96) * SE.rate)

6.2 Poisson model for a single rate with logarithmic link

You are able to estimate the hazard rate \(\lambda\) and compute its CI with a Poisson regression model, as described in the relevant slides in the lecture handout.

Poisson regression is a generalized linear model in which the family, i.e. the distribution of the response variable, is assumed to be the Poisson distribution. The most commonly applied link function in Poisson regression is the natural logarithm; log for short.

A family object poisreg, a modified version of the original poisson family object, is available in package Epi. When using this, the response is defined as a matrix of two columns: numbers of cases \(D\) and person-years \(Y\), these being combined into a matrix by cbind(D,Y). No specification of offset is needed.

mreg <- glm(cbind(D, Y) ~ 1, family = poisreg(link = log))
ci.exp(mreg)

If you want confidence interval for log rate

mreg <- glm(cbind(D, Y) ~ 1, family = poisreg(link = log))
ci.lin(mreg)[, c(1, 5, 6)]

In this course we endorse the use of family poisreg because of its advantages in more general settings.

6.3 Poisson model for a single rate with identity link

The approach leaning on having the number of cases \(D\) as the response and log\((Y)\) as an offset, is limited only to models with log link. A major advantage of the poisreg family is that it allows a straighforward use of the identity link, too. With this link the response variable is the same, but the parameters to be directly estimated are now the rates itself and their differences, not the log-rates and their differences as with the log link.

Fit a Poisson model with identity link to our simple data, and use ci.lin() to produce the estimate and the confidence interval for the hazard rate from this model:

mid <- glm(cbind(D, Y) ~ 1, family = poisreg(link = "identity"))
ci.lin(mid)
ci.lin(mid)[, c(1, 5, 6)]

How is the coefficient of this model interpreted? Verify that you actually get the same rate estimate and CI as in section 1.6.1, item 1.

6.4 Poisson model assuming the same rate for several periods

Now, suppose the events and person years are collected over three distinct periods.

Read in the data and compute period-specific rates

Dx <- c(3, 7, 5)
Yx <- c(1.412, 2.783, 1.337)
Px <- 1:3
rates <- Dx / Yx
rates

Using these data, fit the same model with log link as in section 1.6.2, assuming a common single hazard \(\lambda\) for the separate periods. Compare the result from the previous ones

m3 <- glm(cbind(Dx, Yx) ~ 1, family = poisreg(link = log))
ci.exp(m3)

Now test whether the rates are the same in the three periods: Try to fit a model with the period as a factor in the model:

mp <- glm(cbind(Dx, Yx) ~ factor(Px), family = poisreg(link = log))
ci.exp(mp)

Compare the goodness-of-fit of the two models using anova() with the argument test="Chisq":

anova(m3, mp, test = "Chisq")

Compare the test statistic to the deviance of the model mp. – What is the deviance indicating?

6.5 Analysis of rate ratio

We now switch to comparison of two rates \(\lambda_1\) and \(\lambda_0\), i.e. the hazard in an exposed group vs. that in an unexposed one.

Consider first estimation of the hazard ratio or the underlying true rate ratio \(\rho = \lambda_1/\lambda_0\) between the groups. Suppose we have pertinent empirical data (cases and person-times) from both groups, \((D_1,Y_1)\) and \((D_0,Y_0)\). The point estimate of \(\rho\) is the empirical incidence rate ratio \[ \widehat{\rho} = RR = \frac{\widehat\lambda_1}{\widehat\lambda_0} = \frac{D_1/Y_1}{D_0/Y_0} \]

Suppose you have \(15\) events during \(5532\) person-years in an unexposed group and \(28\) events during \(4783\) person-years in an exposed group:

Calculate the incidence rates in the two groups, their ratio, and the CI of the true hazard ratio \(\rho\) by direct application of the above formulae:

D0 <- 15
D1 <- 28
Y0 <- 5.532
Y1 <- 4.783

Now achieve this using a Poisson model. For that we first combine the group-specific numbers into pertinent vectors and specify a factor to represent the contrast between the exposed and the unexposed group

D <- c(D0, D1)
Y <- c(Y0, Y1)
expos <- 0:1
mm <- glm(cbind(D, Y) ~ factor(expos), family = poisreg(link = log))

What do the parameters mean in this model?

You can extract the estimation results for exponentiated parameters in two ways, as before:

ci.exp(mm)
ci.lin(mm, Exp = TRUE)[, 5:7]

6.6 Analysis of rate difference

For the hazard difference \(\delta = \lambda_1 - \lambda_0\), the natural estimator is the incidence rate difference \[ \widehat\delta = \widehat\lambda_1 - \widehat\lambda_0 = D_1/Y_1 - D_0/Y_0 = \mbox{RD} . \] Its variance is just the sum of the variances of the two rates \[ var(RD) = var(\widehat\lambda_1 ) + var( \widehat\lambda_0 ) = D_1/Y_1^2 + D_0/Y_0^2 \]

Use this formula to compute the point estimate of the rate difference \(\lambda\) and a 95% confidence interval for it:

R0 <- D0 / Y0
R1 <- D1 / Y1
RD <- diff(D / Y)
SED <- sqrt(sum(D / Y^2))
c(R1, R0, RD, SED, RD + c(-1, 1) * 1.96 * SED)

Verify that this is the confidence interval you get when you fit an additive model (obtained by identity link) with exposure as a factor:

ma <- glm(cbind(D, Y) ~ factor(expos),
  family = poisreg(link = identity)
)
ci.lin(ma)[, c(1, 5, 6)]

6.7 Binary regression

Explore the factors associated with risk of low birth weight in 500 singleton births in a London Hospital. Indicator (lowbw) for birth weight less than 2500 g. Data available from the Epi package. Factors of interest are maternal hypertension (hyp), mother’s age at birth over 35 years and sex of the baby.

Load the Epi package and the data set and look at its content

library(dplyr)
library(Epi)
data(births)
str(births)

Because all variables are numeric we need first to do a little housekeeping. Two of them are directly converted into factors, and categorical versions are created of two continuous variables by function cut().

births$hyp <- factor(births$hyp, labels = c("normal", "hyper"))
births$sex <- factor(births$sex, labels = c("M", "F"))
births$gest4 <- cut(births$gestwks,
  breaks = c(20, 35, 37, 39, 45), right = FALSE
)
births$maged <- ifelse(births$matage < 35, 0, 1)

Cross tabulate (dplyr) counts of children by hypertension and low birth weight. calculate (mutate) proportions of low birth weigth children by hypertension.

births %>%
  count(hyp, lowbw) %>%
  group_by(hyp) %>% # now required with changes to dplyr::count()
  mutate(prop = prop.table(n))

Estimate relative risk of low birth weight for mothers with hypertension compared to those without using binary regression.

m <- glm(lowbw ~ hyp, family = binomial(link = log), data = births)
ci.exp(m)

Adjust relative risk of low birth and hypertension with the sex of children

m <- glm(lowbw ~ sex + hyp, family = binomial(link = log), data = births)
ci.exp(m)

Adjust relative risk of low birth and hypertension with the sex of children and mother beeing over 35 years.

m <- glm(lowbw ~ maged + sex + hyp, family = binomial(link = log), data = births)
ci.exp(m)

6.8 Optional/Homework: Hand calculations and calculations using matrix tools

NB. This subsection requires some familiarity with matrix algebra. Do this only after you have done the other exercises of this session.

First some basic hand calculations.

Suppose \(15\) outcome events are observed during \(5532\) person-years in a given study cohort. Let’s use R as a simple desk calculator to estimate the underlying hazard rate \(\lambda\) (in 1000 person-years; therefore 5.532) and to get the first version of an approximate confidence interval:

library(Epi)
options(digits = 4) #  to cut down decimal points in the output

D <- 15
Y <- 5.532 # thousands of years!
rate <- D / Y
SE.rate <- rate / sqrt(D)
c(rate, SE.rate, rate + c(-1.96, 1.96) * SE.rate)

Compute now the approximate confidence interval using the method based on log-transformation and compare the result with that in the previous item.

SE.logr <- 1 / sqrt(D)
EF <- exp(1.96 * SE.logr)
c(log(rate), SE.logr)
c(rate, EF, rate / EF, rate * EF)

Calculate the incidence rates in the two groups, their ratio, and the
CI of the true hazard ratio \(\rho\) by direct application of the above formulae:

D0 <- 15
D1 <- 28
Y0 <- 5.532
Y1 <- 4.783
R1 <- D1 / Y1
R0 <- D0 / Y0
RR <- R1 / R0
SE.lrr <- sqrt(1 / D0 + 1 / D1)
EF <- exp(1.96 * SE.lrr)
c(R1, R0, RR, RR / EF, RR * EF)

Explore the function ci.mat(), which lets you use matrix multiplication (operator '%*%' in R) to produce a confidence interval from an estimate and its standard error (or CIs from whole columns of estimates and SEs):

ci.mat
ci.mat()

As you see, this function returns a \(2\times 3\) matrix (2 rows, 3 columns) containing familiar numbers.

When you combine the single rate and its standard error into a row vector of length 2, i.e. a \(1\times 2\) matrix, and multiply this by the \(2\times 3\) matrix above, the computation returns a \(1\times 3\) matrix containing the point estimate and the confidence limit.

Apply this method to the single rate calculations in 1.6.1, first creating the \(1 \times 2\) matrix and then performing the matrix multiplication.

rateandSE <- c(rate, SE.rate)
rateandSE
rateandSE %*% ci.mat()

When the confidence interval is based on the log-rate and its standard error, the result is obtained by appropriate application of the exp-function on the pertinent matrix product

lograndSE <- c(log(rate), SE.logr)
lograndSE
exp(lograndSE %*% ci.mat())

For computing the rate ratio and its CI as in 1.6.5, matrix multiplication with ci.mat() should give the same result as there:

exp(c(log(RR), SE.lrr) %*% ci.mat())

The main argument in function ci.mat() is alpha, which sets the confidence level: \(1 - \alpha\). The default value is alpha = 0.05, corresponding to the level \(1 - 0.05\) = 95%. If you wish to get the confidence interval for the rate ratio at the 90% level (= \(1-0.1\)), for instance, you may proceed as follows:

ci.mat(alpha = 0.1)
exp(c(log(RR), SE.lrr) %*% ci.mat(alpha = 0.1))

Now achieve this using a Poisson model. For that we first combine the group-specific numbers into pertinent vectors and specify a factor to represent the contrast between the exposed and the unexposed group

D <- c(D0, D1)
Y <- c(Y0, Y1)
expos <- 0:1

Look again to the model used to analyse the rate ratio in. Often one would like to get simultaneously both the rates and the ratio between them. This can be achieved in one go using the contrast matrix argument ctr.mat to ci.lin() or ci.exp(). Try:

CM <- rbind(c(1, 0), c(1, 1), c(0, 1))
rownames(CM) <- c("rate 0", "rate 1", "RR 1 vs. 0")
CM
mm <- glm(D ~ factor(expos),
  family = poisson(link = log), offset = log(Y)
)
ci.exp(mm, ctr.mat = CM)

Use the same machinery to the additive model to get the rates and the rate-difference in one go. Note that the annotation of the resulting estimates are via the column-names of the contrast matrix.

rownames(CM) <- c("rate 0", "rate 1", "RD 1 vs. 0")
ma <- glm(cbind(D, Y) ~ factor(expos),
  family = poisreg(link = identity)
)
ci.lin(ma, ctr.mat = CM)[, c(1, 5, 6)]