Epi/vignettes/flup.tex

%\VignetteIndexEntry{Follow-up data with the Lexis functions in Epi}

\documentclass[a4paper,dvipsnames,twoside,12pt]{report}
\newcommand{\Title}{Follow-up data with the\\ \texttt{Lexis} functions in
  \texttt{Epi}}
\newcommand{\Tit}{Follow-up with \texttt{Lexis}}
\newcommand{\Version}{Version 6}
\newcommand{\Dates}{August 2019}
\newcommand{\Where}{SDCC}
\newcommand{\Homepage}{\url{http://bendixcarstensen.com/} }
\newcommand{\Faculty}{\begin{tabular}{rl}
Bendix Carstensen
  & Steno Diabetes Center Copenhagen, Gentofte, Denmark\\
  & {\small \& Department of Biostatistics,
               University of Copenhagen} \\
  & \texttt{b@bxc.dk} \\
  & \url{http://BendixCarstensen.com} \\[1em]
                      \end{tabular}}

\input{topreport}
\renewcommand{\rwpre}{./flup}

\chapter*{Introduction}
\addcontentsline{toc}{chapter}{Introduction}

This is an introduction to the \texttt{Lexis} machinery in the
\texttt{Epi} package. The machinery is intended for representation and
manipulation of follow-up data (event history data) from studies where
exact dates of events are known. It accommodates follow-up through
multiple states and on multiple time scales.

This vignette uses an example from the \texttt{Epi} package to
illustrate the set-up of a simple \texttt{Lexis} object (a data frame
of follow-up intervals), as well as the subdivision of follow-up
intervals needed for multistate representation and analysis of
transition rates.

The first chapter is exclusively on manipulation of the follow-up
representation, but it points to the subsequent chapter where analysis
is based on a \texttt{Lexis} representation with very small follow-up
intervals.

Chapter 2 uses analysis of mortality rates among Danish diabetes
patients (available in the \texttt{Epi} package) currently on insulin
treatment or not to illustrate the use of the the \texttt{Lexis}
machinery.

I owe much thanks to my colleague Lars Jorge Diaz for careful reading
and many constructive suggestions.

\section{History}

The \texttt{Lexis} machinery in the \texttt{Epi} package was first
conceived by Martyn Plummer\cite{Plummer.2011,Carstensen.2011a}, and
since its first appearance in the \texttt{Epi} package in 2008 it has
been expanded with a number of utilities. An overview of these can be
found in the last chapter of this note: ``\texttt{Lexis} functions''.

\chapter{Representation of follow-up data in the \texttt{Epi} package}

In the \texttt{Epi}-package, follow-up data is represented by adding
some extra variables to a data frame. Such a data frame is called a
\texttt{Lexis} object. The tools for handling follow-up data then use
the structure of this for special plots, tabulations and modeling.

Follow-up data basically consists of a time of entry, a time of exit
and an indication of the status at exit (normally either ``alive'' or
``dead'') for each person. Implicitly is also assumed a status
\emph{during} the follow-up (usually ``alive'').

\begin{figure}[htbp]
  \centering
\setlength{\unitlength}{1pt}
\begin{picture}(210,70)(0,75)
%\scriptsize
\thicklines
 \put(  0,80){\makebox(0,0)[r]{Age-scale}}
 \put( 50,80){\line(1,0){150}}
 \put( 50,80){\line(0,1){5}}
 \put(100,80){\line(0,1){5}}
 \put(150,80){\line(0,1){5}}
 \put(200,80){\line(0,1){5}}
 \put( 50,77){\makebox(0,0)[t]{35}}
 \put(100,77){\makebox(0,0)[t]{40}}
 \put(150,77){\makebox(0,0)[t]{45}}
 \put(200,77){\makebox(0,0)[t]{50}}

 \put(  0,115){\makebox(0,0)[r]{Follow-up}}

 \put( 80,105){\makebox(0,0)[r]{\small Two}}
 \put( 90,105){\line(1,0){87}}
 \put( 90,100){\line(0,1){10}}
 \put(100,100){\line(0,1){10}}
 \put(150,100){\line(0,1){10}}
 \put(180,105){\circle{6}}
 \put( 95,110){\makebox(0,0)[b]{1}}
 \put(125,110){\makebox(0,0)[b]{5}}
 \put(165,110){\makebox(0,0)[b]{3}}

 \put( 50,130){\makebox(0,0)[r]{\small One}}
 \put( 60,130){\line(1,0){70}}
 \put( 60,125){\line(0,1){10}}
 \put(100,125){\line(0,1){10}}
 \put(130,130){\circle*{6}}
 \put( 80,135){\makebox(0,0)[b]{4}}
 \put(115,135){\makebox(0,0)[b]{3}}
\end{picture}
  \caption{\it Follow-up of two persons}
  \label{fig:fu2}
\end{figure}

\section{Time scales}

A time scale is a variable that varies deterministically \emph{within}
each person during follow-up, \textit{e.g.}:
\begin{itemize}
  \item Age
  \item Calendar time
  \item Time since start of treatment
  \item Time since relapse
\end{itemize}
All time scales advance at the same pace, so the time followed is the
same on all time scales. Therefore, it will suffice to use only the
entry point on each of the time scales, for example:
\begin{itemize}
  \item Age at entry
  \item Date of entry
  \item Time at treatment (\emph{at} treatment this is 0)
  \item Time at relapse (\emph{at} relapse this is 0)
\end{itemize}
For illustration we need to load the \texttt{Epi} package:
\begin{Schunk}
\begin{Sinput}
> library(Epi)
> print( sessionInfo(), l=F )
\end{Sinput}
\begin{Soutput}
R version 3.6.3 (2020-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.6 LTS

Matrix products: default
BLAS:   /usr/lib/openblas-base/libopenblas.so.0
LAPACK: /usr/lib/lapack/liblapack.so.3.0

attached base packages:
[1] utils     datasets  graphics  grDevices stats     methods   base

other attached packages:
[1] Epi_2.43

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.3        magrittr_1.5      splines_3.6.3     MASS_7.3-53
 [5] tidyselect_0.2.5  lattice_0.20-41   R6_2.4.1          rlang_0.4.4
 [9] plyr_1.8.5        dplyr_0.8.4       cmprsk_2.2-7      tools_3.6.3
[13] parallel_3.6.3    grid_3.6.3        data.table_1.12.0 nlme_3.1-149
[17] mgcv_1.8-33       survival_3.2-7    assertthat_0.2.1  tibble_2.1.3
[21] etm_1.0.4         numDeriv_2016.8-1 crayon_1.3.4      Matrix_1.2-18
[25] purrr_0.3.0       glue_1.3.1        compiler_3.6.3    pillar_1.4.3
[29] zoo_1.8-4         pkgconfig_2.0.3
\end{Soutput}
\end{Schunk}
In the \texttt{Epi} package, follow-up in a cohort is represented in a
\texttt{Lexis} object. As mentioned, a \texttt{Lexis} object is a data
frame with some extra structure representing the follow-up. For the
\texttt{DMlate} data --- follow-up of diabetes patients in Denmark
recording date of birth, date of diabetes, date of insulin use, date
of first oral drug use and date of death --- we can construct a
\texttt{Lexis} object by:
\begin{Schunk}
\begin{Sinput}
> data( DMlate )
> head( DMlate )
\end{Sinput}
\begin{Soutput}
       sex    dobth     dodm    dodth    dooad doins      dox
50185    F 1940.256 1998.917       NA       NA    NA 2009.997
307563   M 1939.218 2003.309       NA 2007.446    NA 2009.997
294104   F 1918.301 2004.552       NA       NA    NA 2009.997
336439   F 1965.225 2009.261       NA       NA    NA 2009.997
245651   M 1932.877 2008.653       NA       NA    NA 2009.997
216824   F 1927.870 2007.886 2009.923       NA    NA 2009.923
\end{Soutput}
\begin{Sinput}
> dmL <- Lexis( entry = list( per=dodm,
+                             age=dodm-dobth,
+                             tfD=0 ),
+                exit = list( per=dox ),
+         exit.status = factor( !is.na(dodth), labels=c("DM","Dead") ),
+                data = DMlate )
\end{Sinput}
\begin{Soutput}
NOTE: entry.status has been set to "DM" for all.
NOTE: Dropping  4  rows with duration of follow up < tol
\end{Soutput}
\begin{Sinput}
> timeScales(dmL)
\end{Sinput}
\begin{Soutput}
[1] "per" "age" "tfD"
\end{Soutput}
\end{Schunk}
(The excluded persons are persons with date of diabetes equal to date
of death.)

The \texttt{entry} argument is a \emph{named} list with the entry
points on each of the time scales we want to use. It defines the names
of the time scales and the entry points of the follow-up of each
person. The \texttt{exit} argument gives the exit time on \emph{one}
of the time scales, so the name of the element in this list must match
one of the names of the \texttt{entry} list. This is sufficient,
because the follow-up time on all time scales is the same, in this
case $\mathtt{dox}$-$\mathtt{dodm}$.

The \texttt{exit.status} is a categorical variable (a \emph{factor})
that indicates the exit status --- in this case whether the person
(still) is in state \texttt{DM} or exits to \texttt{Dead} at the end
of follow-up. In principle we should also indicate the
\texttt{entry.status}, but the default is to assume that all persons
enter in the \emph{first} of the mentioned \texttt{exit.state}s ---
in this case \texttt{DM}, because $\mathtt{FALSE}<\mathtt{TRUE}$.

Now take a look at the result:
\begin{Schunk}
\begin{Sinput}
> str( dmL )
\end{Sinput}
\begin{Soutput}
Classes ‘Lexis’ and 'data.frame':	9996 obs. of  14 variables:
 $ per    : num  1999 2003 2005 2009 2009 ...
 $ age    : num  58.7 64.1 86.3 44 75.8 ...
 $ tfD    : num  0 0 0 0 0 0 0 0 0 0 ...
 $ lex.dur: num  11.08 6.689 5.446 0.736 1.344 ...
 $ lex.Cst: Factor w/ 2 levels "DM","Dead": 1 1 1 1 1 1 1 1 1 1 ...
 $ lex.Xst: Factor w/ 2 levels "DM","Dead": 1 1 1 1 1 2 1 1 2 1 ...
 $ lex.id : int  1 2 3 4 5 6 7 8 9 10 ...
 $ sex    : Factor w/ 2 levels "M","F": 2 1 2 2 1 2 1 1 2 1 ...
 $ dobth  : num  1940 1939 1918 1965 1933 ...
 $ dodm   : num  1999 2003 2005 2009 2009 ...
 $ dodth  : num  NA NA NA NA NA ...
 $ dooad  : num  NA 2007 NA NA NA ...
 $ doins  : num  NA NA NA NA NA NA NA NA NA NA ...
 $ dox    : num  2010 2010 2010 2010 2010 ...
 - attr(*, "time.scales")= chr  "per" "age" "tfD"
 - attr(*, "time.since")= chr  "" "" ""
 - attr(*, "breaks")=List of 3
  ..$ per: NULL
  ..$ age: NULL
  ..$ tfD: NULL
\end{Soutput}
\begin{Sinput}
> head( dmL )[,1:10]
\end{Sinput}
\begin{Soutput}
            per      age tfD    lex.dur lex.Cst lex.Xst lex.id sex    dobth     dodm
50185  1998.917 58.66119   0 11.0800821      DM      DM      1   F 1940.256 1998.917
307563 2003.309 64.09035   0  6.6885695      DM      DM      2   M 1939.218 2003.309
294104 2004.552 86.25051   0  5.4455852      DM      DM      3   F 1918.301 2004.552
336439 2009.261 44.03559   0  0.7364819      DM      DM      4   F 1965.225 2009.261
245651 2008.653 75.77550   0  1.3442847      DM      DM      5   M 1932.877 2008.653
216824 2007.886 80.01643   0  2.0369610      DM    Dead      6   F 1927.870 2007.886
\end{Soutput}
\end{Schunk}
The \texttt{Lexis} object \texttt{dmL} has a variable for each
time scale which is the entry point on this time scale. The follow-up
time is in the variable \texttt{lex.dur} (\texttt{dur}ation).
Note that the exit status is in the variable \texttt{lex.Xst}
(e\texttt{X}it \texttt{st}ate. The variable \texttt{lex.Cst} is the
state where the follow-up takes place (\texttt{C}urrent
\texttt{st}ate), in this case \texttt{DM} (alive with diabetes) for
all persons. This implies that \emph{censored} observations are
characterized by having $\mathtt{lex.Cst}=\mathtt{lex.Xst}$.

There is a \texttt{summary} function for \texttt{Lexis} objects that
lists the number of transitions and records as well as the total amount
of follow-up time; it also (optionally) prints information about the names of the
variables that constitute the time scales:
\begin{Schunk}
\begin{Sinput}
> summary.Lexis( dmL, timeScales=TRUE )
\end{Sinput}
\begin{Soutput}
Transitions:
     To
From   DM Dead  Records:  Events: Risk time:  Persons:
  DM 7497 2499      9996     2499   54273.27      9996

Timescales:
per age tfD
 ""  ""  ""
\end{Soutput}
\end{Schunk}
It is possible to get a visualization of the follow-up along the
time scales chosen by using the \texttt{plot} method for \texttt{Lexis}
objects. \texttt{dmL} is an object of \emph{class} \texttt{Lexis}, so
using the function \texttt{plot()} on it means that \R\ will look for
the function \texttt{plot.Lexis} and use this function.
\begin{Schunk}
\begin{Sinput}
> plot( dmL )
\end{Sinput}
\end{Schunk}
The function allows quite a bit of control over the output, and a
\texttt{points.Lexis} function allows plotting of the endpoints of
follow-up:
\begin{Schunk}
\begin{Sinput}
> par( mar=c(3,3,1,1), mgp=c(3,1,0)/1.6 )
> plot( dmL, 1:2, lwd=1, col=c("blue","red")[dmL$sex],
+       grid=TRUE, lty.grid=1, col.grid=gray(0.7),
+       xlim=1960+c(0,60), xaxs="i",
+       ylim=  40+c(0,60), yaxs="i", las=1 )
> points( dmL, 1:2, pch=c(NA,3)[dmL$lex.Xst],
+         col="lightgray", lwd=3, cex=0.3 )
> points( dmL, 1:2, pch=c(NA,3)[dmL$lex.Xst],
+         col=c("blue","red")[dmL$sex], lwd=1, cex=0.3 )
> box(bty='o')
\end{Sinput}
\end{Schunk}
In the above code you will note that the values of the arguments
\texttt{col} and \texttt{pch} are indexed by factors, using the
convention in \R\ that the index is taken as \emph{number of the level}
of the supplied factor. Thus \texttt{c("blue","red")[dmL\$sex]} is
\texttt{"blue"} when \texttt{sex} is \texttt{M} (the first level).

The results of these two plotting commands are in figure
\ref{fig:Lexis-diagram}, p. \pageref{fig:Lexis-diagram}.
\begin{figure}[tb]
\centering
\includegraphics[width=0.35\textwidth]{flup-dmL1}
\includegraphics[width=0.63\textwidth]{flup-dmL2}
\caption{\it Lexis diagram of the \textrm{\tt DMlate} dataset; left
  panel is the default version, right panel: plot with some bells and
  whistles. The red lines are for women, blue for men, crosses
  indicate deaths.}
\label{fig:Lexis-diagram}
\end{figure}

\section{Splitting the follow-up time along a time scale}

In next chapter we shall conduct statistical analysis of mortality
rates, and a prerequisite for parametric analysis of rates is that
follow-up time is subdivided in smaller intervals, where we can
reasonably assume that rates are constant.

The follow-up time in a cohort can be subdivided (``split'') along a
time scale, for example current age. This is achieved by the
\texttt{splitLexis} (note that it is \emph{not} called
\texttt{split.Lexis}). This requires that the time scale and the
breakpoints on this time scale are supplied. Try:
\begin{Schunk}
\begin{Sinput}
> dmS1 <- splitLexis( dmL, "age", breaks=seq(0,100,5) )
> summary( dmL )
\end{Sinput}
\begin{Soutput}
Transitions:
     To
From   DM Dead  Records:  Events: Risk time:  Persons:
  DM 7497 2499      9996     2499   54273.27      9996
\end{Soutput}
\begin{Sinput}
> summary( dmS1 )
\end{Sinput}
\begin{Soutput}
Transitions:
     To
From    DM Dead  Records:  Events: Risk time:  Persons:
  DM 18328 2499     20827     2499   54273.27      9996
\end{Soutput}
\end{Schunk}
We see that the number of persons and events and the amount of
follow-up is the same in the two data sets; only the number of records
differ --- the extra records all have \texttt{lex.Cst}=\texttt{DM} and
\texttt{lex.Xst}=\texttt{DM}.

To see how records are split for each individual, it is useful to list
the results for a few individuals (whom we selected with a view to the
illustrative usefulness):
\begin{Schunk}
\begin{Sinput}
> wh.id <- c(9,27,52,484)
> subset( dmL , lex.id %in% wh.id )[,1:10]
\end{Sinput}
\begin{Soutput}
            per      age tfD   lex.dur lex.Cst lex.Xst lex.id sex    dobth     dodm
430048 1998.956 61.87269   0  9.508556      DM    Dead      9   F 1937.083 1998.956
22671  2000.042 52.71184   0  9.954825      DM      DM     27   M 1947.331 2000.042
338459 1998.249 61.85626   0 11.748118      DM      DM     52   F 1936.393 1998.249
274124 1998.260 62.37919   0 10.929500      DM    Dead    484   F 1935.881 1998.260
\end{Soutput}
\begin{Sinput}
> subset( dmS1, lex.id %in% wh.id )[,1:10]
\end{Sinput}
\begin{Soutput}
     lex.id      per      age      tfD  lex.dur lex.Cst lex.Xst sex    dobth     dodm
14        9 1998.956 61.87269 0.000000 3.127310      DM      DM   F 1937.083 1998.956
15        9 2002.083 65.00000 3.127310 5.000000      DM      DM   F 1937.083 1998.956
16        9 2007.083 70.00000 8.127310 1.381246      DM    Dead   F 1937.083 1998.956
54       27 2000.042 52.71184 0.000000 2.288159      DM      DM   M 1947.331 2000.042
55       27 2002.331 55.00000 2.288159 5.000000      DM      DM   M 1947.331 2000.042
56       27 2007.331 60.00000 7.288159 2.666667      DM      DM   M 1947.331 2000.042
108      52 1998.249 61.85626 0.000000 3.143737      DM      DM   F 1936.393 1998.249
109      52 2001.393 65.00000 3.143737 5.000000      DM      DM   F 1936.393 1998.249
110      52 2006.393 70.00000 8.143737 3.604381      DM      DM   F 1936.393 1998.249
1004    484 1998.260 62.37919 0.000000 2.620808      DM      DM   F 1935.881 1998.260
1005    484 2000.881 65.00000 2.620808 5.000000      DM      DM   F 1935.881 1998.260
1006    484 2005.881 70.00000 7.620808 3.308693      DM    Dead   F 1935.881 1998.260
\end{Soutput}
\end{Schunk}
The resulting object, \texttt{dmS1}, is again a \texttt{Lexis} object,
and the follow-up may be split further along another time scale, for
example diabetes duration, \texttt{tfD}. Subsequently we list the
results for the chosen individuals:
\begin{Schunk}
\begin{Sinput}
> dmS2 <- splitLexis( dmS1, "tfD", breaks=c(0,1,2,5,10,20,30,40) )
> subset( dmS2, lex.id %in% wh.id )[,1:10]
\end{Sinput}
\begin{Soutput}
     lex.id      per      age       tfD   lex.dur lex.Cst lex.Xst sex    dobth     dodm
31        9 1998.956 61.87269  0.000000 1.0000000      DM      DM   F 1937.083 1998.956
32        9 1999.956 62.87269  1.000000 1.0000000      DM      DM   F 1937.083 1998.956
33        9 2000.956 63.87269  2.000000 1.1273101      DM      DM   F 1937.083 1998.956
34        9 2002.083 65.00000  3.127310 1.8726899      DM      DM   F 1937.083 1998.956
35        9 2003.956 66.87269  5.000000 3.1273101      DM      DM   F 1937.083 1998.956
36        9 2007.083 70.00000  8.127310 1.3812457      DM    Dead   F 1937.083 1998.956
111      27 2000.042 52.71184  0.000000 1.0000000      DM      DM   M 1947.331 2000.042
112      27 2001.042 53.71184  1.000000 1.0000000      DM      DM   M 1947.331 2000.042
113      27 2002.042 54.71184  2.000000 0.2881588      DM      DM   M 1947.331 2000.042
114      27 2002.331 55.00000  2.288159 2.7118412      DM      DM   M 1947.331 2000.042
115      27 2005.042 57.71184  5.000000 2.2881588      DM      DM   M 1947.331 2000.042
116      27 2007.331 60.00000  7.288159 2.6666667      DM      DM   M 1947.331 2000.042
229      52 1998.249 61.85626  0.000000 1.0000000      DM      DM   F 1936.393 1998.249
230      52 1999.249 62.85626  1.000000 1.0000000      DM      DM   F 1936.393 1998.249
231      52 2000.249 63.85626  2.000000 1.1437372      DM      DM   F 1936.393 1998.249
232      52 2001.393 65.00000  3.143737 1.8562628      DM      DM   F 1936.393 1998.249
233      52 2003.249 66.85626  5.000000 3.1437372      DM      DM   F 1936.393 1998.249
234      52 2006.393 70.00000  8.143737 1.8562628      DM      DM   F 1936.393 1998.249
235      52 2008.249 71.85626 10.000000 1.7481177      DM      DM   F 1936.393 1998.249
2084    484 1998.260 62.37919  0.000000 1.0000000      DM      DM   F 1935.881 1998.260
2085    484 1999.260 63.37919  1.000000 1.0000000      DM      DM   F 1935.881 1998.260
2086    484 2000.260 64.37919  2.000000 0.6208077      DM      DM   F 1935.881 1998.260
2087    484 2000.881 65.00000  2.620808 2.3791923      DM      DM   F 1935.881 1998.260
2088    484 2003.260 67.37919  5.000000 2.6208077      DM      DM   F 1935.881 1998.260
2089    484 2005.881 70.00000  7.620808 2.3791923      DM      DM   F 1935.881 1998.260
2090    484 2008.260 72.37919 10.000000 0.9295003      DM    Dead   F 1935.881 1998.260
\end{Soutput}
\end{Schunk}
A more efficient (and more intuitive) way of making this double split
is to use the \texttt{splitMulti} function from the \texttt{popEpi}
package:
\begin{Schunk}
\begin{Sinput}
> library( popEpi )
> dmM <- splitMulti( dmL, age = seq(0,100,5),
+                         tfD = c(0,1,2,5,10,20,30,40),
+                    drop = FALSE )
> summary( dmS2 )
\end{Sinput}
\begin{Soutput}
Transitions:
     To
From    DM Dead  Records:  Events: Risk time:  Persons:
  DM 40897 2499     43396     2499   54273.27      9996
\end{Soutput}
\begin{Sinput}
> summary( dmM )
\end{Sinput}
\begin{Soutput}
Transitions:
     To
From    DM Dead  Records:  Events: Risk time:  Persons:
  DM 40897 2499     43396     2499   54273.27      9996
\end{Soutput}
\end{Schunk}
Note we used the argument \texttt{drop=FALSE} which will retain
follow-up also outside the window defined by the breaks. Otherwise the
default for \texttt{splitMulti} would be to drop follow-up outside
\texttt{age} [0,100] and \texttt{tfD} [0,40]. This clipping behaviour
is not available in \texttt{splitLexis}, nevertheless this may be
exactly what we want in some situations.

So we see that the two ways of splitting data yields the same amount of
follow-up, but the results are not identical:
\begin{Schunk}
\begin{Sinput}
> identical( dmS2, dmM )
\end{Sinput}
\begin{Soutput}
[1] FALSE
\end{Soutput}
\begin{Sinput}
> class( dmS2 )
\end{Sinput}
\begin{Soutput}
[1] "Lexis"      "data.frame"
\end{Soutput}
\begin{Sinput}
> class( dmM )
\end{Sinput}
\begin{Soutput}
[1] "Lexis"      "data.table" "data.frame"
\end{Soutput}
\end{Schunk}
As we see, this is because the \texttt{dmM} object also is a
\texttt{data.table} object; the \texttt{splitMulti} uses the
\texttt{data.table} machinery which makes the splitting substantially
faster --- this is of particular interest if you operate on large data
sets ($>100,000$ records).

Thus the recommended way of splitting follow-up time is by
\texttt{splitMulti}. But you should be aware that the result is a
\texttt{data.table} object, which in some circumstances behaves
slightly different from \texttt{data.frame}s. See the manual for
\texttt{data.table}.

\section{Cutting follow up time at dates of intermediate events}

If we have a recording of the date of a specific event as for example
recovery or relapse, we may classify follow-up time as being before or
after this intermediate event, but it requires that follow-up records
that straddle the event be cut in two and placed in separate records,
one representing follow-up \emph{before} the intermediate event, and another
representing follow-up \emph{after} the intermediate event. This is
achieved with the function \texttt{cutLexis}, which takes three
arguments: the time point of the intermediate event, the time scale
that this point refers to, and the value of the (new) state following
the date. Optionally, we may also define a new time scale with the
argument \texttt{new.scale=}.

We are interested in the time before and after inception of insulin
use, which occurs at the date \texttt{doins}:
\begin{Schunk}
\begin{Sinput}
> whc <- c(names(dmL)[1:7],"dodm","doins") # WHich Columns do we want to see?
> subset( dmL, lex.id %in% wh.id )[,whc]
\end{Sinput}
\begin{Soutput}
            per      age tfD   lex.dur lex.Cst lex.Xst lex.id     dodm    doins
430048 1998.956 61.87269   0  9.508556      DM    Dead      9 1998.956       NA
22671  2000.042 52.71184   0  9.954825      DM      DM     27 2000.042       NA
338459 1998.249 61.85626   0 11.748118      DM      DM     52 1998.249 2004.804
274124 1998.260 62.37919   0 10.929500      DM    Dead    484 1998.260 2003.960
\end{Soutput}
\begin{Sinput}
> dmC <- cutLexis( data = dmL,
+                   cut = dmL$doins,
+             timescale = "per",
+             new.state = "Ins",
+             new.scale = "tfI",
+      precursor.states = "DM" )
> whc <- c(names(dmL)[1:8],"doins") # WHich Columns do we want to see?
> subset( dmC, lex.id %in% wh.id )[,whc]
\end{Sinput}
\begin{Soutput}
           per      age      tfD  lex.dur lex.Cst lex.Xst lex.id sex    doins
9     1998.956 61.87269 0.000000 9.508556      DM    Dead      9   F       NA
27    2000.042 52.71184 0.000000 9.954825      DM      DM     27   M       NA
52    1998.249 61.85626 0.000000 6.554415      DM     Ins     52   F 2004.804
10048 2004.804 68.41068 6.554415 5.193703     Ins     Ins     52   F 2004.804
484   1998.260 62.37919 0.000000 5.700205      DM     Ins    484   F 2003.960
10480 2003.960 68.07940 5.700205 5.229295     Ins    Dead    484   F 2003.960
\end{Soutput}
\end{Schunk}
(The \texttt{precursor.states=} argument is explained below).

Note that the process of cutting time is simplified by having all
types of events referred to the calendar time scale. This is a
generally applicable advice in handling follow-up data: Get all event
times as \emph{dates}, location of events and follow-up on other time
scales can then easily be derived from this.

Note
that individual 52 has had his follow-up cut at 6.55 years from
diabetes diagnosis and individual 484 at 5.70 years from diabetes
diagnosis. This dataset could then be split along the time scales as we
did before with \texttt{dmL}.

The result of this can also be achieved by cutting the split dataset
\texttt{dmS2} instead of \texttt{dmL}:
\begin{Schunk}
\begin{Sinput}
> dmS2C <- cutLexis( data = dmS2,
+                     cut = dmS2$doins,
+               timescale = "per",
+               new.state = "Ins",
+               new.scale = "tfI",
+        precursor.states = "DM" )
> subset( dmS2C, lex.id %in% wh.id )[,whc]
\end{Sinput}
\begin{Soutput}
           per      age       tfD   lex.dur lex.Cst lex.Xst lex.id sex    doins
31    1998.956 61.87269  0.000000 1.0000000      DM      DM      9   F       NA
32    1999.956 62.87269  1.000000 1.0000000      DM      DM      9   F       NA
33    2000.956 63.87269  2.000000 1.1273101      DM      DM      9   F       NA
34    2002.083 65.00000  3.127310 1.8726899      DM      DM      9   F       NA
35    2003.956 66.87269  5.000000 3.1273101      DM      DM      9   F       NA
36    2007.083 70.00000  8.127310 1.3812457      DM    Dead      9   F       NA
111   2000.042 52.71184  0.000000 1.0000000      DM      DM     27   M       NA
112   2001.042 53.71184  1.000000 1.0000000      DM      DM     27   M       NA
113   2002.042 54.71184  2.000000 0.2881588      DM      DM     27   M       NA
114   2002.331 55.00000  2.288159 2.7118412      DM      DM     27   M       NA
115   2005.042 57.71184  5.000000 2.2881588      DM      DM     27   M       NA
116   2007.331 60.00000  7.288159 2.6666667      DM      DM     27   M       NA
229   1998.249 61.85626  0.000000 1.0000000      DM      DM     52   F 2004.804
230   1999.249 62.85626  1.000000 1.0000000      DM      DM     52   F 2004.804
231   2000.249 63.85626  2.000000 1.1437372      DM      DM     52   F 2004.804
232   2001.393 65.00000  3.143737 1.8562628      DM      DM     52   F 2004.804
233   2003.249 66.85626  5.000000 1.5544148      DM     Ins     52   F 2004.804
43629 2004.804 68.41068  6.554415 1.5893224     Ins     Ins     52   F 2004.804
43630 2006.393 70.00000  8.143737 1.8562628     Ins     Ins     52   F 2004.804
43631 2008.249 71.85626 10.000000 1.7481177     Ins     Ins     52   F 2004.804
2084  1998.260 62.37919  0.000000 1.0000000      DM      DM    484   F 2003.960
2085  1999.260 63.37919  1.000000 1.0000000      DM      DM    484   F 2003.960
2086  2000.260 64.37919  2.000000 0.6208077      DM      DM    484   F 2003.960
2087  2000.881 65.00000  2.620808 2.3791923      DM      DM    484   F 2003.960
2088  2003.260 67.37919  5.000000 0.7002053      DM     Ins    484   F 2003.960
45484 2003.960 68.07940  5.700205 1.9206023     Ins     Ins    484   F 2003.960
45485 2005.881 70.00000  7.620808 2.3791923     Ins     Ins    484   F 2003.960
45486 2008.260 72.37919 10.000000 0.9295003     Ins    Dead    484   F 2003.960
\end{Soutput}
\end{Schunk}
Thus it does not matter in which order we use \texttt{splitLexis} and
\texttt{cutLexis}. Mathematicians would say that \texttt{splitLexis}
and \texttt{cutLexis} are commutative.

Note in \texttt{lex.id}=484, that follow-up subsequent to the event is
classified as being in state \texttt{Ins}, but that the final
transition to state \texttt{Dead} is preserved. This is the point of
the \texttt{precursor.states=} argument. It names the states (in this
case \texttt{DM}) that will be over-written by \texttt{new.state} (in
this case \texttt{Ins}), while the state \texttt{Dead} should not be
updated even if it is after the time where the persons moves to state
\texttt{Ins}. In other words, only state \texttt{DM} is a precursor to
state \texttt{Ins}, state \texttt{Dead} is always subsequent to state
\texttt{Ins}.

Note that we defined a new time scale, \texttt{tfI}, using the argument
\texttt{new.scale=tfI}. This has a special status relative to the other
three time scales, it is defined as time since entry into a state,
namely \texttt{Ins}, this is noted in the time scale part of the
summary of \texttt{Lexis} object --- the information sits in the
attribute \texttt{time.since} of the \texttt{Lexis} object, which can
be accessed by the function \texttt{timeSince()} or through the \texttt{summary()}:
\begin{Schunk}
\begin{Sinput}
> summary( dmS2C, timeScales=TRUE )
\end{Sinput}
\begin{Soutput}
Transitions:
     To
From     DM  Ins Dead  Records:  Events: Risk time:  Persons:
  DM  35135 1694 2048     38877     3742   45885.49      9899
  Ins     0 5762  451      6213      451    8387.77      1791
  Sum 35135 7456 2499     45090     4193   54273.27      9996

Timescales:
  per   age   tfD   tfI
   ""    ""    "" "Ins"
\end{Soutput}
\end{Schunk}
Finally we can get a quick overview of the states and transitions by
using \texttt{boxes} --- \texttt{scale.R} scales transition rates to
rates per 1000 PY:
\begin{Schunk}
\begin{Sinput}
> boxes( dmC, boxpos=TRUE, scale.R=1000, show.BE=TRUE )
\end{Sinput}
\end{Schunk}
\insfig{box1}{0.8}{States, person years, transitions and rates in the
  cut dataset. The numbers \emph{in} the boxes are person-years and
  the number of persons \texttt{B}eginning, resp. \texttt{E}nding
  their follow-up in each state (triggered by \textrm{\tt
    show.BE=TRUE}). The numbers at the arrows are the number of
  transitions and transition rates per 1000 (triggered by \textrm{\tt
  scale.R=1000}).}

\chapter{Modeling rates from \texttt{Lexis} objects}

\section{Covariates}

In the dataset \texttt{dmS2C} there are three types of covariates that
can be used to describe mortality rates:
\begin{enumerate}
\item time-dependent covariates
\item time scales
\item fixed covariates
\end{enumerate}

There is only one time-dependent covariate here, namely
\texttt{lex.Cst}, the current state of the person's follow up; it
takes the values \texttt{DM} and \texttt{Ins} according to whether the
person has ever purchased insulin at a given time of follow-up.

The time-scales are obvious candidates for explanatory variables for
the rates, notably age and time from diagnosis (duration of diabetes)
and insulin.

\subsection{Time scales as covariates}

If we want to model the effect of the time scale variables on
occurrence rates, we will for each interval use either the value of
the left endpoint in each interval or the middle. There is a function
\texttt{timeBand} which returns either of these:
\begin{Schunk}
\begin{Sinput}
> timeBand( dmS2C, "age", "middle" )[1:10]
\end{Sinput}
\begin{Soutput}
 [1] 57.5 57.5 62.5 62.5 62.5 67.5 67.5 62.5 67.5 67.5
\end{Soutput}
\begin{Sinput}
> # For nice printing and column labelling we use the data.frame() function:
> data.frame( dmS2C[,c("per","age","tfD","lex.dur")],
+             mid.age=timeBand( dmS2C, "age", "middle" ),
+               mid.t=timeBand( dmS2C, "tfD", "middle" ),
+              left.t=timeBand( dmS2C, "tfD", "left"   ),
+             right.t=timeBand( dmS2C, "tfD", "right"  ),
+              fact.t=timeBand( dmS2C, "tfD", "factor" ) )[1:15,]
\end{Sinput}
\begin{Soutput}
        per      age        tfD    lex.dur mid.age mid.t left.t right.t  fact.t
1  1998.917 58.66119  0.0000000 1.00000000    57.5   0.5      0       1   (0,1]
2  1999.917 59.66119  1.0000000 0.33880903    57.5   1.5      1       2   (1,2]
3  2000.256 60.00000  1.3388090 0.66119097    62.5   1.5      1       2   (1,2]
4  2000.917 60.66119  2.0000000 3.00000000    62.5   3.5      2       5   (2,5]
5  2003.917 63.66119  5.0000000 1.33880903    62.5   7.5      5      10  (5,10]
6  2005.256 65.00000  6.3388090 3.66119097    67.5   7.5      5      10  (5,10]
7  2008.917 68.66119 10.0000000 1.08008214    67.5  15.0     10      20 (10,20]
8  2003.309 64.09035  0.0000000 0.90965092    62.5   0.5      0       1   (0,1]
9  2004.218 65.00000  0.9096509 0.09034908    67.5   0.5      0       1   (0,1]
10 2004.309 65.09035  1.0000000 1.00000000    67.5   1.5      1       2   (1,2]
11 2005.309 66.09035  2.0000000 3.00000000    67.5   3.5      2       5   (2,5]
12 2008.309 69.09035  5.0000000 0.90965092    67.5   7.5      5      10  (5,10]
13 2009.218 70.00000  5.9096509 0.77891855    72.5   7.5      5      10  (5,10]
14 2004.552 86.25051  0.0000000 1.00000000    87.5   0.5      0       1   (0,1]
15 2005.552 87.25051  1.0000000 1.00000000    87.5   1.5      1       2   (1,2]
\end{Soutput}
\end{Schunk}
Note that the values of these functions are characteristics of the
intervals defined by \texttt{breaks=}, \emph{not} the midpoints nor
left or right endpoints of the actual follow-up intervals (which would
be \texttt{tfD} and \texttt{tfD+lex.dur}, respectively).

These functions are intended for modeling time scale variables as
factors (categorical variables) in which case the coding must be
independent of the censoring and mortality pattern --- it should only
depend on the chosen grouping of the time scale. Modeling time scales as
\emph{quantitative} should not be based on these codings but directly
on the values of the time-scale variables, notably the left endpoints
of the intervals.

\subsection{Differences between time scales}

Apparently, the only fixed variable is \texttt{sex}, but formally the
dates of birth (\texttt{dobth}), diagnosis (\texttt{dodm}) and first
insulin purchase (\texttt{doins}) are fixed covariates too. They can
be constructed as origins of time scales referred to the calendar time
scale. Likewise, and possibly of greater interest, we can consider
these origins on the age scale, calculated as the difference between
age and another time scale.

These would then be age at birth (hardly relevant since it is the same
for all persons), age at diabetes
diagnosis and age at insulin treatment.

\subsection{Keeping the relation between time scales}

The midpoint (as well as the right interval endpoint) should be used
with caution if the variable age at diagnosis \texttt{dodm-dobth} is
modeled too; the age at diabetes is logically equal to the difference
between current age (\texttt{age}) and time since diabetes diagnosis
(\texttt{tfD}):
\begin{Schunk}
\begin{Sinput}
> summary( (dmS2$age-dmS2$tfD) - (dmS2$dodm-dmS2$dobth) )
\end{Sinput}
\begin{Soutput}
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      0       0       0       0       0       0
\end{Soutput}
\end{Schunk}
This calculation refers to the \emph{start} of each interval --- which
are in the time scale variables in the \texttt{Lexis} object. But when
using the middle of the intervals, this relationship is not preserved:
\begin{Schunk}
\begin{Sinput}
> summary( timeBand( dmS2, "age", "middle" ) -
+          timeBand( dmS2, "tfD", "middle" ) - (dmS2$dodm-dmS2$dobth) )
\end{Sinput}
\begin{Soutput}
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
-7.4870 -2.0862 -0.3765     Inf  1.3641     Inf
\end{Soutput}
\end{Schunk}
If all three variables are to be included in a model, we must make
sure that the \emph{substantial} relationship between the variables be
maintained. One way is to recompute age at diabetes diagnosis from the two
midpoint variables, but more straightforward would be to use the left
endpoint of the intervals, that is the time scale variables in the
\texttt{Lexis} object.

If we dissolve the relationship between the variables \texttt{age},
\texttt{tfD} and age at diagnosis by grouping we may obtain
identifiability of the three separate effects, but it will be at the
price of an arbitrary allocation of a linear trend between them.

For the sake of clarity, consider current age, $a$, duration of
disease, $d$ and age at diagnosis $e$, where
\[
 \text{current age} =
 \text{age at diagnosis} +
 \text{disease duration},
 \quad \text{\ie} \quad a=e+d \quad
 \Leftrightarrow \quad e+d-a=0
\]
If we model the effect of the quantitative variables $a$, $e$ and $d$
on the log-rates by three functions $f$, $g$ and $h$:
$ \log(\lambda)=f(a)+g(d)+h(e) $
then for any $\kappa$:
\begin{align*}
   \log(\lambda) & = f(a)+g(d)+h(e)+\kappa(e+d-a)\\
                 & =
   \big(f(a)-\kappa a \big)+
   \big(g(d)+\kappa d \big)+
   \big(h(e)+\kappa e \big) \\
& = \tilde f(a)+ \tilde g(d)+ \tilde h(e)
\end{align*}
In practical modeling this will turn up as a singular model matrix
with one parameter aliased, corresponding to some arbitrarily chosen
value of $\kappa$ (depending on software conventions for singular
models). This phenomenon is well known from age-period-cohort models.

Thus we see that we can move any slope around between the three terms,
so if we achieve identifiability by using grouping of one of the
variables we will in reality have settled for a particular value of
$\kappa$, without known why we chose just that. The solution is to
resort to predictions which are independent of the particular
parametrization or choose a particular parametrization with explicit constraints.

\section{Modeling of rates}

As mentioned, the purpose of subdividing follow-up data in smaller
intervals is to be able to model effects of time scale variables as
parametric functions. When we split along a time scale we can get
intervals that are as small as we want; if they are sufficiently
small, an assumption of constant rates in each interval becomes
reasonable.

In a model that assumes a constant occurrence rate in each of the
intervals the likelihood contribution from each interval is the same
as the likelihood contribution from a Poisson variate $D$, say, with
mean $\lambda \ell$ where $\lambda$ is the rate and $\ell$ is the
interval length, and where the value of the variate $D$ is 1 or 0
according to whether an event has occurred or not. Moreover, the
likelihood contributions from all follow-up intervals from a single
person are \emph{conditionally} independent (conditional on having
survived till the start of the interval in question). This implies
that the total contribution to the likelihood from a single person is
a product of terms, and hence the same as the likelihood of a number
of independent Poisson terms, one from each interval.

Note that variables are neither Poisson distributed (\eg they can only
ever assume values 0 or 1) nor independent --- it is only the
likelihood for the follow-up data that happens to be the same as the
likelihood from independent Poisson variates. Different models can
have the same likelihood, a model cannot be inferred from the
likelihood.

Parametric modeling of the rates is obtained by using the
\emph{values} of the time scales for each interval as \emph{quantitative}
explanatory variables, using for example splines. And of course also
the values of the fixed covariates and the time-dependent variables
for each interval. Thus the model will be one where the rate is
assumed constant in each (small) interval, but where a parametric form
of the \emph{size} of the rate in each interval is imposed by the
model, using the time scale as a quantitative covariate.

\subsection{Interval length}

In the first chapter we illustrated cutting and splitting by listing
the results for a few individuals across a number of intervals. For
illustrational purposes we used 5-year age bands to avoid excessive
listings, but since the doubling time for mortality on the age scale
is only slightly larger than 5 years, the assumption about constant
rates in each interval would be pretty far fetched if we were to use 5
year intervals.

Thus, for modeling purposes we split the follow-up in 3 month
intervals. When we use intervals of 3 months length it is superfluous
to split along multiple time scales --- the precise location of
tightly spaced splits will be irrelevant from any practical point of
view. \texttt{splitLexis} and \texttt{splitMulti} will allocate the
actual split times for all of the time scale variables, so these can be
used directly in modeling.

So we split the cut dataset in 3 months intervals along the age
scale:
\begin{Schunk}
\begin{Sinput}
> dmCs <- splitMulti( dmC, age = seq(0,110,1/4) )
> summary( dmCs, t=T )
\end{Sinput}
\begin{Soutput}
Transitions:
     To
From      DM   Ins Dead  Records:  Events: Risk time:  Persons:
  DM  189669  1694 2048    193411     3742   45885.49      9899
  Ins      0 34886  451     35337      451    8387.77      1791
  Sum 189669 36580 2499    228748     4193   54273.27      9996

Timescales:
  per   age   tfD   tfI
   ""    ""    "" "Ins"
\end{Soutput}
\end{Schunk}
We see that we now have 228,748 records and 9996 persons, so about 23
records per person. The total risk time is 54,275 years, a bit less
than 3 months on average per record as expected.

\subsection{Practicalities for splines}

In this study we want to look at how mortality depend on age
(\texttt{age}) and time since start of insulin use (\texttt{tfI}). If
we want to use splines in the description we must allocate knots for
anchoring the splines at each of the time scales, either by some
\textit{ad hoc} method or by using some sort of penalized splines as
for example by \texttt{gam}; the latter will not be treated here; it
belongs in the realm of the \texttt{mgcv} package.

Here we shall use the former approach and allocate 5 knots on each of
the time-scales. We allocate knots so that we have the events evenly
distributed between the knots. Since the insulin state starts at 0 for
all individuals we include 0 as the first knot, such that any set of natural
splines with these knots will have the value 0 at 0 on the time
scale.
\begin{Schunk}
\begin{Sinput}
> ( a.kn <- with( subset( dmCs, lex.Xst=="Dead" ),
+                 quantile( age+lex.dur, (1:5-0.5)/5 ) ) )
\end{Sinput}
\begin{Soutput}
     10%      30%      50%      70%      90%
60.29350 71.31937 77.72758 82.72745 89.86393
\end{Soutput}
\begin{Sinput}
> ( i.kn <- c( 0,
+           with( subset( dmCs, lex.Xst=="Dead" & lex.Cst=="Ins" ),
+                 quantile( tfI+lex.dur, (1:4)/5 ) ) ) )
\end{Sinput}
\begin{Soutput}
                20%       40%       60%       80%
0.0000000 0.3093771 1.1307324 2.5489391 4.9117043
\end{Soutput}
\end{Schunk}
In the \texttt{Epi} package there is a convenience wrapper,
\texttt{Ns}, for the \texttt{n}atural \texttt{s}pline generator
\texttt{ns}, that takes the smallest and the largest of a set of
supplied knots to be the boundary knots, so the explicit definition of
the boundary knots becomes superfluous.

Note that it is a feature of the \texttt{Ns} (via the features of
\texttt{ns}) that any generated spline function is 0 at the leftmost
knot.

\subsection{Poisson models}

A model that describes mortality rates as only a function of age would
then be:
\begin{Schunk}
\begin{Sinput}
> ma <- glm( (lex.Xst=="Dead") ~ Ns(age,knots=a.kn),
+             family = poisson,
+             offset = log(lex.dur),
+               data = dmCs )
> summary( ma )
\end{Sinput}
\begin{Soutput}
Call:
glm(formula = (lex.Xst == "Dead") ~ Ns(age, knots = a.kn), family = poisson,
    data = dmCs, offset = log(lex.dur))

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-0.5883  -0.1688  -0.1144  -0.0766   4.5958

Coefficients:
                       Estimate Std. Error z value Pr(>|z|)
(Intercept)            -3.82830    0.03861  -99.16   <2e-16
Ns(age, knots = a.kn)1  1.36254    0.08723   15.62   <2e-16
Ns(age, knots = a.kn)2  1.49446    0.06845   21.83   <2e-16
Ns(age, knots = a.kn)3  2.63557    0.07058   37.34   <2e-16
Ns(age, knots = a.kn)4  1.94173    0.05769   33.66   <2e-16

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 27719  on 228747  degrees of freedom
Residual deviance: 25423  on 228743  degrees of freedom
AIC: 30431

Number of Fisher Scoring iterations: 8
\end{Soutput}
\end{Schunk}
The offset, \texttt{log(lex.dur)} comes from the fact that the
likelihood for the follow-up data during $\ell$ time is the same as
that for independent Poisson variates with mean $\lambda \ell$, and
that the default link function for the Poisson family is the log, so
that we are using a linear model for the log-mean,
$\log(\lambda) + \log(\ell)$.  But when we want a model for the
log-rate ($\log(\lambda)$), the term $\log(\ell)$ must still be
included as a covariate, but with regression coefficient fixed to 1; a
so-called \emph{offset}. This is however a technicality; it just
exploits that the likelihood of a particular Poisson model and that of
the rates model is the same.

In the \texttt{Epi} package is a \texttt{glm} family, \texttt{poisreg}
that has a more intuitive interface, where the response is a 2-column
matrix of events and person-time, respectively. This is in concert
with the fact that the outcome variable in follow-up studies is
bivariate: (event, risk time).
\begin{Schunk}
\begin{Sinput}
> Ma <- glm( cbind(lex.Xst=="Dead",lex.dur) ~ Ns(age,knots=a.kn),
+            family = poisreg, data = dmCs )
> summary( Ma )
\end{Sinput}
\begin{Soutput}
Call:
glm(formula = cbind(lex.Xst == "Dead", lex.dur) ~ Ns(age, knots = a.kn),
    family = poisreg, data = dmCs)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-0.5883  -0.1688  -0.1144  -0.0766   4.5958

Coefficients:
                       Estimate Std. Error z value Pr(>|z|)
(Intercept)            -3.82830    0.03861  -99.15   <2e-16
Ns(age, knots = a.kn)1  1.36254    0.08723   15.62   <2e-16
Ns(age, knots = a.kn)2  1.49446    0.06845   21.83   <2e-16
Ns(age, knots = a.kn)3  2.63557    0.07058   37.34   <2e-16
Ns(age, knots = a.kn)4  1.94173    0.05769   33.66   <2e-16

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 27719  on 228747  degrees of freedom
Residual deviance: 25423  on 228743  degrees of freedom
AIC: 30431

Number of Fisher Scoring iterations: 7
\end{Soutput}
\end{Schunk}
Exploiting the multistate structure in the \texttt{Lexis} object
there is a multistate convenience wrapper for \texttt{glm} with the
\texttt{poisreg} family, that just requires specification of the
transitions in terms of \texttt{from} and \texttt{to}. Although it is
called \texttt{glm.Lexis} it is \emph{not} an S3 method for
\texttt{Lexis} objects:
\begin{Schunk}
\begin{Sinput}
> Xa <- glm.Lexis( dmCs, from="DM", to="Dead",
+                  formula = ~ Ns(age,knots=a.kn) )
\end{Sinput}
\begin{Soutput}
stats::glm Poisson analysis of Lexis object dmCs with log link:
Rates for the transition: DM->Dead
\end{Soutput}
\end{Schunk}
The result is a \texttt{glm} object but with an extra attribute, \texttt{Lexis}:
\begin{Schunk}
\begin{Sinput}
> attr( Xa, "Lexis" )
\end{Sinput}
\begin{Soutput}
$data
[1] "dmCs"

$trans
[1] "DM->Dead"

$formula
~Ns(age, knots = a.kn)
<environment: 0xb8adec8>

$scale
[1] 1
\end{Soutput}
\end{Schunk}
There are similar wrappers for \texttt{gam} and \texttt{coxph} models,
\texttt{gam.Lexis} and \texttt{coxph.Lexis}, but these will not be
elaborated in detail.

The \texttt{from=} and \texttt{to=} can even be omitted, in which case
all possible transitions \emph{into} any of the absorbing states is
modeled:
\begin{Schunk}
\begin{Sinput}
> xa <- glm.Lexis( dmCs, formula = ~ Ns(age,knots=a.kn) )
\end{Sinput}
\begin{Soutput}
stats::glm Poisson analysis of Lexis object dmCs with log link:
Rates for transitions: DM->Dead, Ins->Dead
\end{Soutput}
\end{Schunk}
We can check if the four models fitted are the same:
\begin{Schunk}
\begin{Sinput}
> c( deviance(ma), deviance(Ma), deviance(Xa), deviance(xa) )
\end{Sinput}
\begin{Soutput}
[1] 25422.92 25422.92 20902.31 25422.92
\end{Soutput}
\end{Schunk}
Oops! the model \texttt{Xa} is apparently not the same as the other
three? This is because the explicit specification
\verb|from="DM", to="Dead"|, omits modeling contributions from the
$\mathtt{Ins}\rightarrow\mathtt{Dead}$ transition --- the output
actually said so --- see also figure \ref{fig:box1} on
p. \pageref{fig:box1}. The other three models all use both transitions
--- and assume that the two transition rates are the same, \ie that
start of insulin has no effect on mortality. We shall relax this
assumption later.

The parameters from the model do not have any direct interpretation
\textit{per se}, but we can compute the estimated mortality rates for
a range of ages using \texttt{ci.pred} with a suitably defined
prediction data frame.

Note that if we use the \texttt{poisson} family of models, we must
specify all covariates in the model, including the variable in the
offset, \texttt{lex.dur} (remember that this was a covariate with
coefficient fixed at 1). We set the latter to 1000, because we want the
mortality rates per 1000 person-years. Using the \texttt{poisreg}
family, the prediction will ignore any value of \texttt{lex.dur}
specified in the prediction data frame, the returned rates will be per
unit in which \texttt{lex.dur} is recorded.
\begin{Schunk}
\begin{Sinput}
> nd <- data.frame( age=40:85, lex.dur=1000 )
> pr.0 <- ci.pred( ma, newdata = nd )      # mortality per 100 PY
> pr.a <- ci.pred( Ma, newdata = nd )*1000 # mortality per 100 PY
> summary(pr.0/pr.a)
\end{Sinput}
\begin{Soutput}
    Estimate      2.5%       97.5%
 Min.   :1   Min.   :1   Min.   :1
 1st Qu.:1   1st Qu.:1   1st Qu.:1
 Median :1   Median :1   Median :1
 Mean   :1   Mean   :1   Mean   :1
 3rd Qu.:1   3rd Qu.:1   3rd Qu.:1
 Max.   :1   Max.   :1   Max.   :1
\end{Soutput}
\begin{Sinput}
> matshade( nd$age, pr.a, plot=TRUE,
+           type="l", lty=1,
+           log="y", xlab="Age (years)",
+           ylab="DM mortality per 1000 PY")
\end{Sinput}
\end{Schunk}
\insfig{pr-a}{0.8}{Mortality among Danish diabetes patients by age
  with 95\% CI as shaded area. We see that the rates increase linearly
  on the log-scale, that is exponentially by age.}

\section{Time dependent covariate}

A Poisson model approach to mortality by insulin status, would be to
assume that the rate-ratio between patients on insulin and not on
insulin is a fixed quantity, independent of time since start of insulin,
independent of age. This is commonly termed a proportional hazards
assumption, because the rates (hazards) in the two groups are
proportional along the age (baseline time) scale.
\begin{Schunk}
\begin{Sinput}
> pm <- glm( cbind(lex.Xst=="Dead",lex.dur) ~ Ns(age,knots=a.kn)
+                                           + lex.Cst + sex,
+            family=poisreg, data = dmCs )
> round( ci.exp( pm ), 3 )
\end{Sinput}
\begin{Soutput}
                       exp(Est.)   2.5%  97.5%
(Intercept)                0.022  0.021  0.024
Ns(age, knots = a.kn)1     4.248  3.581  5.040
Ns(age, knots = a.kn)2     5.008  4.376  5.731
Ns(age, knots = a.kn)3    16.832 14.624 19.373
Ns(age, knots = a.kn)4     7.994  7.126  8.968
lex.CstIns                 1.985  1.791  2.200
sexF                       0.668  0.617  0.724
\end{Soutput}
\end{Schunk}
So we see that persons on insulin have about twice the mortality of
persons not on insulin and that women have 2/3 the mortality of men.

\subsection{Time since insulin start}

If we want to test whether the excess mortality depends on the time
since start if insulin treatment, we can add a spline terms in
\texttt{tfI}. But since \texttt{tfI} is a time scale defined as time
since entry into a new state (\texttt{Ins}), the variable \texttt{tfI}
will be missing for those in the \texttt{DM} state, so before modeling
we must set the \texttt{NA}s to 0, which we do with \texttt{tsNA20}
(acronym for \texttt{t}ime\texttt{s}cale \texttt{NA}s to zero):
\begin{Schunk}
\begin{Sinput}
> pm <- glm( cbind(lex.Xst=="Dead",lex.dur) ~ Ns(age,knots=a.kn)
+                                           + Ns(tfI,knots=i.kn)
+                                           + lex.Cst + sex,
+            family=poisreg, data = tsNA20(dmCs) )
\end{Sinput}
\end{Schunk}
As noted before we could do this simpler with \texttt{glm.Lexis}, even
without the \texttt{from=} and \texttt{to=} arguments, because we are
modeling all transitions \emph{into} the absorbing state
(\texttt{Dead}):
\begin{Schunk}
\begin{Sinput}
> Pm <- glm.Lexis( tsNA20(dmCs),
+                  form = ~ Ns(age,knots=a.kn)
+                         + Ns(tfI,knots=i.kn)
+                         + lex.Cst + sex )
\end{Sinput}
\begin{Soutput}
stats::glm Poisson analysis of Lexis object tsNA20(dmCs) with log link:
Rates for transitions: DM->Dead, Ins->Dead
\end{Soutput}
\begin{Sinput}
> c( deviance(Pm), deviance(pm) )
\end{Sinput}
\begin{Soutput}
[1] 25096.33 25096.33
\end{Soutput}
\begin{Sinput}
> identical( model.matrix(Pm), model.matrix(pm) )
\end{Sinput}
\begin{Soutput}
[1] TRUE
\end{Soutput}
\end{Schunk}
The coding of the effect of \texttt{tfI} is so that the value is 0 at
0, so the meaning of the estimate of the effect of \texttt{lex.Cst} is
the RR between persons with and without insulin, immediately after
start of insulin:
\begin{Schunk}
\begin{Sinput}
> round( ci.exp( Pm, subset="ex" ), 3 )
\end{Sinput}
\begin{Soutput}
           exp(Est.)  2.5% 97.5%
lex.CstIns     5.632 4.430  7.16
sexF           0.674 0.622  0.73
\end{Soutput}
\end{Schunk}
We see that the effect of sex is pretty much the same as before, but
the effect of \texttt{lex.Cst} is much larger, it now refers to a
different quantity, namely the RR at \texttt{tfI}=0. If we want to see
the effect of time since insulin, it is best viewed jointly with the
effect of age:
\begin{Schunk}
\begin{Sinput}
> ndI <- data.frame( expand.grid( tfI=c(NA,seq(0,15,0.1)),
+                                 ai=seq(40,80,10) ),
+                    sex="M",
+                    lex.Cst="Ins" )
> ndI <- transform( ndI, age=ai+tfI )
> head( ndI )
\end{Sinput}
\begin{Soutput}
  tfI ai sex lex.Cst  age
1  NA 40   M     Ins   NA
2 0.0 40   M     Ins 40.0
3 0.1 40   M     Ins 40.1
4 0.2 40   M     Ins 40.2
5 0.3 40   M     Ins 40.3
6 0.4 40   M     Ins 40.4
\end{Soutput}
\begin{Sinput}
> ndA <- data.frame( age= seq(40,100,0.1), tfI=0,  lex.Cst="DM", sex="M" )
> pri <- ci.pred( Pm, ndI ) * 1000
> pra <- ci.pred( Pm, ndA ) * 1000
> matshade( ndI$age, pri, plot=TRUE, las=1,
+           xlab="Age (years)", ylab="DM mortality per 1000 PY",
+           log="y", lty=1, col="blue" )
> matshade( ndA$age, pra )
\end{Sinput}
\end{Schunk}
\insfig{ins-time}{0.8}{Mortality rates of persons on insulin, starting
insulin at ages 40,50,\ldots,80 (blue), compared with persons not on
insulin (black curve). Shaded areas are 95\% CI.}

In figure \ref{fig:ins-time}, p. \pageref{fig:ins-time}, we see that
mortality is high just after insulin start, but falls by almost a
factor 3 during the first year. Also we see that there is a tendency
that mortality in a given age is smallest for those with the longest
duration of insulin use.

\section{The Cox model}

Note that in the Cox-model the age is used as response variable,
slightly counter-intuitive. Hence the age part of the linear predictors
is not in that model:
\begin{Schunk}
\begin{Sinput}
> library( survival )
> cm <- coxph( Surv(age,age+lex.dur,lex.Xst=="Dead") ~
+              Ns(tfI,knots=i.kn) + lex.Cst + sex,
+              data = tsNA20(dmCs) )
\end{Sinput}
\end{Schunk}
There is also a multistate wrapper for Cox models, requiring a
l.h.s. side for the \texttt{formula=} argument:
\begin{Schunk}
\begin{Sinput}
> Cm <- coxph.Lexis( tsNA20(dmCs),
+                    form= age ~ Ns(tfI,knots=i.kn) + lex.Cst + sex )
\end{Sinput}
\begin{Soutput}
survival::coxph analysis of Lexis object tsNA20(dmCs):
Rates for transitions DM->Dead, Ins->Dead
Baseline timescale: age
\end{Soutput}
\begin{Sinput}
> cbind( ci.exp( cm ), ci.exp( Cm ) )
\end{Sinput}
\begin{Soutput}
                       exp(Est.)       2.5%     97.5% exp(Est.)       2.5%     97.5%
Ns(tfI, knots = i.kn)1 0.2984062 0.19417148 0.4585960 0.2984062 0.19417148 0.4585960
Ns(tfI, knots = i.kn)2 0.3871151 0.29011380 0.5165495 0.3871151 0.29011380 0.5165495
Ns(tfI, knots = i.kn)3 0.1239128 0.06287008 0.2442238 0.1239128 0.06287008 0.2442238
Ns(tfI, knots = i.kn)4 0.4405121 0.34839015 0.5569932 0.4405121 0.34839015 0.5569932
lex.CstIns             5.6700284 4.45011220 7.2243623 5.6700284 4.45011220 7.2243623
lex.CstDead            1.0000000 1.00000000 1.0000000 1.0000000 1.00000000 1.0000000
sexF                   0.6753202 0.62316569 0.7318397 0.6753202 0.62316569 0.7318397
\end{Soutput}
\end{Schunk}
We can compare the estimates from the Cox model with those from the
Poisson model --- we must add \texttt{NA}s because the Cox-model does
not give the parameters for the baseline time scale (\texttt{age}), but
also remove one of the parameters, because \texttt{coxph} parametrizes
factors (in this case \texttt{lex.Cst}) by all defined levels and not
only by the levels present in the dataset at hand (note the line of
\texttt{1.0000000}s in the print above):
\begin{Schunk}
\begin{Sinput}
> round( cbind( ci.exp( Pm ),
+        rbind( matrix(NA,5,3),
+               ci.exp( cm )[-6,] ) ), 3 )
\end{Sinput}
\begin{Soutput}
                       exp(Est.)   2.5%  97.5% exp(Est.)  2.5% 97.5%
(Intercept)                0.022  0.021  0.024        NA    NA    NA
Ns(age, knots = a.kn)1     4.208  3.546  4.993        NA    NA    NA
Ns(age, knots = a.kn)2     5.012  4.380  5.736        NA    NA    NA
Ns(age, knots = a.kn)3    16.560 14.386 19.063        NA    NA    NA
Ns(age, knots = a.kn)4     7.921  7.061  8.885        NA    NA    NA
Ns(tfI, knots = i.kn)1     0.298  0.194  0.458     0.298 0.194 0.459
Ns(tfI, knots = i.kn)2     0.385  0.289  0.514     0.387 0.290 0.517
Ns(tfI, knots = i.kn)3     0.125  0.064  0.246     0.124 0.063 0.244
Ns(tfI, knots = i.kn)4     0.438  0.346  0.553     0.441 0.348 0.557
lex.CstIns                 5.632  4.430  7.160     5.670 4.450 7.224
sexF                       0.674  0.622  0.730     0.675 0.623 0.732
\end{Soutput}
\end{Schunk}
Thus we see that the Poisson and Cox gives pretty much the same
results. You may argue that Cox requires a smaller dataset, because
there is no need to subdivide data in small intervals \emph{before}
insulin use. But certainly the time \emph{after} insulin inception need
to be if the effect of this time should be modeled.

The drawback of the Cox-modeling is that it is not possible to show
the absolute rates as we did in figure \ref{fig:ins-time} on
\pageref{fig:ins-time}.

\section{Marginal effect of time since insulin}

When we plot the marginal effect of \texttt{tfI} from the two models
we get pretty much the same; here we plot the RR relative to
\texttt{tfI}=2 years. Note that we are deriving the RR as the ratio of
two sets of predictions, from the data frames \texttt{nd} and
\texttt{nr} --- for further details consult the help page for
\texttt{ci.lin}, specifically the use of a list as the
\texttt{ctr.mat} argument:
\begin{Schunk}
\begin{Sinput}
> nd <- data.frame( tfI=seq(0,15,,151), lex.Cst="Ins", sex="M" )
> nr <- data.frame( tfI=    2         , lex.Cst="Ins", sex="M" )
> ppr <- ci.exp( pm, list(nd,nr), xvars="age" )
> cpr <- ci.exp( cm, list(nd,nr) )
> par( mar=c(3,3,1,1), mgp=c(3,1,0)/1.6, las=1, bty="n" )
> matshade( nd$tfI, cbind(ppr,cpr), plot=T,
+           lty=c(1,2), log="y",
+           xlab="Time since insulin (years)", ylab="Rate ratio")
> abline( h=1, lty=3 )
\end{Sinput}
\end{Schunk}
\insfig{Ieff}{0.8}{The naked duration effects relative to 2 years of
  duration, black from Poisson model, red from Cox model. The two sets
  of estimates are identical, and so are the standard errors, so the
  two shaded areas overlap almost perfectly.}

In figure \ref{fig:Ieff}, p. \pageref{fig:Ieff}, we see that the
duration effect is exactly the same from the two modeling approaches.

We will also want the RR relative to the non-insulin users --- recall these
are coded 0 on the \texttt{tfI} variable:
\begin{Schunk}
\begin{Sinput}
> nd <- data.frame( tfI=seq(0,15,,151), lex.Cst="Ins", sex="M" )
> nr <- data.frame( tfI=    0         , lex.Cst="DM" , sex="M" )
> ppr <- ci.exp( pm, list(nd,nr), xvars="age" )
> cpr <- ci.exp( cm, list(nd,nr) )
> par( mar=c(3,3,1,1), mgp=c(3,1,0)/1.6, las=1, bty="n" )
> matshade( nd$tfI, cbind(ppr,cpr),
+           xlab="Time since insulin (years)",
+           ylab="Rate ratio relative to non-Insulin",
+           lty=c(1,2), log="y", plot=T )
\end{Sinput}
\end{Schunk}
\insfig{IeffR}{0.8}{Insulin duration effect (state \textrm{\tt Ins})
  relative to no insulin (state \textrm{\tt DM}), black from Poisson
  model, red from Cox model. The \emph{shape} is the same as the
  previous figure, but the RR is now relative to non-insulin, instead
  of relative to insulin users at 2 years duration. The two sets of
  estimates are identical, and so are the standard errors, so the two
  shaded areas overlap almost perfectly.}

In figure \ref{fig:IeffR}, p. \pageref{fig:IeffR}, we see the effect
of increasing duration of insulin use \emph{for a fixed age} which is
a bit artificial, so we would like to see the \emph{joint} effects of
age and insulin duration. What we cannot see is how the duration
affects mortality relative to \texttt{current} age (at the age
attained at the same time as the attained \texttt{tfI}).

Another way of interpreting this curve is as the rate ratio relative to a
person not on insulin, so we see that the RR (or hazard ratio, HR as
some call it) is over 5 at the start of insulin (the \texttt{lex.Cst}
estimate), and decreases to about 1.5 in the long term.

Both figure \ref{fig:Ieff} and \ref{fig:IeffR} indicate a declining
RR by insulin duration, but only from figure \ref{fig:ins-time} it is
visible that mortality actually is \emph{in}creasing by age after some
2 years after insulin start. This point would not be available if we
had only fitted a Cox model where we did not have access to the
baseline hazard as a function of age.

\section{Age$\times$duration interaction}

The model we fitted assumes that the RR is the same regardless of the
age at start of insulin --- the hazards are multiplicative. Sometimes
this is termed the proportional hazards assumption: For \emph{any}
fixed age the HR is the same as a function of time since insulin, and
vice versa.

A more correct term would be ``main effects model'' --- there is no
interaction between age (the baseline time scale) and other
covariates. So there is really no need for the term ``proportional
hazards''; well defined and precise statistical terms for it has
existed for aeons.

\subsection{Age at insulin start}

In order to check the consistency of the multiplicativity assumption
across the spectrum of age at insulin inception, we can fit an
interaction model. One approach to this would be using a non-linear
effect of age at insulin use (for convenience we use the same knots as
for age) --- note that the prediction data frames are the same as we
used above, because we do not compute age at insulin use as a separate
variable, but rather enter it as the difference between current age
(\texttt{age}) and insulin duration (\texttt{tfI}).

At first glance we might think of doing:
\begin{Schunk}
\begin{Sinput}
> imx <- glm.Lexis( tsNA20(dmCs),
+                  formula = ~ Ns(age    ,knots=a.kn)
+                            + Ns(    tfI,knots=i.kn)
+                            + Ns(age-tfI,knots=a.kn)
+                            + lex.Cst + sex )
\end{Sinput}
\begin{Soutput}
stats::glm Poisson analysis of Lexis object tsNA20(dmCs) with log link:
Rates for transitions: DM->Dead, Ins->Dead
\end{Soutput}
\end{Schunk}
But this will fit a model with a rate-ratio between persons with and
without insulin that depends only on age at insulin start for the time
\emph{after} insulin start, the RR at \texttt{tfI}=0 will be the same
at any age, which really is not the type of interaction we wanted.

We want the \texttt{age-tfI} term to be specific for the insulin
exposed so we may use one of two other approaches, that are
conceptually alike but mathematically different:
\begin{Schunk}
\begin{Sinput}
> Im <- glm.Lexis( tsNA20(dmCs),
+                  formula = ~ Ns(age    ,knots=a.kn)
+                            + Ns(    tfI,knots=i.kn)
+                            + Ns((age-tfI)*(lex.Cst=="Ins"),knots=a.kn)
+                            + lex.Cst + sex )
\end{Sinput}
\begin{Soutput}
stats::glm Poisson analysis of Lexis object tsNA20(dmCs) with log link:
Rates for transitions: DM->Dead, Ins->Dead
\end{Soutput}
\begin{Sinput}
> im <- glm.Lexis( tsNA20(dmCs),
+                  formula = ~ Ns(age    ,knots=a.kn)
+                            + Ns(    tfI,knots=i.kn)
+                            + lex.Cst:Ns(age-tfI,knots=a.kn)
+                            + lex.Cst + sex )
\end{Sinput}
\begin{Soutput}
stats::glm Poisson analysis of Lexis object tsNA20(dmCs) with log link:
Rates for transitions: DM->Dead, Ins->Dead
\end{Soutput}
\end{Schunk}
The first model (\texttt{Im}) has a common age-effect (\texttt{Ns(age,...}) for
persons with and without diabetes and a RR depending on insulin
duration \texttt{tfI} and age at insulin (\texttt{age-tfI}). Since the
linear effect of these two terms are in the model as well, a linear
trend in the RR by current age (\texttt{age}) is accommodated as well.

The second model allows age-effects that differ non-linearly between
person with and without insulin, because the interaction term
\texttt{lex.Cst:Ns(age-tfI...} for persons not on insulin is merely an
age term (since \texttt{tfI} is coded 0 for all follow-up not on
insulin).

We can compare the models fitted:
\begin{Schunk}
\begin{Sinput}
> anova( imx, Im, im, test='Chisq')
\end{Sinput}
\begin{Soutput}
Analysis of Deviance Table

Model 1: cbind(trt(Lx$lex.Cst, Lx$lex.Xst) %in% trnam, Lx$lex.dur) ~ Ns(age,
    knots = a.kn) + Ns(tfI, knots = i.kn) + Ns(age - tfI, knots = a.kn) +
    lex.Cst + sex
Model 2: cbind(trt(Lx$lex.Cst, Lx$lex.Xst) %in% trnam, Lx$lex.dur) ~ Ns(age,
    knots = a.kn) + Ns(tfI, knots = i.kn) + Ns((age - tfI) *
    (lex.Cst == "Ins"), knots = a.kn) + lex.Cst + sex
Model 3: cbind(trt(Lx$lex.Cst, Lx$lex.Xst) %in% trnam, Lx$lex.dur) ~ Ns(age,
    knots = a.kn) + Ns(tfI, knots = i.kn) + lex.Cst:Ns(age -
    tfI, knots = a.kn) + lex.Cst + sex
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1    228734      25096
2    228733      25087  1   8.9631 0.002755
3    228730      25082  3   4.6804 0.196749
\end{Soutput}
\end{Schunk}
so we see that the models indeed are different, and moreover that the
last model does not provide substantial further improvement, by
allowing non-linear RR along the age-scale.

We can illustrate the different estimated rates from the three models
in figure \ref{fig:dur-int}, p. \pageref{fig:dur-int}:
\begin{Schunk}
\begin{Sinput}
> pxi <- ci.pred( imx, ndI )
> pxa <- ci.pred( imx, ndA )
> pIi <- ci.pred( Im , ndI )
> pIa <- ci.pred( Im , ndA )
> pii <- ci.pred( im , ndI )
> pia <- ci.pred( im , ndA )
> par( mar=c(3,3,1,1), mgp=c(3,1,0)/1.6, las=1, bty="n" )
> matshade( ndI$age, cbind( pxi, pIi, pii)*1000, plot=T, log="y",
+           xlab="Age", ylab="Mortality per 1000 PY",
+           lty=1, lwd=2, col=c("blue","forestgreen","red"), alpha=0.1 )
> matshade( ndA$age, cbind( pxa, pIa, pia)*1000,
+           lty=1, lwd=2, col=c("blue","forestgreen","red"), alpha=0.1 )
\end{Sinput}
\end{Schunk}
\insfig{dur-int}{0.8}{Age at insulin as interaction between age and
  duration. Blue curves are from the naive interaction model
  \textrm{\tt imx} with identical $\RR$ at \textrm{\tt tfI}=0 at any
  age; green curves are from the interaction model with age at
  insulin, from the model \textrm{\tt Im} with only linear
  differences by age, and red lines from the full interaction model
  \textrm{\tt im}.}

We can also plot the RRs only from these models (figure
\ref{fig:dur-int-RR}, p. \pageref{fig:dur-int-RR}); for this we need
the reference frames, and the machinery from \texttt{ci.exp} allowing
a list of two data frames:
\begin{Schunk}
\begin{Sinput}
> ndR <- transform( ndI, tfI=0, lex.Cst="DM" )
> cbind( head(ndI), head(ndR) )
\end{Sinput}
\begin{Soutput}
  tfI ai sex lex.Cst  age tfI ai sex lex.Cst  age
1  NA 40   M     Ins   NA   0 40   M      DM   NA
2 0.0 40   M     Ins 40.0   0 40   M      DM 40.0
3 0.1 40   M     Ins 40.1   0 40   M      DM 40.1
4 0.2 40   M     Ins 40.2   0 40   M      DM 40.2
5 0.3 40   M     Ins 40.3   0 40   M      DM 40.3
6 0.4 40   M     Ins 40.4   0 40   M      DM 40.4
\end{Soutput}
\begin{Sinput}
> Rxi <- ci.exp( imx, list(ndI,ndR) )
> Rii <- ci.exp( im , list(ndI,ndR) )
> RIi <- ci.exp( Im , list(ndI,ndR) )
> par( mar=c(3,3,1,1), mgp=c(3,1,0)/1.6, las=1, bty="n" )
> matshade( ndI$age, cbind( Rxi, RIi, Rii), plot=T, log="y",
+           xlab="Age (years)", ylab="Rate ratio vs, non-Insulin",
+           lty=1, lwd=2, col=c("blue","forestgreen","red"), alpha=0.1 )
> abline( h=1 )
> abline( h=ci.exp(imx,subset="lex.Cst")[,1], lty="25", col="blue" )
\end{Sinput}
\end{Schunk}
\insfig{dur-int-RR}{0.9}{RR from three different interaction
  models. The horizontal dotted line is at the estimated effect of
  \textrm{\tt lex.Cst}, to illustrate that the first model (blue)
  constrains this initial HR to be constant across age. The green
  curves are the extended interaction model, and the red the full
  one.}

\clearpage

\subsection{General interaction}

As a final illustration we may want to explore a different kind of
interaction, not defined from the duration --- here we simplify the
interaction by not using the second-last knot in the interaction terms
--- figure \ref{fig:splint}, p. \pageref{fig:splint}. Note again that
the prediction code is the same:
\begin{Schunk}
\begin{Sinput}
> gm <- glm.Lexis( tsNA20(dmCs),
+                  formula = ~ Ns(age,knots=a.kn)
+                            + Ns(tfI,knots=i.kn)
+                            + lex.Cst:Ns(age,knots=a.kn):Ns(tfI,knots=i.kn)
+                            + lex.Cst + sex )
\end{Sinput}
\begin{Soutput}
stats::glm Poisson analysis of Lexis object tsNA20(dmCs) with log link:
Rates for transitions: DM->Dead, Ins->Dead
\end{Soutput}
\begin{Sinput}
> pgi <- ci.pred( gm, ndI )
> pga <- ci.pred( gm, ndA )
> par( mar=c(3,3,1,1), mgp=c(3,1,0)/1.6, las=1, bty="n" )
> matshade( ndI$age, cbind( pgi, pii )*1000,  plot=T,
+           lty=c("solid","21"), lend="butt", lwd=2, log="y",
+           xlab="Age (years)", ylab="Mortality rates per 1000 PY",
+           alpha=c(0.2,0.1), col=c("black","red") )
> matshade( ndA$age, cbind( pga, pia )*1000,
+           lty=c("solid","21"), lend="butt", lwd=2,
+           alpha=c(0.2,0.1), col=c("black","red") )
\end{Sinput}
\end{Schunk}
\insfig{splint}{0.8}{Spline-by-spline interaction between age and
  duration (model \textrm{\tt gm}, black), and the interaction using a
  non-linear effect of age at entry (model \textrm{\tt im}, red),
  corresponding to the red curves in figure \ref{fig:dur-int}.}
This is in figure \ref{fig:splint}, p. \pageref{fig:splint}.

\subsection{Evaluating interactions}

Here we see that the interaction effect is such that in the older ages
the length of insulin use has an increasing effect on mortality.

Even though there is no statistically significant interaction between
age and time since start of insulin, it would be illustrative to show
the RR as a function of age at insulin and age at follow-up:
\begin{Schunk}
\begin{Sinput}
> ndR <- transform( ndI, lex.Cst="DM", tfI=0 )
> iRR <- ci.exp( im, ctr.mat=list(ndI,ndR) )
> gRR <- ci.exp( gm, ctr.mat=list(ndI,ndR) )
> par( mar=c(3,3,1,1), mgp=c(3,1,0)/1.6, las=1, bty="n" )
> matshade( ndI$age, cbind(gRR,iRR), lty=1, log="y", plot=TRUE,
+           xlab="Age (years)", ylab="Rate ratio: Ins vs. non-Ins",
+           col=c("black","red") )
> abline( h=1 )
\end{Sinput}
\end{Schunk}
\insfig{RR-int}{0.8}{The effect of duration of insulin use at
  different ages of follow-up (and age at insulin start). Estimates
  are from the model with an interaction term using a non-linear
  effect of age at insulin start (model \textrm{\tt im}, red) and
  using a general spline interactions (model \textrm{\tt gm},
  black). It appears that the general interaction over-models a bit.}
This is in figure \ref{fig:RR-int}, p. \pageref{fig:RR-int}.

The advantage of the parametric modeling (be that with age at insulin
or general spline interaction) is that it is straight-forward to
\emph{test} whether we have an interaction.

\section{Separate models}

In the above we insisted on making a joint model for the
\texttt{DM}$\rightarrow$\texttt{Dead} and the
\texttt{Ins}$\rightarrow$\texttt{Dead}
transitions, but with the complications demonstrated it would actually
have been more sensible to model the two transitions separately:
\begin{Schunk}
\begin{Sinput}
> dmd <- glm.Lexis( dmCs,
+                   from="DM", to="Dead",
+                   formula = ~ Ns(age,knots=a.kn)
+                             + sex )
\end{Sinput}
\begin{Soutput}
stats::glm Poisson analysis of Lexis object dmCs with log link:
Rates for the transition: DM->Dead
\end{Soutput}
\begin{Sinput}
> ind <- glm.Lexis( dmCs,
+                   from="Ins", to="Dead",
+                   formula = ~ Ns(age,knots=a.kn)
+                             + Ns(tfI,knots=i.kn)
+                             + Ns(age-tfI,knots=a.kn)
+                             + sex )
\end{Sinput}
\begin{Soutput}
stats::glm Poisson analysis of Lexis object dmCs with log link:
Rates for the transition: Ins->Dead
\end{Soutput}
\begin{Sinput}
> ini <- ci.pred( ind, ndI )
> dmi <- ci.pred( dmd, ndI )
> dma <- ci.pred( dmd, ndA )
\end{Sinput}
\end{Schunk}
The estimated mortality rates are shown in figure \ref{fig:sep-mort},
p. \pageref{fig:sep-mort}, using:
\begin{Schunk}
\begin{Sinput}
> par(mar=c(3,3,1,1),mgp=c(3,1,0)/1.6,las=1,bty="n")
> matshade( ndI$age, ini*1000, plot=TRUE, log="y",
+           xlab="Age (years)", ylab="Mortality rates per 1000 PY",
+           lwd=2, col="red" )
> matshade( ndA$age, dma*1000,
+           lwd=2, col="black" )
\end{Sinput}
\end{Schunk}
The estimated RRs are computed using that the estimates from the two
models are uncorrelated, and hence qualify for \texttt{ci.ratio} (this
and the previous graph
appear in figure \ref{fig:Ins-noIns}, p. \pageref{fig:Ins-noIns})
\begin{Schunk}
\begin{Sinput}
> par(mar=c(3,3,1,1),mgp=c(3,1,0)/1.6,las=1,bty="n")
> matshade( ndI$age, ci.ratio(ini,dmi), plot=TRUE, log="y",
+           xlab="Age (years)", ylab="RR insulin vs. no insulin",
+           lwd=2, col="red" )
> abline( h=1 )
\end{Sinput}
\end{Schunk}
\begin{figure}[tb]
\centering
\includegraphics[width=0.49\textwidth]{flup-sep-mort}
\includegraphics[width=0.49\textwidth]{flup-sep-HR}
\caption{\it Left panel: Mortality rates from separate models for the
  two mortality transitions; the \textrm{\tt
    DM}$\rightarrow$\textrm{\tt Dead} transition modeled by age alone;
  \textrm{\tt Ins}$\rightarrow$\textrm{\tt Dead} transition modeled
  with spline effects of current age, time since insulin and and age
  at insulin. \newline Right panel: Mortality HR of insulin vs. no insulin.}
\label{fig:Ins-noIns}
\end{figure}

\chapter{More states}

\section{Subdividing states}

It may be of interest to subdivide the states following the
intermediate event according to whether the event has occurred or
not. This will enable us to address the question of the fraction of
the patients that ever go on insulin.

This is done by the argument \texttt{split.states=TRUE}.
\begin{Schunk}
\begin{Sinput}
> dmCs <- cutLexis( data = dmS2,
+                     cut = dmS2$doins,
+               timescale = "per",
+               new.state = "Ins",
+               new.scale = "tfI",
+        precursor.states = "DM",
+            split.states = TRUE )
> summary( dmCs )
\end{Sinput}
\begin{Soutput}
Transitions:
     To
From     DM  Ins Dead Dead(Ins)  Records:  Events: Risk time:  Persons:
  DM  35135 1694 2048         0     38877     3742   45885.49      9899
  Ins     0 5762    0       451      6213      451    8387.77      1791
  Sum 35135 7456 2048       451     45090     4193   54273.27      9996
\end{Soutput}
\end{Schunk}
We can illustrate the numbers and the transitions (figure
\ref{fig:box4}, p. \pageref{fig:box4})
\begin{Schunk}
\begin{Sinput}
> boxes( dmCs, boxpos=list(x=c(15,15,85,85),
+                          y=c(85,15,85,15)),
+        scale.R=1000, show.BE=TRUE )
\end{Sinput}
\end{Schunk}
\insfig{box4}{0.7}{Transitions between 4 states: the numbers \emph{in}
  the boxes are person-years (middle), and below the number of persons
  who start, respectively end their follow-up in each of the states.}

Note that it is only the mortality rates that we have been modeling,
namely the transitions \texttt{DM}$\rightarrow$\texttt{Dead}
and \texttt{Ins}$\rightarrow$\texttt{Dead(Ins)}.
If we were to model the cumulative risk of using insulin we would also
need a model for the DM$\rightarrow$Ins
transition. Subsequent to that we would then compute the probability
of being in each state conditional on suitable starting
conditions. With models where transition rates depend on several time
scales this is not a trivial task. This is treated in more detail in the
vignette on \texttt{simLexis}.

\section{Multiple intermediate events}

We may be interested in starting either insulin or OAD (oral
anti-diabetic drugs), thus giving rise to more states and more
time scales. This can be accomplished by the \texttt{mcutLexis}
function, that generalizes \texttt{cutLexis}:
\begin{Schunk}
\begin{Sinput}
> dmM <- mcutLexis( dmL,
+              timescale = "per",
+                     wh = c("doins","dooad"),
+             new.states = c("Ins","OAD"),
+             new.scales = c("tfI","tfO"),
+       precursor.states = "DM",
+           ties.resolve = TRUE )
\end{Sinput}
\begin{Soutput}
NOTE: 15 records with tied events times resolved (adding 0.01 random uniform),
      so results are only reproducible if the random number seed was set.
\end{Soutput}
\begin{Sinput}
> summary( dmM, t=T )
\end{Sinput}
\begin{Soutput}
Transitions:
     To
From        DM Dead  OAD  Ins OAD-Ins Ins-OAD  Records:  Events: Risk time:  Persons:
  DM      2830 1056 2957  689       0       0      7532     4702   22920.34      7532
  OAD        0  992 3327    0    1005       0      5324     1997   22965.24      5324
  Ins        0  152    0  462       0     172       786      324    3883.06       786
  OAD-Ins    0  266    0    0     739       0      1005      266    3770.53      1005
  Ins-OAD    0   33    0    0       0     139       172       33     734.09       172
  Sum     2830 2499 6284 1151    1744     311     14819     7322   54273.27      9996

Timescales:
  per   age   tfD   tfI   tfO
   ""    ""    "" "Ins" "OAD"
\end{Soutput}
\end{Schunk}
We see that we now have two time scales defined as entry since into
states.
\begin{Schunk}
\begin{Sinput}
> wh <- c(subset(dmM,lex.Cst=="Ins-OAD")$lex.id[1:2],
+         subset(dmM,lex.Cst=="OAD-Ins")$lex.id[1:2])
> options( width=110 )
> print( subset( dmM, lex.id %in% wh )[,c('lex.id',names(dmM[1:8]),c("doins","dooad"))],
+        digits=6, row.names=FALSE )
\end{Sinput}
\begin{Soutput}
 lex.id        tfI     tfO     per     age       tfD    lex.dur lex.Cst lex.Xst   doins   dooad
     18         NA      NA 1996.75 61.7221 0.0000000 1.16906229      DM     OAD 2005.99 1997.92
     18         NA 0.00000 1997.92 62.8912 1.1690623 8.07939767     OAD OAD-Ins 2005.99 1997.92
     18 0.00000000 8.07940 2005.99 70.9706 9.2484600 4.00273785 OAD-Ins OAD-Ins 2005.99 1997.92
     25         NA      NA 2003.69 60.3422 0.0000000 1.88090349      DM     OAD 2008.64 2005.57
     25         NA 0.00000 2005.57 62.2231 1.8809035 3.06913073     OAD OAD-Ins 2008.64 2005.57
     25 0.00000000 3.06913 2008.64 65.2923 4.9500342 1.35797399 OAD-Ins OAD-Ins 2008.64 2005.57
     20         NA      NA 2009.25 53.2183 0.0000000 0.04071988      DM     Ins 2009.29 2009.29
     20 0.00000000      NA 2009.29 53.2591 0.0407199 0.00131847     Ins Ins-OAD 2009.29 2009.29
     20 0.00131847 0.00000 2009.29 53.2604 0.0420383 0.70813277 Ins-OAD Ins-OAD 2009.29 2009.29
     38         NA      NA 2008.37 63.9316 0.0000000 0.09308693      DM     Ins 2008.46 2008.67
     38 0.00000000      NA 2008.46 64.0246 0.0930869 0.21355236     Ins Ins-OAD 2008.46 2008.67
     38 0.21355236 0.00000 2008.67 64.2382 0.3066393 1.32511978 Ins-OAD    Dead 2008.46 2008.67
\end{Soutput}
\end{Schunk}
We can also illustrate the transitions to the different states, as in
figure \ref{fig:mbox}:
\begin{Schunk}
\begin{Sinput}
> boxes( dmM, boxpos=list(x=c(15,80,40,40,85,85),
+                         y=c(50,50,90,10,90,10)),
+             scale.R=1000, show.BE=TRUE )
\end{Sinput}
\end{Schunk}
\insfig{mbox}{1.0}{Boxes for the \textrm{\tt dmM} object. The numbers
  \emph{in} the boxes are person-years (middle), and below the number
  of persons who start, respectively end their follow-up in each of
  the states.}
We may not be interested in whether persons were prescribed insulin
before OAD or vice versa, in which case we would combine the levels
with both insulin and OAD to one, regardless of order (figure
\ref{fig:mboxr}):
\begin{Schunk}
\begin{Sinput}
> summary( dmMr <- Relevel( dmM, list('OAD+Ins'=5:6), first=FALSE) )
\end{Sinput}
\begin{Soutput}
Transitions:
     To
From        DM Dead  OAD  Ins OAD+Ins  Records:  Events: Risk time:  Persons:
  DM      2830 1056 2957  689       0      7532     4702   22920.34      7532
  OAD        0  992 3327    0    1005      5324     1997   22965.24      5324
  Ins        0  152    0  462     172       786      324    3883.06       786
  OAD+Ins    0  299    0    0     878      1177      299    4504.62      1177
  Sum     2830 2499 6284 1151    2055     14819     7322   54273.27      9996
\end{Soutput}
\begin{Sinput}
> boxes( dmMr, boxpos=list(x=c(15,50,15,85,85),
+                          y=c(85,50,15,85,15)),
+              scale.R=1000, show.BE=TRUE )
\end{Sinput}
\end{Schunk}
\insfig{mboxr}{1.0}{Boxes for the \textrm{\tt dmMr} object with
  collapsed states. The numbers \emph{in} the boxes are person-years
  (middle), and below the number of persons who start, respectively
  end their follow-up in each of the states.}

Diagrams as those in figures
\ref{fig:mbox} and
\ref{fig:mboxr} gives an overview of the possible transitions,
which states it might be relevant to collapse, and which transitions
to model and how.

The actual modeling of the transition rates is straightforward;
split the data along some timescale and then use \texttt{glm.Lexis} or
\texttt{gam.Lexis}, where it is possible to select the transitions
modelled. This is also possible with the \texttt{coxph.Lexis}
function, but it requires that a single time scale be selected as the
baseline time scale, and the effect of this will not be accessible.

\chapter{\texttt{Lexis} functions}

The \texttt{Lexis} machinery has evolved over time since it was first
introduced in a workable version in \texttt{Epi\_1.0.5} in August 2008.

Over the years there have been additions of tools for handling
multistate data. Here is a list of the current functions relating to
\texttt{Lexis} objects with a very brief description; it does not
replace the documentation. Unless otherwise stated, functions named
\texttt{something.Lexis} (with a ``\texttt{.}'')  are S3 methods for
\texttt{Lexis} objects, so you can skip the ``\texttt{.Lexis}'' in
daily use.

\setlist{noitemsep}
\begin{description}

\item[Define]\ \\
\begin{description}
\item[\texttt{Lexis}] defines a \texttt{Lexis} object
\end{description}

\item[Cut and split]\ \\
\begin{description}
\item[\texttt{cutLexis}] cut follow-up at intermediate event
\item[\texttt{mcutLexis}] cut follow-up at several intermediate events
\item[\texttt{countLexis}] cut follow-up at intermediate event count
  the no. events so far
\item[\texttt{splitLexis}] split follow up along a time scale
\item[\texttt{splitMulti}] split follow up along a time scale --- from
  the \texttt{popEpi} package, faster and has simpler syntax than
  \texttt{splitLexis}
\item[\texttt{addCov.Lexis}] add clinical measurements at a given date to a
  \texttt{Lexis} object
\end{description}

\item[Boxes and plots]\ \\
\begin{description}
\item[\texttt{boxes.Lexis}] draw a diagram of states and transitions
\item[\texttt{plot.Lexis}] draw a standard Lexis diagram
\item[\texttt{points.Lexis}] add points to a Lexis diagram
\item[\texttt{lines.Lexis}] add lines to a Lexis diagram
\item[\texttt{PY.ann.Lexis}] annotate life lines in a Lexis diagram
\end{description}

\item[Summarize and query]\ \\
\begin{description}
\item[\texttt{summary.Lexis}] overview of transitions, risk time etc.
\item[\texttt{levels.Lexis}] what are the states in the \texttt{Lexis} object
\item[\texttt{nid.Lexis}] number of persons in the \texttt{Lexis}
  object --- how many unique values of \texttt{lex.id} are present
\item[\texttt{entry}] entry time
\item[\texttt{exit}] exit time
\item[\texttt{status}] status at entry or exit
\item[\texttt{timeBand}] factor of time bands
\item[\texttt{timeScales}] what time scales are in the \texttt{Lexis} object
\item[\texttt{timeSince}] what time scales are defined as time since a given state
\item[\texttt{breaks}] what breaks are currently defined
\item[\texttt{absorbing}] what are the absorbing states
\item[\texttt{transient}] what are the transient states
\item[\texttt{preceding}, \texttt{before}] which states precede this
\item[\texttt{succeeding}, \texttt{after}] which states can follow this
\item[\texttt{tmat.Lexis}] transition matrix for the \texttt{Lexis} object
\end{description}

\item[Manipulate]\ \\
\begin{description}
\item[\texttt{subset.Lexis}, \texttt{[}] subset of a \texttt{Lexis} object
\item[\texttt{merge.Lexis}] merges a \texttt{Lexis} objects with a
  \texttt{data.frame}
\item[\texttt{cbind.Lexis}] bind a \texttt{data.frame} to a \texttt{Lexis} object
\item[\texttt{rbind.Lexis}] put two \texttt{Lexis} objects head-to-foot
\item[\texttt{transform.Lexis}] transform and add variables
\item[\texttt{tsNA20}] turn \texttt{NA}s to 0s for time scales
\item[\texttt{Relevel.Lexis}, \texttt{factorize.Lexis}] reorder and
  combine states
\item[\texttt{rm.tr}] remove transitions from a \texttt{Lexis} object
\item[\texttt{bootLexis}] bootstrap sample of \emph{persons}
  (\texttt{lex.id}) in the \texttt{Lexis} object
\end{description}

\item[Simulate]\ \\
\begin{description}
\item[\texttt{simLexis}] simulate a \texttt{Lexis} object from
  specified transition rate models
\item[\texttt{nState}, \texttt{pState}] count state occupancy from a
  simulated \texttt{Lexis} object
\item[\texttt{plot.pState}, \texttt{lines.pState}] plot state occupancy from a
  \texttt{pState} object
\end{description}

\item[Stack]\ \\
\begin{description}
\item[\texttt{stack.Lexis}] make a stacked object for simultaneous
  analysis of transitions --- returns a \texttt{stacked.Lexis} object
\item[\texttt{subset.stacked.Lexis}] subsets of a \texttt{stacked.Lexis} object
\item[\texttt{transform.stacked.Lexis}] transform  a \texttt{stacked.Lexis} object
\end{description}

\item[Interface to other packages]\ \\
\begin{description}
\item[\texttt{msdata.Lexis}] interface to \texttt{mstate} package
\item[\texttt{etm.Lexis}] interface to \texttt{etm} package
\item[\texttt{crr.Lexis}] interface to \texttt{cmprsk} package
\end{description}

\item[Statistical models] --- these are \emph{not} S3 methods
\begin{description}
\item[\texttt{glm.Lexis}] fit a \texttt{glm} model using the
  \texttt{poisreg} family to (hopefully) time split data
\item[\texttt{gam.Lexis}] fit a \texttt{gam} model (from the
  \texttt{mgcv} package) using the \texttt{poisreg} family to
  (hopefully) time split data
\item[\texttt{coxph.Lexis}] fit a Cox model to follow-up in a
  \texttt{Lexis} object
\end{description}

\end{description}

\renewcommand{\bibname}{References}

\bibliographystyle{plain}
\begin{thebibliography}{1}

\bibitem{Carstensen.2011a}
B~Carstensen and M~Plummer.
\newblock Using {L}exis objects for multi-state models in {R}.
\newblock {\em Journal of Statistical Software}, 38(6):1--18, 1 2011.

%% \bibitem{Iacobelli.2013}
%% S~Iacobelli and B~Carstensen.
%% \newblock {Multiple time scales in multi-state models}.
%% \newblock {\em Stat Med}, 32(30):5315--5327, Dec 2013.

\bibitem{Plummer.2011}
M~Plummer and B~Carstensen.
\newblock Lexis: An {R} class for epidemiological studies with long-term
  follow-up.
\newblock {\em Journal of Statistical Software}, 38(5):1--12, 1 2011.

\end{thebibliography}

\addcontentsline{toc}{chapter}{\bibname}

\end{document}