
# An introduction to survival analysis

It’s one of mankind’s oldest questions – how long will it run? How long
will I live? Predicting the length of life has occupied thinkers,
scientists, and everybody else throughout history.

What factors influence survival? Why do some people with similar
characteristics have different lifespans? The Bills of Mortality,
published in London from the 1590s through the 1800s, were among the
first comprehensive records of causes of death. These were regular
compendiums collected from London’s parishes. Back in 1801, the bills
recorded a wide variety of causes, ranging from the expected (“Aged”)
to ones that have largely been eliminated (“Small pox”, “Consumption”),
to the downright scary (“Evil”).

## Beginnings

More formal attempts at survival analysis date back to life tables
constructed in the 17th century by Graunt and Halley. Halley (of comet
fame) used them to compute annuity values, among other things. His data
came from Breslau, Germany, and was published in 1693:

library(tibble)
print(halley2[, 1:8])

##   Age Persons Age.1 Persons.1 Age.2 Persons.2 Age.3 Persons.3
## 1   1    1000     8       680    15       628    22       586
## 2   2     855     9       670    16       622    23       579
## 3   3     798    10       661    17       616    24       573
## 4   4     760    11       653    18       610    25       567
## 5   5     732    12       646    19       604    26       560
## 6   6     710    13       640    20       598    27       553
## 7   7     692    14       634    21       592    28       546


Basically, one keeps track of the number of individuals alive at each
year (or other interval) of age. From this, we can compute a simple
statistic, a Life Table Estimator, that gives the probability, at
birth, of surviving $T$ years or more:

$$\hat{S}(T) = \prod_{t = 0}^{T-1} \left( 1 - \frac{d_t}{n_t} \right)$$

where $n_t$ is the number alive at age $t$ and $d_t$ is the number
who die before reaching age $t + 1$.
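As a rough illustration, the estimator can be worked through by hand on the first seven rows of Halley's table printed above. This is only a sketch: the variable names are made up for the example, and because the table starts at age 1 rather than at birth, the result is the estimated probability of surviving from age 1 to each later age.

```r
# First seven rows of Halley's table, copied from the printout above
age     <- 1:7
persons <- c(1000, 855, 798, 760, 732, 710, 692)

# d_t = n_t - n_{t+1}: deaths between age t and age t + 1
deaths  <- -diff(persons)
at_risk <- head(persons, -1)

# Each factor (1 - d_t / n_t) is the chance of surviving one more year
step_surv <- 1 - deaths / at_risk

# Running product: estimated probability of surviving from age 1 to each later age
surv <- cumprod(step_surv)

life_table <- data.frame(from_age = head(age, -1), to_age = age[-1],
                         at_risk, deaths, step_surv, surv)
print(round(life_table, 3))
```

Because $d_t = n_t - n_{t+1}$, each factor equals $n_{t+1} / n_t$ and the product telescopes, so with complete data like this the estimate is simply the number alive at the later age divided by the number alive at the starting age.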

This is a cross-sectional estimate of survival, and assumes that the
survival rate doesn’t change over the time frame covered by the data.
This may or may not be a good assumption – certainly unlikely to be true
for many societies today, as living conditions continue to improve in
many parts of the world. Plotting a smoothed version of survival
$S$ (Figure 2) shows a sharp drop in survival for
infants, and then an almost constant decline all the way out to age 80
or so (granted, this was in the late 1600s).

library(dplyr)    # provides %>% and filter()
library(ggplot2)

x9 <- halley
x9$AtRisk <- x9$Persons
x9$Deceased <- -c(0, diff(x9$AtRisk, 1))    # deaths in the year leading up to each age
x9$Rate <- 1 - x9$Deceased / (x9$AtRisk + x9$Deceased)    # share of those alive a year earlier who survived
x9$SurvRate <- cumprod(x9$Rate)
x9 %>% ggplot() + geom_line(aes(x = Age, y = SurvRate)) +
  ggtitle("Figure 2. Smoothed Survival Curve")

From the same data we can also plot (Figure 3) the mortality or hazard
rate, that is, the rate of death among those still alive at a given
age. It shows the classic bathtub-curve shape: a sharp drop in
mortality after birth, then a near constant period corresponding to the
prime of life, with mortality again increasing as old age approaches.

plot2 <- x9 %>% dplyr::filter(Age > 1, Age < 84) %>%
  ggplot() + geom_line(aes(x = Age, y = 1 - Rate), color = "red")
plot2 + ylab("Smoothed Mortality Rate") + ggtitle("Figure 3. Smoothed Hazard Rate Curve")

## Modern survival analysis

The field of survival analysis has come a long way since these and
other pioneering efforts. With the explosion of mathematical and
statistical theory in the 20th century and the ongoing advances in
computing, we are now able to analyze large quantities of survival and
reliability data and assess the underlying causes of death or failure.
Insurance, manufacturing, and medicine all rely on statistical models
of frailty and survival to inform business decisions, maintenance
schedules, and patient treatment. Survival analysis is a vital and
burgeoning area of research, and new methodologies are continually
emerging. Using methods analogous to those found in linear regression,
we can assess differences in survival based on different explanatory or
environmental factors.

As an example, consider data collected on lung cancer deaths by age and
gender:

library(survival)
head(lung)

##   inst time status age sex ph.ecog ph.karno pat.karno meal.cal wt.loss
## 1    3  306      2  74   1       1       90       100     1175      NA
## 2    3  455      2  68   1       0       90        90     1225      15
## 3    3 1010      1  56   1       0       90        90       NA      15
## 4    5  210      2  57   1       1       90        60     1150      11
## 5    1  883      2  60   1       0      100        90       NA       0
## 6   12 1022      1  74   1       1       50        80      513       0

The Kaplan-Meier Estimator, a modern version of the life table
estimator that also handles censored observations, gives the overall
survival curve in Figure 4:

library(survminer)    # provides ggsurvplot()
lung$SurvObj <- with(lung, Surv(time, status == 2))    # status == 2 marks a death, 1 a censored observation
lung.S1 <- survfit(SurvObj ~ 1, data = lung, conf.type = "log-log")
ggsurvplot(lung.S1, data = lung, conf.int = TRUE, title = "Figure 4. Overall survival (days)")
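To make the link to the life table estimator concrete, here is a minimal sketch that computes the Kaplan-Meier product by hand and compares it with `survfit()`. The six-row data frame `toy` is hypothetical, invented only for this illustration; it is not drawn from the `lung` data.

```r
library(survival)

# Hypothetical toy data: follow-up time in days and an event indicator
# (1 = death observed, 0 = censored). Invented for illustration only.
toy <- data.frame(
  time   = c(5, 8, 8, 12, 15, 20),
  status = c(1, 1, 0, 1, 0, 1)
)

# Distinct times at which at least one death was observed
event_times <- sort(unique(toy$time[toy$status == 1]))

# n_i: subjects still under observation just before each event time
at_risk <- sapply(event_times, function(tm) sum(toy$time >= tm))
# d_i: deaths observed at each event time
deaths <- sapply(event_times, function(tm) sum(toy$time == tm & toy$status == 1))

# Kaplan-Meier estimate: running product of (1 - d_i / n_i)
km_manual <- cumprod(1 - deaths / at_risk)

# The same estimate from survfit(), evaluated at the event times
fit <- survfit(Surv(time, status) ~ 1, data = toy)
km_survfit <- summary(fit, times = event_times)$surv

print(data.frame(time = event_times, at_risk, deaths, km_manual, km_survfit))
```

The only change from the life table calculation is that censored subjects leave the risk set without counting as deaths, so the factors are computed at observed event times rather than at every age.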


A natural question is what effect, if any, gender has on the survival
of lung cancer patients. This can be assessed graphically (in Figure 5)
as well as statistically, by fitting separate survival curves for men
(sex = 1, red) and women (sex = 2, green).

lung.S2 <- survfit(SurvObj ~ sex, data = lung,
conf.type = "log-log")
lung.S2

## Call: survfit(formula = SurvObj ~ sex, data = lung, conf.type = "log-log")
##
##         n events median 0.95LCL 0.95UCL
## sex=1 138    112    270     210     306
## sex=2  90     53    426     345     524

ggsurvplot(lung.S2, data = lung, conf.int = TRUE, title = "Figure 5. Survival curves by gender (days)")


In this case, women have a median survival time of 426 days vs. 270 for
men. The 95% confidence intervals for median survival (345 to 524 days
for women, 210 to 306 days for men) do not overlap, which indicates
that the difference is statistically significant at the 5% level.
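Comparing confidence intervals is a quick visual check. A more formal comparison of the two curves is the log-rank test, available as `survdiff()` in the survival package. The sketch below is one way to run it on the same data; the p-value is computed by hand from the chi-square statistic, on the assumption of one degree of freedom for two groups.

```r
library(survival)

# Log-rank test of equal survival curves for the two sexes in the lung data
# (status == 2 marks an observed death, as in the Surv() call above)
lr <- survdiff(Surv(time, status == 2) ~ sex, data = lung)
print(lr)

# Chi-square statistic on 1 degree of freedom (two groups) and its p-value
p_value <- pchisq(lr$chisq, df = 1, lower.tail = FALSE)
cat("log-rank chi-square:", round(lr$chisq, 2),
    " p-value:", signif(p_value, 3), "\n")
```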

In closing, this blog post has only scratched the surface of survival
analysis techniques. More sophisticated models for survival include:

• parametric models (used especially in manufacturing and engineering
reliability studies)
• semi-parametric models (such as the Cox Proportional Hazards Model,
which allows for analysis of censored data)
• competing risk models
• Bayesian models
• models with time-varying covariates and parameters
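
As a small taste of the semi-parametric approach, the following sketch fits a Cox Proportional Hazards Model to the `lung` data with `coxph()` from the survival package. The choice of `sex` and `age` as covariates is an assumption made for illustration, not a recommended model.

```r
library(survival)

# Cox proportional hazards model on the lung data: hazard of death as a
# function of sex and age (status == 2 marks an observed death)
cox_fit <- coxph(Surv(time, status == 2) ~ sex + age, data = lung)

# exp(coef) is the hazard ratio for a one-unit increase in each covariate
hr <- exp(coef(cox_fit))
ci <- exp(confint(cox_fit))
print(round(cbind(hazard_ratio = hr, ci), 3))
```

A hazard ratio below 1 for `sex` would point the same way as the Kaplan-Meier curves: a lower hazard of death for women (sex = 2) than for men, here after adjusting for age.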

A good place to start for further research is to look at the R package
survival by Therneau. A good general reference for survival analysis
methodology is “Survival Analysis: Techniques for Censored and Truncated
Data” by Klein and Moeschberger.
