Why you shouldn’t “hold-out” data in survival-model predictions

In nearly all cases, the proper way to make predictions for a subset of your data is to hold out the data you want to predict, train a model on the remaining data, and then predict the outcome for the held-out data using the trained model. The reason is that this procedure ostensibly captures how you would use the model in practice: train on all the data you have, then predict for new data where the outcome is unknown. Cross-validation follows this procedure as well. However, that logic (slightly) broke down for an assignment in a class I TA'ed this semester. The confusion was common enough that I thought it warranted a deeper explanation. This post summarizes an answer I gave during office hours; it assumes an advanced undergraduate background in statistics, along with familiarity with Bayesian methods.

Suppose you are modeling the lifespan of world leaders. You are given a dataset of Popes, US Presidents, Dalai Lamas, Japanese Emperors, and Chinese Emperors. The dataset includes various demographic information: how long each leader lived, the age and year they assumed office, the position held, the year they died, and whether they are currently living. The task given to students was to predict how much longer the currently living leaders would survive (5 Presidents, 2 Japanese Emperors, 2 Popes, and 1 Dalai Lama).1 Should you train a model on the deceased leaders, then predict the lifespan of the living leaders? Many students took this approach. The answer, as you can surmise from the fact that this post exists, is no. You can, and should, train a lifespan model using the data from living leaders as well.

But first, some notation. In the traditional hold-out method, you pretend you do not know the outcome \(Y_i\) for some data \(i\) in your hold-out set and you predict that \(Y_i\) using \(X_i\), covariate information on unit \(i\), and \(\{Y_j,X_j\}\) for \(j\) in the observed or training data. That is, you build a model that estimates \(Y_j\) given data \(X_j\), then apply that model to the hold-out data \(X_i\) to get an estimate of \(Y_i\). I will refer to the set of all fully observed data \(\{Y_j,X_j\}\) as \(Y^{obs}\).

To make this a little more concrete, suppose you believe that lifespan for leaders follows a log-normal distribution, such that \(\log(Y_i) \sim N(\beta_0 + \beta_1 X_{i1} + \ldots,\sigma^2)\). That is, the mean is a linear function of the predictors with a common variance term.2 The form of the model is not particularly important here, just that it has some sort of structure. If we knew the parameters \(\beta\) and \(\sigma^2\), then we wouldn't need any training data at all. We would know the underlying process and could simply predict lifespans for living leaders using the parameters.
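As a concrete sketch of that "known parameters" case: with \(\beta\) and \(\sigma^2\) in hand, prediction is just simulation from the assumed log-normal distribution. The parameter values and the single covariate below are invented purely for illustration, not fit to any leader data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "known" parameters of the log-normal lifespan model:
# log(Y) ~ N(beta0 + beta1 * x, sigma^2), where x is a single covariate
# (say, standardized age at assuming office). All values are illustrative.
beta0, beta1, sigma = 4.2, 0.05, 0.15

def predict_lifespans(x, n_draws=10_000):
    """Simulate lifespans for a leader with covariate x, given known parameters."""
    mu = beta0 + beta1 * x
    return np.exp(rng.normal(mu, sigma, size=n_draws))

draws = predict_lifespans(x=0.0)
# The mean of a log-normal is exp(mu + sigma^2 / 2), so the simulated
# average should sit near that value for x = 0.
print(draws.mean())
```

No training data enters this function at all; the entire exercise below is about what to do when `beta0`, `beta1`, and `sigma` must be learned.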

Of course, we don't know the parameters. But we can learn the parameters from the training data and use them to predict the outcome. In Bayesian statistics this quantity is called the posterior-predictive distribution. We are interested in describing \(p(Y_i|X_i,Y^{obs})\), our beliefs or uncertainty about \(Y_i\) from the hold-out set given our observed data \(Y^{obs}\). Omitting \(X_i\) for clarity, this quantity can be analytically expressed as the following:

\[p(Y_i|Y^{obs}) = \int p(Y_i|\beta,\sigma^2) p(\beta,\sigma^2|Y^{obs}) \ d\beta d\sigma^2\]

Essentially, \(p(Y_i|Y^{obs})\) is a weighted average of the assumed distribution, in this case log-normal, over the parameter space with parameter weights determined by their posterior density. \(p(\beta,\sigma^2|Y^{obs})\) is the posterior distribution for the parameters given only the training data. Given samples from the posterior, one can sample from \(p(Y_i|\beta,\sigma^2)\) to approximate the posterior predictive distribution.

If you know nothing about \(Y_i\) then the hold-out method is correct and really the only option. You can't learn from observations \(i\) where you don't know anything about the outcome.

However, for survival data we do know something about the outcome. We know living leaders will live to at least their current age. The you really want to estimate \(p(Y_i|X_i,Y^{obs},\mathbf{Y_i >c_i})\) where \(c_i\) is the living leader's current age. You wouldn't want to predict Jimmy Carter would only live to be 94 because he is currently 96. The expression from above becomes the following:

\[p(Y_i|Y^{obs},Y_i>c_i) = \int p(Y_i|\beta,\sigma^2,Y_i>c_i) p(\beta,\sigma^2|Y^{obs},Y_i>c_i) \ d\beta d\sigma^2\]

The key difference is the term \(p(\beta,\sigma^2|Y^{obs},Y_i>c_i)\). This is still a posterior distribution but it is not the same posterior distribution as before because it includes the information on additional leaders who have lived at least \(c_i\) years. There are five currently-living Presidents and the fact that they have reached their current ages should inform your beliefs about word leader life expectancy. If you use the hold-out method, you might predict a currently-living leader will die in the past, which is obviously wrong. If you simply force your predictions to predict time-of-deaths in the future, then you have trained your model on incomplete data and used the wrong posterior distribution. You modeled \(Y_i|Y^{obs},Y_i>c_i\) but learned your parameters \(\beta\) and \(\sigma^2\) only from \(Y^{obs}\) rather than \(Y^{obs}\) and \(Y_i>c_i\). I've used Bayesian formulations here because they provide nice ways to estimate survival models and make the distinction between the input data clear, but the logic applies to any estimation of future event times.

Hold-out predictions and cross-validation procedures can be deceptively complex. Your predictive model should replicate how you will actually use it in practice. If you want to predict event times in the future, you should include in your model that those events have not happened yet. Including that information can be hard and may require a more complex estimation of the parameters given the data because the likelihood is now a product of densities \(p(Y_j)\) and survival functions \(P(Y_i>c_i)\).3 But it is certainly the “correct” way to do it because it includes all of the data currently available.


  1. The actual assignment had more components and is available here.

  2. The students were tasked with using a more complicated model that is generally better for survival analysis but too complicated for exposition here. The assignment was based on expanding this paper: Stander, J., Dalla Valle, L., and Cortina-Borja, M. (2018). A Bayesian Survival Analysis of a Historical Dataset: How Long Do Popes Live? The American Statistician 72(4):368-375.

  3. Of course, if you are a Bayesian, that combination is trivial.

Avatar
Graham Tierney
Statistical Science Ph.D. Student