
Outlier Detection in Survival Analysis

Joao Diogo Pinto


joao.pinto@tecnico.ulisboa.pt

Keywords: Survival analysis, outlier detection, robust regression, Cox proportional hazards, concordance c-index

Abstract: Outlier detection is an important task in many data-mining applications. In this paper, we present two parametric outlier detection methods for survival data. Both methods perform outlier detection in a multivariate setting, using the Cox regression as the model and the concordance c-index as a measure of goodness of fit. The first method is a single-step procedure that presents a delete-1 statistic based on bootstrap hypothesis testing for the increase in the concordance c-index. The second method is a sequential procedure that maximizes the c-index of the model using a greedy one-step-ahead search. Finally, we use both methods to perform robust estimation for the Cox regression, removing from the regression a fraction of the data according to their measure of outlyingness. Our preliminary results on three different datasets show improved estimation of the Cox regression coefficients and improved model predictive ability.

1 INTRODUCTION

Survival analysis is the field that studies time-to-event data and has become a relevant topic in clinical and medical research. There are usually three main goals when performing survival analysis (Kleinbaum and Klein, 2005): 1) to estimate survival/hazard functions from the data; 2) to compare survival/hazard functions between groups of patients; and 3) to assess the impact of explanatory variables on patients' survival time. Goals 1) and 2) are addressed with non-parametric methods such as the Kaplan-Meier and Nelson-Aalen estimators, which estimate survival curves, and log-rank tests, which are commonly used to compare survival curves. All these methods have good robustness to the presence of outlying observations. When modeling the data in relation to explanatory variables, the most popular method is the Cox proportional hazards model (Cox, 1972). The robustness of the Cox regression has been shown to be rather weak, with outlying observations severely affecting the Cox regression coefficients. Concerning robustness, one important concept is the breakdown point (Donoho and Huber, 1983; Hampel, 1971), which represents the fraction of corrupt observations needed to arbitrarily offset the estimates. It has been pointed out that the Cox partial likelihood estimator has a breakdown point of $1/n$ (Kalbfleisch and Prentice, 2011). This means that, when fitting a Cox regression to $n$ data points, a single outlying observation is enough to cause the estimator to take values arbitrarily far from their true value (Rousseeuw and Leroy, 1987).

Goal 3) will be the focus of this study. In particular, our goal is to improve the Cox regression estimation by identifying and removing outlying observations. This way the regression becomes more robust, thus providing more accurate relationships between explanatory variables and survival times, along with improving the global model predictive ability.

2 OUTLIERS IN SURVIVAL DATA

To fix notation, a dataset will be denoted by $X_1, X_2, \ldots, X_n$ and $Y_1, Y_2, \ldots, Y_n$, with each $X_i$ being a p-dimensional vector of covariates and $Y_i$ the corresponding dependent variable value. In survival data, censoring is very common, i.e., the event of interest does not always occur for a given individual during the period of the study. To model censoring, it is common to add a binary variable which indicates whether the event occurred or not.

There are many definitions of an outlier in the literature, both mathematical and more informal, as can be seen more thoroughly in (Ben-Gal, 2005). For example, (Hawkins, 1980) defines an outlier as an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism, while (Johnson and Wichern, 1992) define an outlier as an observation in a dataset which appears to be inconsistent with the remainder of that set of data. These definitions provide two different
ways of detecting outliers: the first one considers only the values of $X_i$ and $Y_i$, while the second assesses the relation between them by introducing the notion of the model's quality of fit. Of course, the second notion of outlyingness needs a model to define this relationship between $Y_i$ and $X_i$. The first perspective corresponds to a non-parametric approach to outlier detection; the second corresponds to a parametric or model-based perspective and will be the focus of our proposal.

2.1 Swamping and Masking

Data sets with multiple outliers or clusters of outliers are subject to masking and swamping effects. Here we state the following definitions (Acuna and Rodriguez, 2004):

Masking Effect: One outlier masks another outlier if the second outlier can be considered an outlier only by itself but not in the presence of the first outlier.

Swamping Effect: One outlier observation swamps a second observation if the latter can be considered an outlier in the presence of the first but not by itself.

As seen in (Fischler and Bolles, 1981), these effects are particularly harmful when developing sequential procedures for outlier detection, mainly because the subset of observations already deleted influences which observations will be deleted in the subsequent iterations.

2.2 Model-Specific Outlier Detection: Cox Proportional Hazards

In this paper the model chosen to represent the data is the Cox proportional hazards model, due to its simplicity, good results and great power of interpretability.

Several works have been developed to increase the robustness of the estimation of the Cox regression by performing outlier detection, for example through residual analysis, estimating the variation in regression parameters with the removal of a given observation (Therneau et al., 1990). The outliers can then be detected by selecting the observations whose removal causes the largest variation in the parameters. This approach is susceptible to masking and swamping and also requires tuning the threshold that separates outliers from non-outliers.

In (Farcomeni and Viviani, 2011) outlying observations are defined as the individuals that have the smallest contributions to the Cox partial likelihood. In order to find these observations they first make a robust fitting of the Cox regression and then, in the absence of masking, they employ residual analysis as in (Nardi and Schemper, 1999) to perform outlier detection. The robust fitting is done using an algorithm that maximizes the maximum partial likelihood. This maximization is made over all possible subsets of the trimmed set of observations.

2.3 Concordance c-index

To assess the predictive ability of a survival model, we will use Harrell's concordance c-index (Harrell et al., 1982). It measures the ability of the model to predict a higher relative risk for an individual whose event occurs first. The relative risk is estimated from the output of the model for each individual; in a Cox model, for instance, the relative risk corresponds to the hazard ratio. The c-index is calculated using the following procedure:

1. Form all possible pairs of individuals.

2. Omit the pairs whose shorter survival time is censored and all pairs where both observations are censored. The remaining pairs are the permissible pairs, and $N_{permissible}$ denotes their number.

3. To calculate Concordance, for each permissible pair with $T_i \neq T_j$: count 1 if the shorter survival time has the higher predicted risk and 0.5 if the predicted risks are tied (discordant pairs count 0). For $T_i = T_j$ with both observations uncensored: count 1 if the predicted risks are the same, 0.5 otherwise. For $T_i = T_j$ with one observation censored: count 1 if the censored observation has the lower predicted risk, 0.5 otherwise. Concordance is defined as the sum of the counts over all permissible pairs.

4. The c-index is given by c-index = Concordance / $N_{permissible}$.

The c-index is a rank measure, thus it only measures how well predicted values are concordant with rank-ordered response variables. For example, the c-index for two patients with predicted hazard ratios of 0.4 and 0.6 is the same as if the patients had hazard ratios of 0.1 and 0.9 (Harrell, 2001); it only measures whether the outcome is concordant with the response variables or not. Thus, unlike measures such as the sum of squared errors, one observation by itself has a limited contribution to the overall concordance. This robustness may allow for the maximization of the c-index without worrying whether it is being maximized at the cost of the majority of the data, only to better fit one or a cluster of outlying observations, as can happen with the sum of squared errors (Fischler and Bolles, 1981).
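The pair-counting procedure of Section 2.3 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `concordance_index` and its argument layout are hypothetical, `events[i]` is assumed to be 1 when the event was observed and 0 when censored, and the sketch follows Harrell's usual convention in which, for distinct times, tied predicted risks count 0.5 and discordant pairs count 0.

```python
from itertools import combinations


def concordance_index(times, events, risks):
    """Harrell-style c-index from survival times, event flags and predicted risks.

    events[i] is truthy when the event was observed and falsy when censored.
    (Function name and argument layout are illustrative assumptions.)
    """
    concordance = 0.0
    n_permissible = 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] != times[j]:
            # 'short' is the observation with the shorter survival time.
            short, long_ = (i, j) if times[i] < times[j] else (j, i)
            if not events[short]:
                continue  # shorter time is censored: pair is not permissible
            n_permissible += 1
            if risks[short] > risks[long_]:
                concordance += 1.0   # concordant pair
            elif risks[short] == risks[long_]:
                concordance += 0.5   # tied predicted risks
        else:
            if not events[i] and not events[j]:
                continue  # both censored: pair is not permissible
            n_permissible += 1
            if events[i] and events[j]:
                # Tied times, both events observed.
                concordance += 1.0 if risks[i] == risks[j] else 0.5
            else:
                # Tied times, exactly one observation censored.
                died, cens = (i, j) if events[i] else (j, i)
                concordance += 1.0 if risks[died] > risks[cens] else 0.5
    if n_permissible == 0:
        raise ValueError("no permissible pairs")
    return concordance / n_permissible


# Rank-only behaviour: hazard ratios (0.6, 0.4) and (0.9, 0.1) order the two
# patients identically, so both predictions yield the same c-index.
c_close = concordance_index([5, 10], [1, 1], [0.6, 0.4])
c_far = concordance_index([5, 10], [1, 1], [0.9, 0.1])
```

Because only the ordering of the predicted risks enters the count, any model that outputs a risk score can be plugged in, which is what allows the c-index to treat the survival model as a black box.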
3 METHODS FOR OUTLIER DETECTION

We propose three novel methods for outlier detection in survival data based on the concordance index, described in Sections 3.1 to 3.3. Section 3.4 describes alternative methods that will be used for comparison purposes.

The proposed methods make use of an operational definition of outlier: an observation that, when absent from the data, will likely decrease the prediction error of the fitted model. In a survival setting, this prediction error is measured with the concordance c-index, which has the particularity of using the predictive model as a black box.

3.1 Bootstrap Hypothesis Testing (BHT)

Ideally we would know the underlying distribution of the observations $(X_i, Y_i)$ and perform a hypothesis test about the difference in concordance between the two distributions (with and without the observation under test). Thus the idea is to perform $n$ hypothesis tests about the concordance variation, one for each observation $i$, and to sort the resulting p-values.

The hypothesis tests follow the bootstrap approach (Efron, 1979). Each observation $(X_i, Y_i)$ is considered a discrete random variable whose distribution is the empirical distribution given by the original dataset. We consider $n$ different empirical distributions, each obtained by removing one observation $i$ from the original data and renormalizing the probabilities so that they sum to one. Denoting by $C$ the concordance c-index and by $C_{original}$ the concordance in the original data, the distributions $Data_i$ are the adjusted empirical distributions with $P(X = X_i, Y = Y_i) = 0$. The hypothesis test for each observation is formulated as follows:

$$H_0 : C_{Model,(X,Y) \sim Data_i} \leq C_{original}$$
$$H_1 : C_{Model,(X,Y) \sim Data_i} > C_{original}$$

Writing $C_i = C_{Model,(X,Y) \sim Data_i}$ and $\Delta C_i = C_i - C_{original}$, it is more useful to reformulate the hypothesis tests as:

$$H_0 : \Delta C_i \leq 0$$
$$H_1 : \Delta C_i > 0$$

Rejecting the null hypothesis at a significance level $\alpha$ corresponds to estimating a confidence interval for the values of $\Delta C$ for each distribution $Data_i$: if this interval does not contain values less than or equal to zero, we can reject the null hypothesis at the significance level $\alpha$. Alternatively, we can calculate the p-value of the test.

These confidence intervals are computed using Monte Carlo bootstrap as explained in (Harrell, 2001). For each observation $i$ the procedure is the following: 1) produce $B$ bootstrap samples by sampling with replacement $n - 1$ observations from the empirical distribution $Data_i$; 2) compute the concordance for each bootstrap sample; 3) the p-value corresponds to the proportion of bootstrap samples having $C_i - C_{original} \leq 0$.

The number of bootstrap samples $B$ needed has been shown to depend on the number of individuals and the number of covariates. In our tests the value of $B$ was iteratively increased until the p-values converged.

Following the same reasoning provided in (Singh and Xie, 2003), given an outlying observation, the probability that a bootstrap sample does not contain it is approximately $(1 - \frac{1}{n})^n \to \frac{1}{e}$ ($\approx 37\%$) as $n \to \infty$. Thus, each observation will be absent from approximately 37% of the samples. A low p-value for the hypothesis test mentioned above means that the removal of observation $i$ improves the concordance c-index in a systematic way, not depending on the cooperation of any other observation. On the other hand, if one outlier is masked by another, the masking outlier will be absent from approximately 37% of the bootstrap samples, and thus we can expect a multimodal behavior for $\Delta C$. An outlier subject to masking may therefore not systematically improve concordance (it presents a high p-value for the hypothesis test), but if it presents multimodality and one of the modes is relatively high, it is a candidate for an outlier.

To sum up, Bootstrap Hypothesis Testing (BHT) on $\Delta C$ works as follows: for each observation, a bootstrap hypothesis test is performed. The resulting statistics for each observation are a p-value and the expected value of $\Delta C$. The p-value gives us the confidence level to reject the hypothesis that the removal of the observation causes no increase in the c-index. Experimentally we verified that these two values are correlated: when the p-value is low, the expected $\Delta C$ is usually very high, while the opposite relation has been shown to be weaker. So, in order to obtain a one-dimensional metric for outlyingness, we consider the observations with the lowest p-values to be the most outlying ones.

3.2 Dual Bootstrap Hypothesis Test (DBHT)

This method aims to improve the approach taken in the BHT method. In the BHT method, removing one observation from the dataset and then assessing the impact of each removal on concordance has an undesired effect. Since the model has fewer observations to fit (the observation under test is removed), there is a tendency for the concordance to increase. This potentially introduces confusion in the hypothesis test made in BHT; in particular, it may increase the number of false positives. The rationale behind DBHT is to generate two histograms from two antagonistic versions of the bootstrap procedure, the poison and antidote bootstraps, and then compare them. The antidote bootstrap excludes the observation under test from every bootstrap sample, whereas the poison bootstrap forces the observation being tested into every bootstrap sample. We consider $\Delta C_{antidote}$ and $\Delta C_{poison}$ as two real random variables corresponding to the concordance variation in relation to the original dataset for an antidote bootstrap sample and a poison bootstrap sample. Then we perform the following hypothesis test:

$$H_0 : E[\Delta C_{antidote}] > E[\Delta C_{poison}]$$
$$H_1 : E[\Delta C_{antidote}] \leq E[\Delta C_{poison}]$$

Again, we expect that if observation $i$ is an outlier, the concordance of the samples tends to increase when it is removed. Similar to the BHT method, besides a survival model and the input dataset $D$, DBHT only takes one input parameter: $B$, the number of bootstrap samples used in the antidote and poison bootstrap procedures. The DBHT method is a soft-classifier and single-step method. Thus the output is an outlyingness score for each observation, from which one can extract the $k$ most outlying observations.

3.3 One-step deletion (OSD)

This method is a sequential procedure for outlier removal. We start with all data and, at each iteration of the algorithm, the observation that, when excluded, causes the largest increase in concordance is removed. The set of removed observations is interpreted as containing the most outlying observations. This method is equivalent to a one-step-ahead greedy search for maximizing the c-index of the model on the data.

3.4 Alternative methods

Here we present alternative methods for outlier detection in survival data that will be used to assess the performance of the proposed methods.

3.4.1 Martingale residuals (MART)

These residuals originate from the counting process framework for censored survival data. First, a martingale process is defined as the difference between the observed and expected number of events (Hosmer et al., 2008). Letting $N(t)$ be the number of events until $t$ and $H(t)$ the cumulative hazard function, we have for each individual $i$ the martingale residual process:

$$M_i(t) = N_i(t) - H_i(t). \tag{1}$$

The martingale residual is defined as the value of the process $M_i(t)$ at the time of failure/censoring. Since $N(t)$ takes the value 1 if the event is observed and 0 when censored (Collett, 2003), the residuals are given by:

$$r_{M_i} = \delta_i - H_i(t), \tag{2}$$

where $\delta_i$ is the censoring indicator for individual $i$. For the Cox model the residuals are given by:

$$r_{M_i} = \delta_i - \exp\{\beta^{\top} X_i\} H_0(t). \tag{3}$$

3.4.2 Deviance residuals (DEV)

The deviance residuals are an attempt (Collett, 2003) to adjust the martingale residuals to be more centered around zero, and are given by:

$$r_{D_i} = \mathrm{sgn}(r_{M_i}) \left[ -2 \{ r_{M_i} + \delta_i \log(\delta_i - r_{M_i}) \} \right]^{1/2}. \tag{4}$$

3.4.3 Likelihood displacement statistic (LD)

Let $\hat{\beta}$ be the value of $\beta$ that maximizes the partial Cox likelihood and $\hat{\beta}_{(i)}$ the estimate when observation $i$ is eliminated from the fitting. The likelihood displacement (Cook, 1977) statistic (LD) is given by:

$$LD_i = 2 \log L(\hat{\beta}) - 2 \log L(\hat{\beta}_{(i)}). \tag{5}$$

Under the null hypothesis $\hat{\beta}_{(i)} = \hat{\beta}$ the LD statistic follows a chi-square distribution with one degree of freedom. Therefore we calculate the p-value of this test for all observations; the ones with the lowest p-values are considered the most outlying.

4 DATASETS

4.1 Simulation Data (SIM)

Similarly to the simulation data in (Farcomeni and Viviani, 2011), we generate datasets having the Cox proportional hazards model as the underlying probabilistic model. Our goal is to recreate a realistic setting, with survival times and covariates as similar as possible to real datasets. In order to approximate these conditions, each simulated dataset will have a pure model $\beta_G$ that translates a general trend of the observations, and two other Cox models with different parameter values. Each dataset consists of 100 observations having covariates $X_1, X_2, X_3$. These follow a 3-D normal
distribution with zero mean and covariance matrix $\Sigma$, which is equal to the identity matrix $I$. For the survival times, the probabilistic model for the hazard of each individual follows one of two possible models: the pure model $\beta_G$ and one outlier model $\beta'$. Having $k < n$ outliers, the hazard function for each observation $i$ is generated by:

$$h_i(t) = \begin{cases} h_0(t) \exp\{\beta_G^{\top} X_i\}, & 1 \leq i \leq n - k \\ h_0(t) \exp\{{\beta'}^{\top} X_i\}, & n - k < i \leq n \end{cases} \tag{6}$$

Three baseline hazards $h_0(t)$ are used, given by a Weibull with the parameter pairs (1, 1), (1.5, 0.5), and (0.5, 1.5), corresponding respectively to constant, strictly decreasing, and strictly increasing hazard functions. The value of $k$ is set in order to have 10% of outliers. The cumulative hazard function $H_i(t)$ is then obtained as:

$$H_i(t) = \int_0^t h_i(u) \, du. \tag{7}$$

From each $H_i(t)$ we further calculate the corresponding survival curve by $S_i(t) = e^{-H_i(t)}$. Having this distribution, we generate 100 survival times according to the distribution given by $S_i(t)$ and generate a censoring vector $c_1, \ldots, c_{100}$ following a Bernoulli distribution with probability $p$, corresponding to the proportion of censored observations, typically a value around 0.2:

$$t_i \sim 1 - S_i(t), \qquad c_i \sim \mathrm{Bernoulli}(p). \tag{8}$$

4.2 Clinical Data

In order to test the procedures in a more realistic setting, we have further applied the methods to real clinical data, focusing on two studies:

WHAS: Dataset from the Worcester Heart Attack Study, with 100 individuals, each with 5 covariates. This data concerns the survival times of patients having their first heart attack. Data publicly available at https://www.umass.edu/statdata/statdata/data/.

BMT: Bone Marrow Transplant Data (Klein and Moeschberger, 1997): contains data about 137 leukemia patients, each with 10 covariates. The data concerns the survival time after the bone marrow transplant. It is publicly available in the R (R Development Core Team, 2006) package KMsurv.

5 RESULTS AND DISCUSSION

In this section we assess the performance of the proposed outlier detection methods BHT, DBHT and OSD, and we compare their results with MART, DEV and LD. We start by presenting the configuration of our simulation study for outlier detection. Then we apply all methods to two real datasets, performing outlier detection. We further use the detected outliers to perform a robust Cox regression by removing them from the data; the coefficients and p-values of the regressions are then compared.

5.1 SIM dataset

The outlier detection methods are applied to simulated datasets generated using the methodology described in Section 4.1. In order to test the outlier detection methods in a variety of conditions for the outlying models and for the general model, we fix the general trend model $\beta_G = (1, 1, 1)$ and use a set of configurations for the source of outlying observations. Each parameter for the outlier source is given by a three-dimensional normal distribution with a diagonal covariance matrix; the different values for the outlier generator model are presented in Table 1. Although the outlying values for the parameters may seem close to the general trend model, it is worth noting that the Cox model defines the hazards as an exponential function of $X$; thus the ratio between the hazard of an outlying and a general trend observation is given by $\exp\{{\beta'}^{\top} X - \beta_G^{\top} X\}$. The reason behind the choice of this set of scenarios is to have a variety of combinations with different norms and contrasting parameters.

Table 1: The different outlier configurations used in the simulation data. The pure model is $\beta_G = (1, 1, 1)$.

#    Angle between β' and β_G (degrees)   ||β'|| / ||β_G||   β'
1    180                                  1                  (-1, -1, -1)
2    180                                  0.2                (-0.2, -0.2, -0.2)
3    180                                  5                  (-5, -5, -5)
4    135                                  0.2                (-0.143, 0, -0.283)
5    135                                  5                  (-3.6, 0, -7.07)
6    90                                   0.2                (-0.245, 0, -0.245)
7    90                                   5                  (6.12, 0, -6.12)
8    0                                    0.2                (0.2, 0.2, 0.2)
9    0                                    5                  (5, 5, 5)
10   180                                  10                 (-10, -10, -10)
11   0                                    10                 (10, 10, 10)
12   135                                  10                 (-7.15, 0, -14.15)

Table 2 reports the true positive rate (TPR) among the 10 most outlying observations according to each method.

By inspecting Table 2 we see that DBHT attains the overall best performance, matching or exceeding the other methods in 9 out of 12 of the scenarios. For
scenarios 9 and 11, the residual-based methods MART and DEV have shown the best performance. Table 3 reports the AUC of the methods in each scenario, with the exception of OSD: as it does not output a score for every observation, the calculation of AUC is not applicable. By inspecting Table 3 we can again verify that DBHT outperforms the other methods in the same 9 scenarios. It is worth noting that the performance margin of DBHT in terms of AUC is considerably larger than in terms of top-10 TPR.

Table 2: Average of top-k TPR grouped by outlier scenarios.

Scenario #   MART   DEV    LD     DFB    OSD    BHT    DBHT
1            0.29   0.36   0.43   0.36   0.47   0.43   0.47
2            0.22   0.25   0.31   0.29   0.32   0.31   0.34
3            0.50   0.58   0.59   0.52   0.63   0.59   0.65
4            0.22   0.23   0.30   0.28   0.30   0.29   0.32
5            0.44   0.54   0.52   0.48   0.58   0.53   0.58
6            0.21   0.22   0.28   0.26   0.27   0.26   0.28
7            0.40   0.50   0.40   0.41   0.44   0.37   0.42
8            0.18   0.18   0.23   0.22   0.22   0.20   0.23
9            0.32   0.36   0.18   0.25   0.09   0.06   0.07
10           0.53   0.63   0.64   0.57   0.68   0.60   0.70
11           0.38   0.46   0.24   0.32   0.14   0.11   0.12
12           0.49   0.60   0.54   0.51   0.60   0.52   0.60

Table 3: Average of AUC grouped by outlier scenarios.

Scenario #   MART   DEV    LD     DFB    BHT    DBHT
1            0.70   0.70   0.74   0.68   0.78   0.82
2            0.65   0.65   0.70   0.64   0.71   0.75
3            0.80   0.80   0.78   0.77   0.86   0.90
4            0.64   0.64   0.69   0.63   0.71   0.73
5            0.78   0.77   0.74   0.75   0.82   0.84
6            0.63   0.63   0.67   0.63   0.68   0.71
7            0.76   0.76   0.66   0.73   0.70   0.72
8            0.62   0.62   0.66   0.62   0.65   0.68
9            0.74   0.72   0.61   0.69   0.60   0.60
10           0.83   0.83   0.80   0.81   0.87   0.92
11           0.78   0.76   0.61   0.73   0.59   0.61
12           0.80   0.80   0.74   0.78   0.81   0.86

5.2 WHAS Dataset

The outliers detected by the methods in the WHAS dataset are presented in Table 4. The selection is based on the ten lowest p-values. It is worth noting the consensus among the methods on which observations they consider the most outlying; for example, all the methods identified observations 1 and 56 among their top-10 outliers.

Table 4: Top 10% outliers detected by the methods in the WHAS dataset.

Nb.   MART   DEV   LD   BHT   OSD   DBHT
1     93     1     97   67    1     1
2     51     31    67   1     67    67
3     90     56    1    78    97    97
4     33     85    52   56    51    56
5     11     97    23   69    23    23
6     27     93    7    8     31    90
7     40     30    57   45    93    93
8     1      78    78   93    52    8
9     31     51    56   30    56    78
10    56     90    17   32    57    51

The estimates for the regression coefficients when fitting the Cox model to all observations are given in Table 5.

Table 5: Cox model fitted to the WHAS dataset.

          β        p-value
los      -0.022    0.3972
age       0.039    0.0025
gender    0.157    0.6066
bmi      -0.071    0.0497

We observe that only two covariates are statistically significant, corresponding to the age at the first heart attack (age) and the body mass index (bmi). After trimming the 10% most outlying observations according to each method (Table 4), new models were obtained (Table 6 and Table 7). The goal is to unveil a trend model, unaffected by outlying observations. The results show that for the BHT and DBHT methods the length of stay (los) after the first heart attack appeared as significant, which did not occur for the other methods. These results show that the proposed methods can potentially unveil covariates that were not considered useful when fitting the model on all observations. The fact that los emerged as a significant covariate in the Cox regression calls for a closer analysis of this measure. There are several studies that relate the length of hospital stay with patient readmission. The association between los and the quality of hospital care has also been studied: (Thomas et al., 1996) found, with data for 12 different conditions, that a longer los, risk-adjusted for other covariates, is associated with poorer hospital care. In our case we have a negative coefficient, meaning that the hazard function decreases with a longer length of stay; thus this might also be a potential indicator that the hospital has a good quality of care.

Table 6: Cox model fit on the WHAS dataset with 10% outlier trimming using the proposed methods.

          OSD               BHT               DBHT
X_i       β_i      p        β_i      p        β_i      p
los      -0.025    0.374   -0.166    0.006   -0.129    0.017
age       0.068    0.000    0.0450   0.000    0.0590   0.000
gender    0.042    0.898    0.003    0.992    0.260    0.425
bmi      -0.133    0.002   -0.162    0.0012  -0.162    0.000

Table 7: Cox estimates removing the top 10% outlier observations in the WHAS dataset for methods MART, DEV and LD.

          MART               DEV                LD
          β        p-value   β        p-value   β        p-value
los      -0.016    0.498    -0.015    0.550    -0.016    0.506
age       0.045    0.001     0.032    0.012     0.069    0.000
gender   -0.082    0.800     0.155    0.653    -0.230    0.483
bmi      -0.082    0.029    -0.037    0.030    -0.146    0.001

5.3 BMT Dataset

The outliers detected by the methods in the BMT dataset are presented in Table 8. The selection is based, again, on the 10% lowest p-values. For BHT, a number of bootstrap samples $B = 2000$ has been shown to be sufficient for convergence. The estimates for the regression coefficients when fitting the Cox model to all observations are given in Table 9.

Table 8: Top 10% outliers detected by the methods in the BMT dataset.

Nb.   MART   DEV   LD    BHT   OSD   DBHT
1     65     129   129   129   129   129
2     103    35    132   103   132   132
3     99     108   89    99    30    99
4     97     65    90    65    130   65
5     13     132   26    30    26    130
6     42     87    30    132   28    103
7     63     84    28    13    65    30
8     40     103   130   130   13    89
9     92     30    17    16    103   13
10    14     99    105   136   14    28
11    43     97    136   15    72    14
12    39     28    116   26    89    105
13    49     109   72    97    50    116

Table 9: Cox model fitted to all BMT data.

             β         p-value
Age Diagn   -0.0017    0.9357
Donor Age    0.0316    0.1072
Sex         -0.2738    0.2651
Donor Sex    0.0409    0.8662
CMV         -0.1701    0.4922
Donor CMV    0.0038    0.9875
Wait Time   -0.0001    0.8701
FAB          0.7917    0.0012
Hospital    -0.5570    0.0004
MTX          1.0062    0.0026

After removing the 10% of the observations indicated in Table 8 for each of the methods, new models are obtained (Table 10 and Table 11).

Table 10: Cox model estimations with 10% outlier trimming using the proposed methods.

             OSD                 BHT                 DBHT
X_i          β_i       p         β_i       p         β_i       p
Age Diagn    0.0278    0.1945   -0.0174    0.4183    0.0226    0.2863
Donor Age    0.0124    0.5390    0.0332    0.0972    0.0288    0.1394
Sex         -0.4453    0.0873   -0.4120    0.1148   -0.4235    0.1124
Donor Sex    0.2619    0.3427    0.0760    0.7804    0.1546    0.5717
CMV         -0.6221    0.0263   -0.5407    0.0466   -0.5840    0.0378
Donor CMV    0.0597    0.8164   -0.0239    0.9256    0.1857    0.4689
Wait Time   -0.0002    0.7155    0.0001    0.6230    0.0002    0.5920
FAB          1.2761    0.0000    1.2596    0.0000    1.2792    0.0000
Hospital    -1.3875    0.0000   -0.9910    0.0000   -1.5141    0.0000
MTX          3.0082    0.0000    2.1267    0.0000    3.0513    0.0000

Table 11: Cox estimates removing the top 10% outlier observations in the BMT dataset for methods MART, DEV and LD.

             MART               DEV                LD
             β        p-value   β        p-value   β        p-value
Age Diagn   -0.009    0.640     0.029    0.181     0.006    0.777
Donor Age    0.027    0.149     0.024    0.243     0.050    0.027
Sex         -0.443    0.078    -0.624    0.021    -0.325    0.235
Donor Sex    0.053    0.833     0.257    0.345     0.361    0.195
CMV         -0.356    0.178    -0.460    0.094    -0.395    0.148
Donor CMV   -0.432    0.867     0.075    0.771     0.032    0.910
Wait Time   -0.000    0.866     0.000    0.321    -0.000    0.586
FAB          1.170    0.000     1.286    0.000     1.058    0.000
Hospital    -0.693    0.000    -0.794    0.000    -1.442    0.000
MTX          1.813    0.000     1.495    0.000     2.350    0.000

When using all the data, the statistically significant covariates are FAB, Hospital and MTX (Table 9). When the top 10% outlier observations were removed, the results were very similar between the proposed methods, as all three severely reduced the p-value of the variable CMV. This possibly reveals that the variable CMV is much more significant to the model than first expected using the complete dataset. The covariate CMV represents the cytomegalovirus immune status (positive or negative) and therefore might be a relevant feature to predict survival. It is noteworthy that the other methods did not retrieve this variable as significant.

In all these experiments, the choice of the outlier percentage threshold has obvious implications for the obtained Cox regression coefficients, and a more detailed analysis is warranted of the tradeoff between keeping and removing observations.

5.4 Leave-one-out cross-validation of the c-index

To assess the predictive ability of the model when facing new observations, we perform leave-one-out cross-validation of the c-index. In these results the outliers also take part in the test sets, but they are never present in the training set used to fit the models; thus this measure also takes into account the prediction performance of the model on outlying observations. The leave-one-out values obtained for the two real datasets are presented for each method in Tables 12, 13, and 14.

The results are very positive for the three methods, with the concordance showing a steady increase as the most outlying observations of each dataset are removed.

Table 12: Leave-one-out c-indexes for the BHT method.

Dataset   All data   top-3    top-10   top-30
WHAS      0.6607     0.6813   0.6824   0.6900
BMT       0.6208     0.6314   0.6441   0.6668
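The leave-one-out scheme of Section 5.4, in which flagged outliers are scored as test points but never used for fitting, could be sketched as follows. This is an illustration under several assumptions rather than the procedure actually used for the tables: the text does not say how held-out predictions are combined, so the sketch pools one held-out risk score per observation into a single c-index (which presumes that scores from different fits are comparable). The names `loo_c_index`, `fit` and `fit_toy_model` are hypothetical, and the toy one-covariate "model" merely stands in for a Cox fit so that the example stays self-contained.

```python
from itertools import combinations


def c_index(times, events, risks):
    """Harrell-style c-index; this simplified helper assumes distinct times."""
    concordant, permissible = 0.0, 0
    for i, j in combinations(range(len(times)), 2):
        first, second = (i, j) if times[i] < times[j] else (j, i)
        if not events[first]:
            continue  # shorter time is censored: pair is not permissible
        permissible += 1
        if risks[first] > risks[second]:
            concordant += 1.0
        elif risks[first] == risks[second]:
            concordant += 0.5
    return concordant / permissible


def loo_c_index(times, events, xs, fit, outliers=frozenset()):
    """Leave-one-out c-index; indices in `outliers` are tested but never trained on."""
    n = len(times)
    held_out_risks = []
    for i in range(n):
        # Training set: everything except the test point and the flagged outliers.
        train = [k for k in range(n) if k != i and k not in outliers]
        predict = fit([times[k] for k in train],
                      [events[k] for k in train],
                      [xs[k] for k in train])
        held_out_risks.append(predict(xs[i]))
    return c_index(times, events, held_out_risks)


def fit_toy_model(times, events, xs):
    """Hypothetical stand-in for a Cox fit (ignores censoring): risk = +x or -x.

    The sign is chosen so that risk decreases with survival time in the
    training data.
    """
    mean_t = sum(times) / len(times)
    mean_x = sum(xs) / len(xs)
    cov = sum((t - mean_t) * (x - mean_x) for t, x in zip(times, xs))
    sign = -1.0 if cov > 0 else 1.0
    return lambda x: sign * x


# Six observations follow the trend "larger x, shorter time"; the last one
# (index 6) contradicts it and plays the role of a detected outlier.
times = [1, 2, 3, 4, 5, 6, 7]
events = [1, 1, 1, 1, 1, 1, 1]
xs = [6, 5, 4, 3, 2, 1, 10]

c_all = loo_c_index(times, events, xs, fit_toy_model)
c_trimmed = loo_c_index(times, events, xs, fit_toy_model, outliers={6})
```

In this toy data the single contradicting observation flips the fitted direction whenever it is in the training set, so `c_trimmed` exceeds `c_all` even though the outlier is still scored as a test point in both cases.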
Table 13: Leave-one-out c-indexes for the DBHT procedure.

Dataset   All data   top-3    top-10   top-30
WHAS      0.6607     0.6710   0.6807   0.6910
BMT       0.6208     0.6288   0.6462   0.6630

Table 14: Leave-one-out c-indexes for the OSD procedure.

Dataset   All data   top-3    top-10   top-30
WHAS      0.6607     0.6832   0.6853   0.6986
BMT       0.6208     0.6314   0.6441   0.6629

6 CONCLUSION

We proposed methods for outlier detection in a survival setting. On simulation data, the proposed methods outperformed the alternative methods in most scenarios, DBHT in particular. The three proposed methods improve the cross-validated performance of the Cox regression. Overall, DBHT has shown promising results in terms of p-value improvement of the regression coefficients. Finally, we used the proposed methods to perform robust estimation for the Cox regression, trimming a fraction of the observations according to their measure of outlyingness. Our preliminary results on three different datasets show improved estimation of the Cox regression coefficients and improved model predictive ability.

7 ACKNOWLEDGMENTS

This work was supported by national funds through Fundacao para a Ciencia e Tecnologia (FCT, Portugal) under contracts LAETA Pest-OE/EME/LA0022 and IT (PEst-OE/EEI/LA0008/2013), as well as project CancerSys (EXPL/EMS-SIS/1954/2013).

REFERENCES

Acuna, E. and Rodriguez, C. (2004). A meta analysis study of outlier detection methods in classification. Technical paper, Department of Mathematics, University of Puerto Rico at Mayaguez.

Ben-Gal, I. (2005). Outlier detection. In Data Mining and Knowledge Discovery Handbook, pages 131-146. Springer.

Collett, D. (2003). Modelling survival data in medical research. Chapman & Hall/CRC, Boca Raton, FL.

Cook, R. D. (1977). Detection of influential observation in linear regression. Technometrics, pages 15-18.

Cox, D. R. (1972). Regression Models and Life Tables. Journal of the Royal Statistical Society, B(34):187-202.

Donoho, D. L. and Huber, P. J. (1983). The notion of breakdown point. A Festschrift for Erich L. Lehmann, pages 157-184.

Efron, B. (1979). Bootstrap methods: another look at the jackknife. The Annals of Statistics, pages 1-26.

Farcomeni, A. and Viviani, S. (2011). Robust estimation for the Cox regression model based on trimming. Biometrical Journal, 53(6):956-973.

Fischler, M. and Bolles, R. (1981). Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography. Communications of the ACM.

Hampel, F. R. (1971). A general qualitative definition of robustness. The Annals of Mathematical Statistics, pages 1887-1896.

Harrell, F. E. (2001). Regression modeling strategies: with applications to linear models, logistic regression, and survival analysis. Springer.

Harrell, F. E., Califf, R. M., Pryor, D. B., Lee, K. L., and Rosati, R. A. (1982). Evaluating the yield of medical tests. JAMA, 247(18):2543-2546.

Hawkins, D. M. (1980). Identification of outliers, volume 11. Springer.

Hosmer, D. W., Lemeshow, S., and May, S. (2008). Applied survival analysis: regression modeling of time-to-event data. Wiley-Interscience, Hoboken, NJ.

Johnson, R. A. and Wichern, D. W. (1992). Applied multivariate statistical analysis, volume 4. Prentice Hall, Englewood Cliffs, NJ.

Kalbfleisch, J. D. and Prentice, R. L. (2011). The statistical analysis of failure time data, volume 360. John Wiley & Sons.

Klein, J. and Moeschberger, M. (1997). Survival analysis: techniques for censored and truncated regression.

Kleinbaum, D. G. and Klein, M. (2005). Survival analysis: a self-learning text. Springer, New York, NY.

Nardi, A. and Schemper, M. (1999). New residuals for Cox regression and their application to outlier screening. Biometrics, 55(2):523-529.

R Development Core Team (2006). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0.

Rousseeuw, P. and Leroy, A. (1987). Robust regression and outlier detection. Wiley Series in Probability and Mathematical Statistics. Wiley, New York.

Singh, K. and Xie, M. (2003). Bootlier-Plot: Bootstrap Based Outlier Detection Plot. Sankhya: The Indian Journal of Statistics (2003-2007), 65(3):532-559.

Therneau, T. M., Grambsch, P. M., and Fleming, T. R. (1990). Martingale-based residuals for survival models. Biometrika, 77(1):147-160.

Thomas, J. W., Guire, K. E., and Horvat, G. G. (1996). Is patient length of stay related to quality of care? Hospital & Health Services Administration, 42(4):489-507.
