BDgraph
R package and are called churn
.The values of a survival function \(\hat{S}(t)\) can be estimated from: \[\hat{S}(t_j) = \prod_{k=1}^j \left( 1 - \frac{d_k}{n_k} \right)\]
Where,The calculations can be done using a specified R package (e.g. survival
) or manually.
par(mfrow=c(1, 2))
## Manual Approach
t <- d$Account.Length # time to event
ev <- d$ev # event
es <- ev[order(t)]
ts <- t[order(t)]
di <- table(ts[es==1])
ni <- sapply(unique(ts[es==1]), function(x) sum(ts>=x))
lambda <- di/ni
S <- cumprod(1-lambda)
plot(as.integer(names(S)), S, type="s", lwd=3, col="red",
xlab="Time [days]", ylab="Survival Rate",
ylim=c(0, 1), xlim=c(0, 250))
legend("bottomleft", "Manual Computation", lwd=3, col="red", cex=.7)
mtext("Kaplan-Meier Survival Function Estimate", cex=.7)
## survival package
fit1 <- survfit(Surv(time = d$Account.Length, event=ev)~1)
plot(fit1, lwd=1, col="blue",
xlab="Time [days]", ylab="Survival Rate")
legend("bottomleft", c("survival package", "95% CI"), lty=c(1, 3),
col="blue", cex=.7)
mtext("Kaplan-Meier Survival Function Estimate", cex=.7)
Kaplan-Meier Survival Function Estimates
The median time-to-churn is the time when the value of the survival function is equal to 0.5
.
# Manual
median.idx <- which.min(abs(S - 0.5))
print(names(median.idx))
## [1] "201"
# survival package
print(fit1)
## Call: survfit(formula = Surv(time = d$Account.Length, event = ev) ~
## 1)
##
## n events median 0.95LCL 0.95UCL
## 3333 483 201 188 NA
We can get a more accurate result if we fit the model using one of the time-to-event probability density functions, and use the analytical solution to find the median value. To fit the survival model based on the Weibull distribution, we can use the survival
package, or for educational reasons, we can try to replicate the algorithms manually. To do it, we need to construct the likelihood function where each of the observations contributes in the following ways, under different censoring conditions:
For brevity, let’s assume that the event at a certain time is fully observed (0), right censored (1) or left-censored (2). In that case, the log-likelihood of the jth tuple \((time_j, event_j) = (t_j, y_j)\) can be expressed as:
\[ l_j = (a+by_j + cy_j^2)\log{f(t_j)} + (d+ey_j + fy_j^2)\log(1-F(t_j))+(g+hy_j+iy_j^2)\log{F(t_j)} \]
The coefficients \(a, b, c, d, e, f, g, h, i\) used to select the corresponding term can be calculated by solving:
\[ \left[\begin{matrix} a & b & c \\ d & e & f \\ g & h & i \end{matrix}\right] \times \left[\begin{matrix} 0^0 & 1^0 & 2^0 \\ 0^1 & 1^1 & 2^1 \\ 0^2 & 1^2 & 2^2 \end{matrix}\right]= \left[\begin{matrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{matrix}\right] \]
The solution is:
\[ \left[\begin{matrix} a & b & c \\ d & e & f \\ g & h & i \end{matrix}\right] = \left[\begin{matrix} 1 & -1.5 & 0.5 \\ 0 & 2 & -1 \\ 0 & -0.5 & 0.5 \end{matrix}\right] \]
The log likelihood function can be implemented in the following way:
# Log-likelihood
llk <- function(theta) {
alpha <- theta[1]
beta <- theta[2]
# fully observed
f <- dweibull(t, alpha, beta, log=TRUE)
# f <- log(alpha) - alpha*log(beta) + (alpha - 1)*log(t) - (t/beta)^alpha
# right censored
rc <- log(1 - pweibull(t, alpha, beta, log=FALSE))
# rc <- -(t/beta)^alpha
# left censored
lc <- pweibull(t, alpha, beta, log=TRUE)
# lc <- log( 1 - exp(-(t/beta)^alpha))
sum( (1 - 1.5*ev + 0.5*ev^2)*rc + (2*ev-ev^2)*f + (-0.5*ev + 0.5*ev^2)*lc)
}
More computationally efficient version written in C++ using the Rcpp library:
double llkC(NumericVector theta, NumericVector time, NumericVector event)
{
NumericVector params = theta;
NumericVector t = time;
NumericVector e = event;
double alpha = params(0);
double beta = params(1);
int N = t.size();
NumericVector f(N);
NumericVector rc(N);
NumericVector lc(N);
NumericVector e_sq(N);
NumericVector pow_t_beta(N);
pow_t_beta = pow(t/beta, alpha);
e_sq = pow(e, 2);
f = log(alpha) - alpha*log(beta) + (alpha - 1)*log(t) - pow_t_beta;
rc = -pow_t_beta;
lc = log(1 - exp(-pow_t_beta));
double llk = sum((1-1.5*e+0.5*e_sq)*rc + (2*e - e_sq)*f + (-0.5*e + 0.5*e_sq)*lc);
return(llk);
}
The parameter estimates which maximise the log-likelihood of the observed data can be obtained from the optim
function.
# alpha0 = 1, beta0 = 100
theta0 <- c(1, 100)
#optimise
fit <- optim(theta0, llkC.wrap, method="BFGS", control=list(fnscale=-1), hessian = TRUE)
alpha <- fit$par[1]
beta <- fit$par[2]
# median
m.churn <- beta * (log(2))^(1/alpha)
print(m.churn)
## [1] 198.6128
df.par <- cbind(fit$par, diag(sqrt(-1/fit$hessian)))
colnames(df.par) <- c("MLE", "sd")
rownames(df.par) <- c("$\\hat{\\alpha}$", "$\\hat{\\beta}$")
knitr::kable(df.par, digits=3, caption="Weibull model MLE estimates")
MLE | sd | |
---|---|---|
\(\hat{\alpha}\) | 2.811 | 0.065 |
\(\hat{\beta}\) | 226.268 | 3.662 |
The survival
package includes the survreg
function, which automates all the calculations. Its results are nearly identical:
sr.fit <- survreg(Surv(t, ev) ~ 1, dis="weibull")
sr.vec <- c(1/sr.fit$scale, exp(sr.fit$coefficients))
names(sr.vec) <- c("alpha", "beta")
sr.vec
## alpha beta
## 2.811443 226.268415
Once we know the shape and rate parameters of the Weibull distribution, we can calculate the median from the closed-form equation:
\[\text{median} = \beta \log(2)^{1/\alpha} \approx \hat{\beta} \log(2)^{1/\hat{\alpha}} = 198.6128\]
The dataset contains 19 covariates. Three of them are factorised, and the rest are numerical. One of the factorised covariates (State
) has 51 levels and can be removed if we consider the national scope. Let \(x_i = (x_{i1}, x_{i2}, \ldots, x_{i18})\) be the vector of covariates recorded for survival time \(t_i\). Then, the Cox Proportional Hazards Model defines the hazard function for the ith observation as:
\[ \lambda(t_j|x_i) = \lambda_0(t) exp \left( \sum_{j=1}^{18}\beta_j x_{ij}\right) \] where \(\beta\) is a vector or regression parameters.
Function coxph
from the survival
package allows us to check, which covariates have a statistically significant effect on the hazard.
cx <- coxph(Surv(time=Account.Length, event=ev) ~ . - State - Churn, data=d)
summary(cx, conf.int=FALSE)
## Call:
## coxph(formula = Surv(time = Account.Length, event = ev) ~ . -
## State - Churn, data = d)
##
## n= 3333, number of events= 483
##
## coef exp(coef) se(coef) z Pr(>|z|)
## Area.Code 4.256e-04 1.000e+00 1.077e-03 0.395 0.692768
## Int.l.Planyes 1.201e+00 3.323e+00 1.038e-01 11.573 < 2e-16 ***
## VMail.Planyes -2.142e+00 1.174e-01 4.845e-01 -4.421 9.81e-06 ***
## VMail.Message 4.356e-02 1.045e+00 1.495e-02 2.914 0.003572 **
## Day.Mins -9.006e-01 4.063e-01 2.667e+00 -0.338 0.735576
## Day.Calls 1.323e-03 1.001e+00 2.242e-03 0.590 0.555070
## Day.Charge 5.351e+00 2.108e+02 1.569e+01 0.341 0.732999
## Eve.Mins 8.643e-01 2.373e+00 1.344e+00 0.643 0.520337
## Eve.Calls 1.562e-03 1.002e+00 2.342e-03 0.667 0.504804
## Eve.Charge -1.012e+01 4.035e-05 1.582e+01 -0.640 0.522412
## Night.Mins -1.230e-02 9.878e-01 7.374e-01 -0.017 0.986693
## Night.Calls 1.335e-03 1.001e+00 2.416e-03 0.552 0.580670
## Night.Charge 3.415e-01 1.407e+00 1.639e+01 0.021 0.983371
## Intl.Mins -4.326e+00 1.322e-02 4.402e+00 -0.983 0.325763
## Intl.Calls -7.365e-02 9.290e-01 2.121e-02 -3.473 0.000514 ***
## Intl.Charge 1.622e+01 1.110e+07 1.630e+01 0.995 0.319730
## CustServ.Calls 3.075e-01 1.360e+00 2.745e-02 11.201 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Concordance= 0.772 (se = 0.012 )
## Likelihood ratio test= 443.1 on 17 df, p=<2e-16
## Wald test = 468.6 on 17 df, p=<2e-16
## Score (logrank) test = 498.6 on 17 df, p=<2e-16
The only statistically significant results (at \(\alpha=0.05\)) are:
Description | exp(Coefficient) | |
---|---|---|
Int.l.Planyes | Categorical, yes or no, international plan. | 3.323 |
VMail.Planyes | Categorical, yes or no, voice mail plan. | 0.117 |
VMail.Message | Count, number of voice mail messages. | 1.045 |
Intl.Calls | Count, total number of international calls. | 0.929 |
CustServ.Calls | Count, number of calls to customer service. | 1.360 |
The parsimonious model is relying only on those 5 covariates.
fit.pars <- survreg(Surv(time=Account.Length, event=ev) ~
Int.l.Plan + VMail.Plan + VMail.Message + Intl.Calls + CustServ.Calls,
data=d)
summary(fit.pars)
##
## Call:
## survreg(formula = Surv(time = Account.Length, event = ev) ~ Int.l.Plan +
## VMail.Plan + VMail.Message + Intl.Calls + CustServ.Calls,
## data = d)
## Value Std. Error z p
## (Intercept) 5.55418 0.04889 113.61 < 2e-16
## Int.l.Planyes -0.44955 0.03952 -11.38 < 2e-16
## VMail.Planyes 0.63196 0.16890 3.74 0.00018
## VMail.Message -0.01287 0.00524 -2.45 0.01411
## Intl.Calls 0.02264 0.00742 3.05 0.00227
## CustServ.Calls -0.10816 0.01045 -10.35 < 2e-16
## Log(scale) -1.03396 0.03577 -28.91 < 2e-16
##
## Scale= 0.356
##
## Weibull distribution
## Loglik(model)= -3244 Loglik(intercept only)= -3383.9
## Chisq= 279.77 on 5 degrees of freedom, p= 2.2e-58
## Number of Newton-Raphson Iterations: 8
## n= 3333
The baseline Weibull parameters of the parsimonious model are slightly different from the full model.
pars.vec <- c(1/fit.pars$scale, exp(fit.pars$coefficients[1]))
names(pars.vec) <- c("alpha", "beta")
pars.vec
## alpha beta
## 2.812171 258.314657
Exponentiated coefficients either increase the hazard (if they are \(>1\)) or decrease it (if they are \(<1\)).
The Weibull distribution proved to be a good fit for the model. The scale parameter is equal \(\alpha = 2.811 \pm 0.065\) and is very unlikely to be \(1\). Hence, the Weibull distribution fits the data better than the exponential.
The median time-to-churn is 201 days. This period is slightly larger than 6 months (182 days). Considering the fact that most of the mobile network operators offer 12 or 24 months contract, it could mean that half of the customers will leave before it ends.
There are 5 covariates that can be used in the parsimonious model. International Plan increases the hazard of the customer leaving three-fold. It is an interesting observation because people who make more International Calls are more likely to stay longer. One explanation of this paradox may be that some customers go for more expensive plans (which include international calls), to get better phones for less. Later, they are either unable or unwilling to pay, and the contract is terminated more quickly.
The fact may partially support this hypothesis, that similar thing is happening with Voice Mail Plan and Voice Mail Message. Even though customers who have the voice mail plan activated are 88.3% less likely to go at any time, every message left in their voice mail increases the hazard by 4.5%. Voice mail is targeted towards people, who cannot answer the phone calls at any given moment, probably because they are busy making money. If a customer has many voice mail messages, it could mean that they have either shared the number with someone they would like to avoid, or they never intended to use that number. In the latter case, messages are most likely left by the debt collectors.
Unsurprisingly, contacting the customer service increases the hazard of leaving by 36% for every phone call made. What is surprising is how many chances the customers give to the operators before they decide to go. Culturally dictated common sense would give the operator three opportunities before the issue is sorted. Still, here we see strong evidence that they give four chances before they decide to leave. After the fourth phone call to the customer service, every subsequent call increases the hazard of going five-fold.
d$cs <- d$CustServ.Calls > 4
coxph(Surv(t, ev) ~ d$cs)
## Call:
## coxph(formula = Surv(t, ev) ~ d$cs)
##
## coef exp(coef) se(coef) z p
## d$csTRUE 1.6135 5.0205 0.1365 11.82 <2e-16
##
## Likelihood ratio test=95.03 on 1 df, p=< 2.2e-16
## n= 3333, number of events= 483