suppressMessages(library(tidyverse))
suppressMessages(library(readxl))
suppressMessages(library(ggmosaic))
suppressMessages(library(caret))
suppressMessages(library(plotly))
suppressMessages(library(randomForest))
suppressMessages(library(RColorBrewer))
suppressMessages(library(lubridate))
suppressMessages(library(janitor))
set.seed(42)
theme_set(theme_minimal())
df_full <- clean_names(read_excel("../../data/Telco_customer_churn.xlsx")) %>%
  mutate(churn_label = as.factor(churn_label),
         gender = as.factor(gender),
         senior_citizen = as.factor(senior_citizen),
         partner = as.factor(partner),
         dependents = as.factor(dependents),
         phone_service = as.factor(phone_service),
         multiple_lines = as.factor(multiple_lines),
         internet_service = as.factor(internet_service),
         online_security = as.factor(online_security),
         online_backup = as.factor(online_backup),
         device_protection = as.factor(device_protection),
         tech_support = as.factor(tech_support),
         streaming_tv = as.factor(streaming_tv),
         streaming_movies = as.factor(streaming_movies),
         contract = as.factor(contract),
         paperless_billing = as.factor(paperless_billing),
         payment_method = as.factor(payment_method)) %>%
  rename(churn = churn_label)

# Remove unnecessary columns and rows with NA
df <- df_full[, c(10:29)] %>%
  na.omit()
Understanding and Addressing Customer Churn in Subscription-Based Businesses
In the competitive world of subscription-based businesses, customer retention is just as important—if not more so—than customer acquisition. Every churned customer represents not only a loss in recurring revenue but also a missed opportunity to build a long-term relationship. This makes predicting and addressing customer churn a critical area of focus for companies that rely on subscription models.
Churn prediction is more than just a numbers game; it’s a strategic lever for sustainable growth. By identifying customers at risk of leaving, businesses can take preemptive measures to retain them—whether that means offering a personalized incentive, providing additional support, or addressing specific pain points. Such proactive interventions can transform potential losses into renewed loyalty.
However, the success of any churn mitigation strategy depends on precision. Casting too wide a net—like sending coupons to the entire customer base—can quickly become a costly exercise, eroding margins and diminishing the return on investment. Conversely, an overly narrow approach risks missing at-risk customers altogether. Striking the right balance requires an effective churn prediction model, one capable of accurately identifying the subset of customers most likely to churn.
In this blog, we’ll delve into the data and methods behind churn prediction, exploring how to balance analytical rigor with actionable business strategies to reduce churn and maximize customer lifetime value.
Predicting Churn in the Real World
During my time at Jama Software, I spearheaded efforts to predict and reduce customer churn by integrating insights from application usage data. By analyzing how customers interacted with our platform, we identified key behavioral patterns that distinguished our most engaged users from those at risk of leaving.
Working closely with the product team, we embedded flags within the application’s code to track feature usage. This allowed us to understand which parts of the platform drove value for our best customers and which behaviors signaled potential disengagement. These insights were combined with customer profiles and contract details to create predictive models that flagged accounts at risk of churning.
The resulting insights were invaluable for the Customer Success team, enabling them to proactively engage at-risk customers. Whether it was through personalized outreach, targeted training, or addressing specific pain points, these interventions helped foster stronger customer relationships while reducing churn. This experience reinforced for me the power of data-driven decision-making in addressing complex business challenges.
Exploring Churn with Kaggle Data
While my experience at Jama Software provided firsthand insights into predicting customer churn, the specific data and methods used there remain proprietary. To illustrate similar concepts and methodologies here, I’ll turn to a publicly available dataset from Kaggle, which contains information on over 7,000 telecom customers.
This dataset offers a rich mix of categorical and quantitative variables, making it an excellent resource for exploring the drivers of churn and the steps businesses can take to address it. By analyzing this data, we can build a clearer picture of the patterns and predictors of churn, as well as demonstrate practical approaches for mitigating it.
Here’s a link to the source dataset: https://www.kaggle.com/datasets/yeanzc/telco-customer-churn-ibm-dataset
First, I’m going to set up an environment in R to read in the data and clean it up a bit.
The Data
str(df)
tibble [7,032 × 20] (S3: tbl_df/tbl/data.frame)
$ gender : Factor w/ 2 levels "Female","Male": 2 1 1 1 2 1 2 2 2 2 ...
$ senior_citizen : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 1 ...
$ partner : Factor w/ 2 levels "No","Yes": 1 1 1 2 1 2 1 1 2 2 ...
$ dependents : Factor w/ 2 levels "No","Yes": 1 2 2 2 2 1 1 1 2 1 ...
$ tenure_months : num [1:7032] 2 2 8 28 49 10 1 1 47 1 ...
$ phone_service : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 1 2 2 1 ...
$ multiple_lines : Factor w/ 3 levels "No","No phone service",..: 1 1 3 3 3 1 2 1 3 2 ...
$ internet_service : Factor w/ 3 levels "DSL","Fiber optic",..: 1 2 2 2 2 1 1 3 2 1 ...
$ online_security : Factor w/ 3 levels "No","No internet service",..: 3 1 1 1 1 1 1 2 1 1 ...
$ online_backup : Factor w/ 3 levels "No","No internet service",..: 3 1 1 1 3 1 1 2 3 3 ...
$ device_protection: Factor w/ 3 levels "No","No internet service",..: 1 1 3 3 3 3 3 2 1 1 ...
$ tech_support : Factor w/ 3 levels "No","No internet service",..: 1 1 1 3 1 3 1 2 1 1 ...
$ streaming_tv : Factor w/ 3 levels "No","No internet service",..: 1 1 3 3 3 1 1 2 3 1 ...
$ streaming_movies : Factor w/ 3 levels "No","No internet service",..: 1 1 3 3 3 1 3 2 3 1 ...
$ contract : Factor w/ 3 levels "Month-to-month",..: 1 1 1 1 1 1 1 1 1 1 ...
$ paperless_billing: Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 1 2 1 ...
$ payment_method : Factor w/ 4 levels "Bank transfer (automatic)",..: 4 3 3 3 1 2 3 4 3 3 ...
$ monthly_charges : num [1:7032] 53.9 70.7 99.7 104.8 103.7 ...
$ total_charges : num [1:7032] 108 152 820 3046 5036 ...
$ churn : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
- attr(*, "na.action")= 'omit' Named int [1:11] 2235 2439 2569 2668 2857 4332 4688 5105 5720 6773 ...
..- attr(*, "names")= chr [1:11] "2235" "2439" "2569" "2668" ...
We have a dataframe with 7,032 rows and 20 columns. The key variable, at the bottom, is named ‘churn’, and that’s a Yes/No factor. Other than churn, we have 16 categorical variables (factors), like gender, senior_citizen, etc. And, we have three quantitative variables: tenure in months, monthly charges, and total charges.
Before diving into the predictive modeling phase, it’s standard practice to conduct a thorough analysis of the dataset—a process known as Exploratory Data Analysis (EDA). EDA helps uncover patterns, relationships, and potential anomalies in the data, providing essential context for building accurate and reliable models.
In a full EDA, I’d typically explore the distribution of variables, look for correlations, and assess the quality of the data. This might include visualizing trends, identifying outliers, and understanding how features interact with each other and the target variable (in this case, customer churn).
However, since the primary focus of this blog is on applying logistic regression, I’ll touch only briefly on a few example variables from the dataset to give a sense of its structure. Future posts will dig more deeply into EDA processes.
Gender:
df %>%
  group_by(gender) %>%
  count(churn) %>%
  ggplot(aes(x = gender, y = n, fill = churn)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  scale_fill_brewer(palette = "Set1") +
  labs(y = "Frequency", x = "Gender", fill = "Churn")
First, I’ve plotted the frequency table for gender and churn…visually we can quickly see that the churn result seems about the same for both genders. We’ll use another method below to look at the relationship, but a graph like this would pretty quickly strike this categorical variable from the interesting list.
Contract Length:
df %>%
  group_by(contract) %>%
  count(churn) %>%
  ggplot(aes(x = contract, y = n, fill = churn)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  scale_fill_brewer(palette = "Set1") +
  labs(y = "Frequency", x = "Contract", fill = "Churn")
In contrast to gender, when I plot a frequency table for “Contract” against “Churn” an interesting relationship seems to jump out. Customers on a one-year or two-year contract seem to churn at a much lower rate than customers on a month-to-month contract. In itself, this isn’t an entirely unexpected observation, but this chart would get the variable “Contract” on the interesting list of variables to dig into further.
Tenure in Months:
df %>%
  ggplot(aes(x = tenure_months, fill = churn)) +
  geom_density(alpha = .4) +
  scale_fill_brewer(palette = "Set1") +
  labs(title = "Monthly Tenure by Churn Status", fill = "Churn",
       x = "Tenure Months", y = "")
This plot looks different because the variable “tenure_months” is a quantitative variable, so we can look at a whole range of values, not just Yes/No. Here I’ve plotted the number of months each customer has been with the company, with the fill color indicating whether or not they have churned. Somewhat interestingly, once a customer makes it past the first year, they appear less and less likely to churn. Again, this is the type of interesting plot that would warrant further examination.
Predicting Churn: An Introduction to Logistic Regression
To predict customer churn, one of the most accessible and widely used techniques is logistic regression. While the name might sound technical, the concept is straightforward: logistic regression is a mathematical method that helps us determine the likelihood of an event happening—in this case, whether a customer will churn.
Think of it like drawing a line through a scatterplot of data points, where each point represents a customer. The goal isn’t just to divide the data into “churners” and “non-churners,” but to estimate the probability that any given customer might churn. These probabilities, which range between 0 and 1, can then guide targeted interventions.
Unlike traditional linear regression, which predicts continuous values like sales or revenue, logistic regression focuses on binary outcomes—yes or no, churn or stay. It does this by applying a special mathematical function that transforms raw data into probabilities, making it ideal for this kind of prediction.
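That “special mathematical function” is the logistic (or sigmoid) function, which squashes any real-valued score into the 0-to-1 range. As a minimal sketch in R:

```r
# The logistic (sigmoid) function maps any real-valued score to a probability in (0, 1)
sigmoid <- function(z) 1 / (1 + exp(-z))

sigmoid(0)   # 0.5: a score of zero is a coin flip
sigmoid(4)   # close to 1: strong evidence of churning
sigmoid(-4)  # close to 0: strong evidence of staying
```

Under the hood, logistic regression fits a linear combination of the features and passes that score through this function to produce a churn probability.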
In this blog, I’ll use logistic regression to analyze the Kaggle dataset, identifying the key factors that influence churn and demonstrating how businesses can act on these insights. In future posts, we’ll explore more advanced techniques like Random Forest, but for now, we’ll focus on this foundational and highly effective method.
Selecting Features: Using the Chi-Squared Test for Categorical Variables
Before diving into logistic regression, it’s essential to identify which features (or variables) in our dataset are most relevant to predicting churn. We started that process informally above with plots; here I’ll describe a more formal way to select variables. For categorical variables, one powerful tool for this is the Chi-Squared test.
The Chi-Squared test helps us determine if there is a meaningful relationship between two categorical variables—in this case, between each feature (e.g., contract type, payment method) and whether or not a customer churned. It works by comparing the observed data (what actually happened) to what we’d expect if there were no relationship at all.
For example, if churn and contract type were completely unrelated, we’d expect customers with monthly contracts to churn at the same rate as those with annual contracts. The Chi-Squared test quantifies the difference between this “expected” scenario and the actual data, providing a numerical result called the p-value. A low p-value indicates that the feature and churn are likely connected and should be considered in our model.
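To make the “expected” scenario concrete, here’s a toy example (with made-up counts, not the Telco data) showing how the expected count for each cell is computed under the assumption of independence:

```r
# Toy 2x2 table (hypothetical counts): contract type vs. churn
observed <- matrix(c(40, 10, 30, 20), nrow = 2,
                   dimnames = list(contract = c("Monthly", "Annual"),
                                   churn = c("Yes", "No")))

# Under independence, each expected count is (row total * column total) / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
expected

# chisq.test computes the same expected counts internally
chisq.test(observed)$expected
```

The Chi-Squared statistic then sums the scaled squared differences between the observed and expected cells; a big total means the data look nothing like the independent scenario.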
chisq.test(df$contract, df$churn)
Pearson's Chi-squared test
data: df$contract and df$churn
X-squared = 1179.5, df = 2, p-value < 2.2e-16
In this example, the p-value is less than 0.00000000000000022 … very small, which tells us contract is a factor we should use in our analysis.
By applying this test to each categorical variable, we can narrow down our list of features to those with the strongest potential impact on churn. This step ensures our analysis focuses on the most relevant data, making our predictions more accurate and actionable.
To cycle through each variable, the function below will return a table with the chi-squared statistic and the p-value for each categorical variable compared against churn.
# Function to go through and conduct the Chi-Squared test on each factor in the table against Churn
perform_chisq_tests <- function(data, target_var) {
  # Ensure the target variable is a factor
  if (!is.factor(data[[target_var]])) {
    stop("The target variable must be a factor.")
  }

  # Initialize a list to store the results
  results <- list()

  # Iterate through columns in the tibble
  for (col in names(data)) {
    if (col != target_var && is.factor(data[[col]])) {
      # Create a contingency table
      contingency_table <- table(data[[col]], data[[target_var]])

      # Perform the chi-squared test
      test_result <- tryCatch({
        chisq.test(contingency_table)
      }, warning = function(w) {
        message(paste("Warning for column:", col, "-", w$message))
        return(NULL)
      }, error = function(e) {
        message(paste("Error for column:", col, "-", e$message))
        return(NULL)
      })

      # Store the result if successful
      if (!is.null(test_result)) {
        results[[col]] <- list(
          variable = col,
          chi_squared = test_result$statistic,
          p_value = test_result$p.value
        )
      }
    }
  }

  # Convert results to a tibble
  results_df <- bind_rows(results, .id = "variable") %>%
    select(variable, chi_squared, p_value) # Ensure desired columns
  return(results_df)
}
results <- perform_chisq_tests(df, "churn") %>%
  arrange(-chi_squared)
results
# A tibble: 16 × 3
variable chi_squared p_value
<chr> <dbl> <dbl>
1 contract 1180. 7.33e-257
2 online_security 847. 1.40e-184
3 tech_support 825. 7.41e-180
4 internet_service 729. 5.83e-159
5 payment_method 645. 1.43e-139
6 online_backup 599. 7.78e-131
7 device_protection 556. 1.96e-121
8 dependents 432. 7.10e- 96
9 streaming_movies 374. 5.35e- 82
10 streaming_tv 372. 1.32e- 81
11 paperless_billing 257. 8.24e- 58
12 senior_citizen 158. 2.48e- 36
13 partner 158. 3.97e- 36
14 multiple_lines 11.3 3.57e- 3
15 phone_service 0.874 3.50e- 1
16 gender 0.475 4.90e- 1
The resulting table ranks our categorical variables by the Chi-Squared statistic. At the bottom of the list we see “gender”. The earlier plot suggested that there wasn’t an interesting relationship between it and “churn”…this metric confirms that thought.
At the top of the list is “contract”…again, a confirmation of what was shown earlier in the plot.
Selecting Features: Using the t-Test for Quantitative Variables
For quantitative variables, we turn to the t-test, a statistical tool that helps determine whether a measurable feature (like total charges or monthly bill amount) is related to customer churn. The t-test operates within the framework of hypothesis testing, allowing us to systematically assess the significance of relationships in the data.
Here’s how it works: we start with the null hypothesis, which assumes there is no relationship between the quantitative variable and churn—essentially, that the feature’s average value is the same for customers who churned and those who stayed. The t-test then calculates the likelihood that this assumption is true by comparing the means of the two groups.
The result of the test is a p-value, which represents the probability of observing the data if the null hypothesis were correct. A low p-value (typically less than 0.05) suggests that the null hypothesis is unlikely to be true, meaning there is likely a meaningful difference in the variable between churned and non-churned customers.
For example, if customers who churned had significantly higher monthly charges than those who stayed, the t-test would reveal this difference through a low p-value, indicating the variable is worth including in our model. Like the Chi-Squared test for categorical variables, the t-test ensures we focus on the features that are most likely to impact churn, setting the stage for a more accurate logistic regression analysis.
t.test(df$monthly_charges ~ df$churn)
Welch Two Sample t-test
data: df$monthly_charges by df$churn
t = -18.341, df = 4139.7, p-value < 2.2e-16
alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
95 percent confidence interval:
-14.53786 -11.72998
sample estimates:
mean in group No mean in group Yes
61.30741 74.44133
t.test(df$tenure_months ~ df$churn)
Welch Two Sample t-test
data: df$tenure_months by df$churn
t = 34.972, df = 4045.5, p-value < 2.2e-16
alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
95 percent confidence interval:
18.56811 20.77364
sample estimates:
mean in group No mean in group Yes
37.65001 17.97913
t.test(df$total_charges ~ df$churn)
Welch Two Sample t-test
data: df$total_charges by df$churn
t = 18.801, df = 4042.9, p-value < 2.2e-16
alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
95 percent confidence interval:
916.8121 1130.2840
sample estimates:
mean in group No mean in group Yes
2555.344 1531.796
Preparing the Data: Splitting into Training and Testing Sets
Before we can apply logistic regression, it’s crucial to split our dataset into two parts: a training set and a testing set. This step ensures that our analysis is robust and that our predictions can generalize to new, unseen data.
The training set, which we’ll create using 80% of the data, is used to “teach” the logistic regression algorithm. This is where the model identifies patterns and relationships between the selected features and whether or not a customer churned. However, to truly evaluate how well the model performs, we need to test it on data it hasn’t encountered before.
That’s where the testing set comes in—the remaining 20% of the data is held back, unseen by the model during training. Once the model is trained, we’ll use the testing set to assess its accuracy and reliability. By comparing the model’s predictions against the actual outcomes in the testing set, we can measure how well it’s likely to perform on new data in a real-world scenario.
This split between training and testing is a foundational step in machine learning and ensures that our churn predictions aren’t just accurate for the dataset at hand but are also scalable and actionable for broader applications.
# Calculate the row indices for the training set (80%)
training_indices <- sample(1:nrow(df), size = 0.8 * nrow(df))

# Create the training set (80% of the data)
df_train <- df %>%
  slice(training_indices)

# Create the testing set (remaining 20% of the data)
df_test <- df %>%
  slice(-training_indices)
Running the Algorithm
Below I have set up the logistic regression model in R. With the model built, the code predicts churn for the 20% test group, then outputs data on how well the regression predicted churn for that test set.
Note…here I’m essentially using ‘generic’ settings to get a result. The purpose of this post isn’t to go into detail on the fine tuning of the algorithm…future posts will dig more deeply into the details.
ctrl <- trainControl(method = "cv", number = 10, classProbs = TRUE,
                     summaryFunction = twoClassSummary)

glm.fit <- train(churn ~ senior_citizen + partner + dependents + tenure_months +
                   phone_service + multiple_lines + internet_service + online_security +
                   online_backup + device_protection + tech_support + streaming_tv +
                   streaming_movies + contract + paperless_billing + payment_method +
                   monthly_charges + total_charges,
                 data = df_train, method = "glm", metric = "ROC",
                 preProcess = c("center", "scale"), trControl = ctrl)

glm.predictions <- glm.fit %>%
  predict(df_test)

glm.cm <- data.frame(glm = confusionMatrix(glm.predictions, df_test$churn,
                                           positive = "Yes", mode = "everything")$byClass)
confusionMatrix(glm.predictions, df_test$churn, positive = "Yes", mode = "everything")
Confusion Matrix and Statistics
Reference
Prediction No Yes
No 951 159
Yes 101 196
Accuracy : 0.8152
95% CI : (0.7939, 0.8352)
No Information Rate : 0.7477
P-Value [Acc > NIR] : 1.027e-09
Kappa : 0.4822
Mcnemar's Test P-Value : 0.0004078
Sensitivity : 0.5521
Specificity : 0.9040
Pos Pred Value : 0.6599
Neg Pred Value : 0.8568
Precision : 0.6599
Recall : 0.5521
F1 : 0.6012
Prevalence : 0.2523
Detection Rate : 0.1393
Detection Prevalence : 0.2111
Balanced Accuracy : 0.7281
'Positive' Class : Yes
Interpreting the Logistic Regression Results: The Confusion Matrix
After running our logistic regression model and comparing its predictions against the test dataset, we use a confusion matrix to summarize the results. Here’s what the matrix and accompanying statistics tell us in simple terms:
| | Actual: No | Actual: Yes |
|---|---|---|
| Predicted: No | 951 | 159 |
| Predicted: Yes | 101 | 196 |
True Negatives (951): The model correctly predicted 951 customers who did not churn.
False Negatives (159): The model missed 159 customers who actually churned but were predicted to stay.
True Positives (196): The model correctly predicted 196 customers who churned.
False Positives (101): The model incorrectly predicted 101 customers would churn, but they didn’t.
Key Performance Metrics
Accuracy (81.52%): Overall, the model made correct predictions 81.52% of the time. This means it generally performs well at distinguishing between churned and non-churned customers.
Sensitivity (55.21%): Sensitivity, or recall, measures how well the model identifies customers who churned. Here, it caught only 55.21% of actual churners, meaning some customers at risk of leaving were missed.
Specificity (90.40%): Specificity shows how well the model identifies customers who stayed. At 90.40%, it is highly effective at correctly predicting non-churners.
Precision (65.99%): This metric evaluates how many of the customers predicted to churn actually did. About 66% of the churn predictions were correct.
Balanced Accuracy (72.81%): Balances sensitivity and specificity, showing the model performs fairly well overall, even though sensitivity is lower.
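These metrics all follow directly from the four cells of the matrix; as a quick arithmetic check in R:

```r
# Cells from the confusion matrix above
tn <- 951; fn <- 159; fp <- 101; tp <- 196

accuracy    <- (tp + tn) / (tp + tn + fp + fn)  # 0.8152
sensitivity <- tp / (tp + fn)                   # 0.5521 (also called recall)
specificity <- tn / (tn + fp)                   # 0.9040
precision   <- tp / (tp + fp)                   # 0.6599
balanced    <- (sensitivity + specificity) / 2  # 0.7281
```

Each value matches the corresponding line in the caret output, which is a useful sanity check that we’re reading the matrix correctly.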
What Does This Mean?
The model does a good job of identifying customers who are not at risk of churn (high specificity), but it struggles a bit to catch all customers who are at risk (lower sensitivity). This means while the model is reliable for confirming who will stay, it may miss some opportunities to retain churn-prone customers.
Improving sensitivity could be a key next step, perhaps by refining the features used, adjusting thresholds, or exploring more advanced algorithms. Still, with an accuracy of over 81%, this model offers a strong foundation for making proactive, data-driven decisions to reduce churn.
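One concrete lever is the decision threshold. By default, a customer is labeled a churner when the predicted probability exceeds 0.5; lowering that cutoff flags more customers as churn risks, trading specificity for sensitivity. Here’s a toy sketch on synthetic data (not the Telco set) of the idea:

```r
set.seed(1)

# Synthetic data: one predictor, binary outcome generated from a logistic model
x <- rnorm(500)
p <- 1 / (1 + exp(-2 * x))
y <- factor(ifelse(runif(500) < p, "Yes", "No"))

# Fit a logistic regression and get predicted churn probabilities
fit <- glm(y ~ x, family = binomial)
probs <- predict(fit, type = "response")

# Lowering the cutoff from 0.5 to 0.35 flags at least as many potential churners
sum(probs > 0.35) >= sum(probs > 0.5)  # TRUE
```

The same idea applies to the caret model above: predict class probabilities with `type = "prob"` and cut them at whatever threshold delivers the sensitivity the retention team needs.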
Conclusion: A High-Level Walkthrough of Predicting Customer Churn
In this blog, I walked through the key steps for building a churn prediction model, focusing on the high-level process rather than diving deep into every detail. While I touched only briefly on Exploratory Data Analysis (EDA), I used statistical tools like the t-test and Chi-Squared test to select meaningful features from the dataset. This ensured the model was built on variables with the strongest potential to explain churn.
After splitting the data into training and testing sets, I applied logistic regression—a foundational algorithm for binary classification problems like churn prediction. The result was a model with over 80% accuracy, a solid foundation for identifying customers likely to leave.
In a real-world scenario, this would be just the beginning. I would refine and tune the model further, exploring different features, adjusting thresholds, or trying advanced algorithms to improve performance. However, the purpose of this blog was to provide a structured, high-level overview of the process, and I think I successfully achieved that goal.
With a roadmap of these steps in hand, businesses can begin using data-driven insights to proactively address churn, improve customer retention, and ultimately drive growth. Future posts will delve deeper into techniques like EDA, model tuning, and advanced algorithms to build on the foundation laid here.