suppressMessages(library(tidyverse))
suppressMessages(library(readxl))
suppressMessages(library(ggmosaic))
suppressMessages(library(caret))
suppressMessages(library(plotly))
suppressMessages(library(randomForest))
suppressMessages(library(RColorBrewer))
suppressMessages(library(lubridate))
suppressMessages(library(janitor))
set.seed(42)
theme_set(theme_minimal())
df_full <- clean_names(read_excel("../../data/Telco_customer_churn.xlsx")) %>%
  mutate(churn_label = as.factor(churn_label),
         gender = as.factor(gender),
         senior_citizen = as.factor(senior_citizen),
         partner = as.factor(partner),
         dependents = as.factor(dependents),
         phone_service = as.factor(phone_service),
         multiple_lines = as.factor(multiple_lines),
         internet_service = as.factor(internet_service),
         online_security = as.factor(online_security),
         online_backup = as.factor(online_backup),
         device_protection = as.factor(device_protection),
         tech_support = as.factor(tech_support),
         streaming_tv = as.factor(streaming_tv),
         streaming_movies = as.factor(streaming_movies),
         contract = as.factor(contract),
         paperless_billing = as.factor(paperless_billing),
         payment_method = as.factor(payment_method)) %>%
  rename(churn = churn_label)

# Remove unnecessary columns and rows with NA
df <- df_full[, c(10:29)] %>%
  na.omit()
Understanding and Addressing Customer Churn in Subscription-Based Businesses
In the competitive world of subscription-based businesses, customer retention is just as important—if not more so—than customer acquisition. Every churned customer represents not only a loss in recurring revenue but also a missed opportunity to build a long-term relationship. This makes predicting and addressing customer churn a critical area of focus for companies that rely on subscription models.
Churn prediction is more than just a numbers game; it’s a strategic lever for sustainable growth. By identifying customers at risk of leaving, businesses can take preemptive measures to retain them—whether that means offering a personalized incentive, providing additional support, or addressing specific pain points. Such proactive interventions can transform potential losses into renewed loyalty.
However, the success of any churn mitigation strategy depends on precision. Casting too wide a net—like sending coupons to the entire customer base—can quickly become a costly exercise, eroding margins and diminishing the return on investment. Conversely, an overly narrow approach risks missing at-risk customers altogether. Striking the right balance requires an effective churn prediction model, one capable of accurately identifying the subset of customers most likely to churn.
In this blog, we’ll delve into the data and methods behind churn prediction, exploring how to balance analytical rigor with actionable business strategies to reduce churn and maximize customer lifetime value.
Predicting Churn in the Real World
During my time at Jama Software, I spearheaded efforts to predict and reduce customer churn by integrating insights from application usage data. By analyzing how customers interacted with our platform, we identified key behavioral patterns that distinguished our most engaged users from those at risk of leaving.
Working closely with the product team, we embedded flags within the application’s code to track feature usage. This allowed us to understand which parts of the platform drove value for our best customers and which behaviors signaled potential disengagement. These insights were combined with customer profiles and contract details to create predictive models that flagged accounts at risk of churning.
The resulting insights were invaluable for the Customer Success team, enabling them to proactively engage at-risk customers. Whether it was through personalized outreach, targeted training, or addressing specific pain points, these interventions helped foster stronger customer relationships while reducing churn. This experience reinforced for me the power of data-driven decision-making in addressing complex business challenges.
Exploring Churn with Kaggle Data
While my experience at Jama Software provided firsthand insights into predicting customer churn, the specific data and methods used there remain proprietary. To illustrate similar concepts and methodologies here, I’ll turn to a publicly available dataset from Kaggle, which contains information on over 7,000 telecom customers.
This dataset offers a rich mix of categorical and quantitative variables, making it an excellent resource for exploring the drivers of churn and the steps businesses can take to address it. By analyzing this data, we can build a clearer picture of the patterns and predictors of churn, as well as demonstrate practical approaches for mitigating it.
Here’s a link to the source dataset: https://www.kaggle.com/datasets/yeanzc/telco-customer-churn-ibm-dataset
First, I’m going to set up an environment in R to read in the data and clean it up a bit.
The Data
str(df)
tibble [7,032 × 20] (S3: tbl_df/tbl/data.frame)
$ gender : Factor w/ 2 levels "Female","Male": 2 1 1 1 2 1 2 2 2 2 ...
$ senior_citizen : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 1 ...
$ partner : Factor w/ 2 levels "No","Yes": 1 1 1 2 1 2 1 1 2 2 ...
$ dependents : Factor w/ 2 levels "No","Yes": 1 2 2 2 2 1 1 1 2 1 ...
$ tenure_months : num [1:7032] 2 2 8 28 49 10 1 1 47 1 ...
$ phone_service : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 1 2 2 1 ...
$ multiple_lines : Factor w/ 3 levels "No","No phone service",..: 1 1 3 3 3 1 2 1 3 2 ...
$ internet_service : Factor w/ 3 levels "DSL","Fiber optic",..: 1 2 2 2 2 1 1 3 2 1 ...
$ online_security : Factor w/ 3 levels "No","No internet service",..: 3 1 1 1 1 1 1 2 1 1 ...
$ online_backup : Factor w/ 3 levels "No","No internet service",..: 3 1 1 1 3 1 1 2 3 3 ...
$ device_protection: Factor w/ 3 levels "No","No internet service",..: 1 1 3 3 3 3 3 2 1 1 ...
$ tech_support : Factor w/ 3 levels "No","No internet service",..: 1 1 1 3 1 3 1 2 1 1 ...
$ streaming_tv : Factor w/ 3 levels "No","No internet service",..: 1 1 3 3 3 1 1 2 3 1 ...
$ streaming_movies : Factor w/ 3 levels "No","No internet service",..: 1 1 3 3 3 1 3 2 3 1 ...
$ contract : Factor w/ 3 levels "Month-to-month",..: 1 1 1 1 1 1 1 1 1 1 ...
$ paperless_billing: Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 1 2 1 ...
$ payment_method : Factor w/ 4 levels "Bank transfer (automatic)",..: 4 3 3 3 1 2 3 4 3 3 ...
$ monthly_charges : num [1:7032] 53.9 70.7 99.7 104.8 103.7 ...
$ total_charges : num [1:7032] 108 152 820 3046 5036 ...
$ churn : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
- attr(*, "na.action")= 'omit' Named int [1:11] 2235 2439 2569 2668 2857 4332 4688 5105 5720 6773 ...
..- attr(*, "names")= chr [1:11] "2235" "2439" "2569" "2668" ...
We have a dataframe with 7,032 rows and 20 columns. The key variable, at the bottom, is named ‘churn’, and that’s a Yes/No factor. Other than churn, we have 16 categorical variables (factors), like gender, senior_citizen, etc. And, we have three quantitative variables: tenure in months, monthly charges, and total charges.
Before diving into the predictive modeling phase, it’s standard practice to conduct a thorough analysis of the dataset—a process known as Exploratory Data Analysis (EDA). EDA helps uncover patterns, relationships, and potential anomalies in the data, providing essential context for building accurate and reliable models.
In a full EDA, I’d typically explore the distribution of variables, look for correlations, and assess the quality of the data. This might include visualizing trends, identifying outliers, and understanding how features interact with each other and the target variable (in this case, customer churn).
However, since the primary focus of this blog is on applying logistic regression, I’ll touch only briefly on a few example variables from the dataset to give a sense of its structure. Future posts will dig more deeply into EDA processes.
Gender:
df %>%
  group_by(gender) %>%
  count(churn) %>%
  ggplot(aes(x = gender, y = n, fill = churn)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  scale_fill_brewer(palette = "Set1") +
  labs(y = "Frequency", x = "Gender", fill = "Churn")
First, I’ve plotted the frequency table for gender and churn…visually we can quickly see that the churn result seems about the same for both genders. We’ll use another method below to look at the relationship, but a graph like this would pretty quickly strike this categorical variable from the interesting list.
Contract Length:
df %>%
  group_by(contract) %>%
  count(churn) %>%
  ggplot(aes(x = contract, y = n, fill = churn)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  scale_fill_brewer(palette = "Set1") +
  labs(y = "Frequency", x = "Contract", fill = "Churn")
In contrast to gender, when I plot a frequency table for “Contract” against “Churn” an interesting relationship seems to jump out. Customers on a one-year or two-year contract seem to churn at a much lower rate than customers on a month-to-month contract. In itself, this isn’t an entirely unexpected observation, but this chart would get the variable “Contract” on the interesting list of variables to dig into further.
Tenure in Months:
df %>%
  ggplot(aes(x = tenure_months, fill = churn)) +
  geom_density(alpha = .4) +
  scale_fill_brewer(palette = "Set1") +
  labs(title = "Monthly Tenure by Churn Status", fill = "Churn",
       x = "Tenure Months", y = "")
This plot looks different because the variable “tenure_months” is a quantitative variable, so we can look at a whole range of values, not just Yes/No. Here I’ve plotted the number of months each customer has been with the company, with the fill color indicating whether or not they have churned. Somewhat interestingly, once a customer makes it past the first year, they appear less and less likely to churn. Again, this is the type of interesting plot that would warrant further examination.
Predicting Churn: An Introduction to Logistic Regression
To predict customer churn, one of the most accessible and widely used techniques is logistic regression. While the name might sound technical, the concept is straightforward: logistic regression is a mathematical method that helps us determine the likelihood of an event happening—in this case, whether a customer will churn.
Think of it like drawing a line through a scatterplot of data points, where each point represents a customer. The goal isn’t just to divide the data into “churners” and “non-churners,” but to estimate the probability that any given customer might churn. These probabilities, which range between 0 and 1, can then guide targeted interventions.
Unlike traditional linear regression, which predicts continuous values like sales or revenue, logistic regression focuses on binary outcomes—yes or no, churn or stay. It does this by applying a special mathematical function that transforms raw data into probabilities, making it ideal for this kind of prediction.
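That “special mathematical function” is the logistic (or sigmoid) function, which squashes any real-valued score into the 0-to-1 range. As a minimal sketch in R:

```r
# The logistic (sigmoid) function maps any real-valued score to a probability in (0, 1)
sigmoid <- function(z) 1 / (1 + exp(-z))

sigmoid(0)   # 0.5: a score of zero is a coin flip
sigmoid(4)   # close to 1: strong evidence of churning
sigmoid(-4)  # close to 0: strong evidence of staying
```

Under the hood, logistic regression fits a linear combination of the features and passes that score through this function to produce a churn probability.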
In this blog, I’ll use logistic regression to analyze the Kaggle dataset, identifying the key factors that influence churn and demonstrating how businesses can act on these insights. In future posts, we’ll explore more advanced techniques like Random Forest, but for now, we’ll focus on this foundational and highly effective method.
Selecting Features: Using the Chi-Squared Test for Categorical Variables
Before diving into logistic regression, it’s essential to identify which features (or variables) in our dataset are most relevant to predicting churn. We started that process informally above with plots; here I’ll describe a more formal way to select variables. For categorical variables, one powerful tool for this is the Chi-Squared test.
The Chi-Squared test helps us determine if there is a meaningful relationship between two categorical variables—in this case, between each feature (e.g., contract type, payment method) and whether or not a customer churned. It works by comparing the observed data (what actually happened) to what we’d expect if there were no relationship at all.
For example, if churn and contract type were completely unrelated, we’d expect customers with monthly contracts to churn at the same rate as those with annual contracts. The Chi-Squared test quantifies the difference between this “expected” scenario and the actual data, providing a numerical result called the p-value. A low p-value indicates that the feature and churn are likely connected and should be considered in our model.
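To make the “expected” scenario concrete, here’s a toy example (with made-up counts, not the Telco data) showing how the expected count for each cell is computed under the assumption of independence:

```r
# Toy 2x2 table (hypothetical counts): contract type vs. churn
observed <- matrix(c(40, 10, 30, 20), nrow = 2,
                   dimnames = list(contract = c("Monthly", "Annual"),
                                   churn = c("Yes", "No")))

# Under independence, each expected count is (row total * column total) / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
expected

# chisq.test computes the same expected counts internally
chisq.test(observed)$expected
```

The Chi-Squared statistic then sums the scaled squared differences between the observed and expected cells; a big total means the data look nothing like the independent scenario.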
chisq.test(df$contract, df$churn)
Pearson's Chi-squared test
data: df$contract and df$churn
X-squared = 1179.5, df = 2, p-value < 2.2e-16
In this example, the p-value is less than 0.00000000000000022 … very small, which tells us contract is a factor we should use in our analysis.
By applying this test to each categorical variable, we can narrow down our list of features to those with the strongest potential impact on churn. This step ensures our analysis focuses on the most relevant data, making our predictions more accurate and actionable.
To cycle through each variable, the function below will return a table with the chi-squared statistic and the p-value for each categorical variable compared against churn.
# Function to go through and conduct the Chi-Squared test on each factor in the table against Churn
perform_chisq_tests <- function(data, target_var) {
  # Ensure the target variable is a factor
  if (!is.factor(data[[target_var]])) {
    stop("The target variable must be a factor.")
  }

  # Initialize a list to store the results
  results <- list()

  # Iterate through columns in the tibble
  for (col in names(data)) {
    if (col != target_var && is.factor(data[[col]])) {
      # Create a contingency table
      contingency_table <- table(data[[col]], data[[target_var]])

      # Perform the chi-squared test
      test_result <- tryCatch({
        chisq.test(contingency_table)
      }, warning = function(w) {
        message(paste("Warning for column:", col, "-", w$message))
        return(NULL)
      }, error = function(e) {
        message(paste("Error for column:", col, "-", e$message))
        return(NULL)
      })

      # Store the result if successful
      if (!is.null(test_result)) {
        results[[col]] <- list(
          variable = col,
          chi_squared = test_result$statistic,
          p_value = test_result$p.value
        )
      }
    }
  }

  # Convert results to a tibble
  results_df <- bind_rows(results, .id = "variable") %>%
    select(variable, chi_squared, p_value) # Ensure desired columns
  return(results_df)
}
results <- perform_chisq_tests(df, "churn") %>%
  arrange(-chi_squared)
results
# A tibble: 16 × 3
variable chi_squared p_value
<chr> <dbl> <dbl>
1 contract 1180. 7.33e-257
2 online_security 847. 1.40e-184
3 tech_support 825. 7.41e-180
4 internet_service 729. 5.83e-159
5 payment_method 645. 1.43e-139
6 online_backup 599. 7.78e-131
7 device_protection 556. 1.96e-121
8 dependents 432. 7.10e- 96
9 streaming_movies 374. 5.35e- 82
10 streaming_tv 372. 1.32e- 81
11 paperless_billing 257. 8.24e- 58
12 senior_citizen 158. 2.48e- 36
13 partner 158. 3.97e- 36
14 multiple_lines 11.3 3.57e- 3
15 phone_service 0.874 3.50e- 1
16 gender 0.475 4.90e- 1
The resulting table ranks our categorical variables by the Chi-Squared statistic. At the bottom of the list we see “gender”. The earlier plot suggested that there wasn’t an interesting relationship between it and “churn”…this metric confirms that thought.
At the top of the list is “contract”…again, a confirmation of what was shown earlier in the plot.
Selecting Features: Using the t-Test for Quantitative Variables
For quantitative variables, we turn to the t-test, a statistical tool that helps determine whether a measurable feature (like total charges or monthly bill amount) is related to customer churn. The t-test operates within the framework of hypothesis testing, allowing us to systematically assess the significance of relationships in the data.
Here’s how it works: we start with the null hypothesis, which assumes there is no relationship between the quantitative variable and churn—essentially, that the feature’s average value is the same for customers who churned and those who stayed. The t-test then calculates the likelihood that this assumption is true by comparing the means of the two groups.
The result of the test is a p-value, which represents the probability of observing the data if the null hypothesis were correct. A low p-value (typically less than 0.05) suggests that the null hypothesis is unlikely to be true, meaning there is likely a meaningful difference in the variable between churned and non-churned customers.
For example, if customers who churned had significantly higher monthly charges than those who stayed, the t-test would reveal this difference through a low p-value, indicating the variable is worth including in our model. Like the Chi-Squared test for categorical variables, the t-test ensures we focus on the features that are most likely to impact churn, setting the stage for a more accurate logistic regression analysis.
t.test(df$monthly_charges ~ df$churn)
Welch Two Sample t-test
data: df$monthly_charges by df$churn
t = -18.341, df = 4139.7, p-value < 2.2e-16
alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
95 percent confidence interval:
-14.53786 -11.72998
sample estimates:
mean in group No mean in group Yes
61.30741 74.44133
t.test(df$tenure_months ~ df$churn)
Welch Two Sample t-test
data: df$tenure_months by df$churn
t = 34.972, df = 4045.5, p-value < 2.2e-16
alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
95 percent confidence interval:
18.56811 20.77364
sample estimates:
mean in group No mean in group Yes
37.65001 17.97913
t.test(df$total_charges ~ df$churn)
Welch Two Sample t-test
data: df$total_charges by df$churn
t = 18.801, df = 4042.9, p-value < 2.2e-16
alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
95 percent confidence interval:
916.8121 1130.2840
sample estimates:
mean in group No mean in group Yes
2555.344 1531.796
Preparing the Data: Splitting into Training and Testing Sets
Before we can apply logistic regression, it’s crucial to split our dataset into two parts: a training set and a testing set. This step ensures that our analysis is robust and that our predictions can generalize to new, unseen data.
The training set, which we’ll create using 80% of the data, is used to “teach” the logistic regression algorithm. This is where the model identifies patterns and relationships between the selected features and whether or not a customer churned. However, to truly evaluate how well the model performs, we need to test it on data it hasn’t encountered before.
That’s where the testing set comes in—the remaining 20% of the data is held back, unseen by the model during training. Once the model is trained, we’ll use the testing set to assess its accuracy and reliability. By comparing the model’s predictions against the actual outcomes in the testing set, we can measure how well it’s likely to perform on new data in a real-world scenario.
This split between training and testing is a foundational step in machine learning and ensures that our churn predictions aren’t just accurate for the dataset at hand but are also scalable and actionable for broader applications.
# Calculate the row indices for the training set (80%)
training_indices <- sample(1:nrow(df), size = 0.8 * nrow(df))

# Create the training set (80% of the data)
df_train <- df %>%
  slice(training_indices)

# Create the testing set (remaining 20% of the data)
df_test <- df %>%
  slice(-training_indices)
Running the Algorithm
Below I have set up the logistic regression model in R. With the model built, the code predicts churn for the 20% test group, then outputs data on how well the regression predicted churn for that test set.
Note…here I’m essentially using ‘generic’ settings to get a result. The purpose of this post isn’t to go into detail on the fine tuning of the algorithm…future posts will dig more deeply into the details.
ctrl <- trainControl(method = "cv", number = 10, classProbs = TRUE,
                     summaryFunction = twoClassSummary)

glm.fit <- train(churn ~ senior_citizen + partner + dependents + tenure_months +
                   phone_service + multiple_lines + internet_service + online_security +
                   online_backup + device_protection + tech_support + streaming_tv +
                   streaming_movies + contract + paperless_billing + payment_method +
                   monthly_charges + total_charges,
                 data = df_train, method = "glm", metric = "ROC",
                 preProcess = c("center", "scale"), trControl = ctrl)

glm.predictions <- glm.fit %>%
  predict(df_test)

glm.cm <- data.frame(glm = confusionMatrix(glm.predictions, df_test$churn,
                                           positive = "Yes", mode = "everything")$byClass)
confusionMatrix(glm.predictions, df_test$churn, positive = "Yes", mode = "everything")
Confusion Matrix and Statistics
Reference
Prediction No Yes
No 951 159
Yes 101 196
Accuracy : 0.8152
95% CI : (0.7939, 0.8352)
No Information Rate : 0.7477
P-Value [Acc > NIR] : 1.027e-09
Kappa : 0.4822
Mcnemar's Test P-Value : 0.0004078
Sensitivity : 0.5521
Specificity : 0.9040
Pos Pred Value : 0.6599
Neg Pred Value : 0.8568
Precision : 0.6599
Recall : 0.5521
F1 : 0.6012
Prevalence : 0.2523
Detection Rate : 0.1393
Detection Prevalence : 0.2111
Balanced Accuracy : 0.7281
'Positive' Class : Yes
Interpreting the Logistic Regression Results: The Confusion Matrix
After running our logistic regression model and comparing its predictions against the test dataset, we use a confusion matrix to summarize the results. Here’s what the matrix and accompanying statistics tell us in simple terms:
| | Actual: No | Actual: Yes |
|---|---|---|
| Predicted: No | 951 | 159 |
| Predicted: Yes | 101 | 196 |
True Negatives (951): The model correctly predicted 951 customers who did not churn.
False Negatives (159): The model missed 159 customers who actually churned but were predicted to stay.
True Positives (196): The model correctly predicted 196 customers who churned.
False Positives (101): The model incorrectly predicted 101 customers would churn, but they didn’t.
Key Performance Metrics
Accuracy (81.52%): Overall, the model made correct predictions 81.52% of the time. This means it generally performs well at distinguishing between churned and non-churned customers.
Sensitivity (55.21%): Sensitivity, or recall, measures how well the model identifies customers who churned. Here, it caught only 55.21% of actual churners, meaning some customers at risk of leaving were missed.
Specificity (90.40%): Specificity shows how well the model identifies customers who stayed. At 90.40%, it is highly effective at correctly predicting non-churners.
Precision (65.99%): This metric evaluates how many of the customers predicted to churn actually did. About 66% of the churn predictions were correct.
Balanced Accuracy (72.81%): Balances sensitivity and specificity, showing the model performs fairly well overall, even though sensitivity is lower.
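These metrics all follow directly from the four cells of the matrix; as a quick arithmetic check in R:

```r
# Cells from the confusion matrix above
tn <- 951; fn <- 159; fp <- 101; tp <- 196

accuracy    <- (tp + tn) / (tp + tn + fp + fn)  # 0.8152
sensitivity <- tp / (tp + fn)                   # 0.5521 (also called recall)
specificity <- tn / (tn + fp)                   # 0.9040
precision   <- tp / (tp + fp)                   # 0.6599
balanced    <- (sensitivity + specificity) / 2  # 0.7281
```

Each value matches the corresponding line in the caret output, which is a useful sanity check that we’re reading the matrix correctly.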
What Does This Mean?
The model does a good job of identifying customers who are not at risk of churn (high specificity), but it struggles a bit to catch all customers who are at risk (lower sensitivity). This means while the model is reliable for confirming who will stay, it may miss some opportunities to retain churn-prone customers.
Improving sensitivity could be a key next step, perhaps by refining the features used, adjusting thresholds, or exploring more advanced algorithms. Still, with an accuracy of over 81%, this model offers a strong foundation for making proactive, data-driven decisions to reduce churn.
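One concrete lever is the decision threshold. By default, a customer is labeled a churner when the predicted probability exceeds 0.5; lowering that cutoff flags more customers as churn risks, trading specificity for sensitivity. Here’s a toy sketch on synthetic data (not the Telco set) of the idea:

```r
set.seed(1)

# Synthetic data: one predictor, binary outcome generated from a logistic model
x <- rnorm(500)
p <- 1 / (1 + exp(-2 * x))
y <- factor(ifelse(runif(500) < p, "Yes", "No"))

# Fit a logistic regression and get predicted churn probabilities
fit <- glm(y ~ x, family = binomial)
probs <- predict(fit, type = "response")

# Lowering the cutoff from 0.5 to 0.35 flags at least as many potential churners
sum(probs > 0.35) >= sum(probs > 0.5)  # TRUE
```

The same idea applies to the caret model above: predict class probabilities with `type = "prob"` and cut them at whatever threshold delivers the sensitivity the retention team needs.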
Conclusion: A High-Level Walkthrough of Predicting Customer Churn
In this blog, I walked through the key steps for building a churn prediction model, focusing on the high-level process rather than diving deep into every detail. While I touched only briefly on Exploratory Data Analysis (EDA), I used statistical tools like the t-test and Chi-Squared test to select meaningful features from the dataset. This ensured the model was built on variables with the strongest potential to explain churn.
After splitting the data into training and testing sets, I applied logistic regression—a foundational algorithm for binary classification problems like churn prediction. The result was a model with over 80% accuracy, a solid foundation for identifying customers likely to leave.
In a real-world scenario, this would be just the beginning. I would refine and tune the model further, exploring different features, adjusting thresholds, or trying advanced algorithms to improve performance. However, the purpose of this blog was to provide a structured, high-level overview of the process, and I think I successfully achieved that goal.
With a roadmap of these steps in hand, businesses can begin using data-driven insights to proactively address churn, improve customer retention, and ultimately drive growth. Future posts will delve deeper into techniques like EDA, model tuning, and advanced algorithms to build on the foundation laid here.