Revisiting Churn: Random Forest

Random Forest
Customer Churn
SaaS
Author

Tim Anderson

Published

April 20, 2021

Beyond Logistic Regression: Exploring Random Forest for Churn Prediction

In the previous post, I explored how logistic regression can be used to predict customer churn in subscription-based businesses. The resulting model reached an accuracy of around 81.5%, a solid baseline for identifying customers at risk of leaving. However, logistic regression is just one method among many in predictive analytics. Different algorithms offer different advantages, and sometimes reveal insights that simpler models miss.

In this post, I’ll shift focus to Random Forest, another popular algorithm for classification tasks. Specifically, I’ll contrast it with logistic regression and discuss why a slightly lower accuracy (in this case) doesn’t necessarily mean Random Forest is a poor choice for churn prediction. In many cases, the nuances of what an algorithm can reveal—and how it can handle complex data—may be more valuable than a one- or two-point difference in accuracy.


What Is Random Forest?

Random Forest is an ensemble learning method that builds upon the idea of decision trees. In simple terms, it trains many different decision trees (often hundreds), each on a bootstrap sample of your data and with only a random subset of features considered at each split, then aggregates their individual predictions (via majority vote for classification problems). This “wisdom of the crowd” approach typically yields a model that:

  1. Reduces Overfitting: A single decision tree can become overly complex, memorizing training data rather than learning generalizable patterns. By averaging predictions across many trees, Random Forest smooths out extremes.

  2. Handles Complex Interactions: Logistic regression assumes a linear (or log-linear) relationship between features and churn probability, but Random Forest can naturally capture nonlinear relationships and interactions among variables.

  3. Provides Variable Importance: One of the main perks of Random Forest is a built-in “variable importance” measure, helping you see which features (e.g., contract length, monthly charges) are most predictive of churn.

Despite these strengths, Random Forest can require more computational power and can sometimes be trickier to interpret than a simple logistic regression model—though it’s still often considered more interpretable than other advanced algorithms like deep neural networks.
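To make the majority-vote idea concrete, here is a minimal sketch. It uses R's built-in iris data (trimmed to two classes) rather than the churn data, and the odd tree count of 101 is my own choice so that votes cannot tie. It asks the forest for every tree's individual prediction, tallies the votes by hand, and compares the result with the forest's own answer.

```r
library(randomForest)

set.seed(42)
# Two-class toy problem from the built-in iris data (not the churn data)
toy <- droplevels(subset(iris, Species != "setosa"))

# An odd number of trees avoids tied votes in a two-class problem
rf_toy <- randomForest(Species ~ ., data = toy, ntree = 101)

# predict.all = TRUE returns each tree's prediction alongside the aggregate
p <- predict(rf_toy, newdata = toy[1:5, ], predict.all = TRUE)

# Tally the per-tree votes by hand and take the most common class per row
manual_vote <- apply(p$individual, 1, function(v) names(which.max(table(v))))

data.frame(forest = as.character(p$aggregate), manual = manual_vote)
```

The two columns should agree row for row, which is all "aggregation" means for a classification forest.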


Why Compare Random Forest to Logistic Regression?

When it comes to churn prediction, many organizations start with logistic regression because it’s:

  • Intuitive: Coefficients directly relate to the likelihood of churn, making it easy to explain to non-technical stakeholders.

  • Fast: Training a logistic regression model generally requires fewer computational resources. (Rendering the prior post's markdown file took about 4 seconds; this document takes more than 60 seconds to render.)

  • Solid Baseline: Often performs surprisingly well, especially if your data relationships are near-linear or your dataset is not overly complex.

Random Forest, on the other hand, shines in scenarios where data exhibits many complex interactions or when you want an algorithm that is relatively robust to outliers. (Missing values still need attention: R's randomForest does not accept NAs by default, which is one reason the code below drops incomplete rows.) It can uncover patterns that a linear approach might miss (for example, how the combination of “month-to-month contract” + “online_backup = No” + “gender = Female” might collectively increase churn risk).

In short, each algorithm has its pros and cons, so comparing them can provide a more holistic picture of your churn dynamics.


Let’s set up an environment in R to read in the data and clean it up a bit.

suppressMessages(library(tidyverse))
suppressMessages(library(readxl))
suppressMessages(library(ggmosaic))
suppressMessages(library(caret))
suppressMessages(library(plotly))
suppressMessages(library(randomForest))
suppressMessages(library(RColorBrewer))
suppressMessages(library(lubridate))
suppressMessages(library(janitor))

set.seed(42)
theme_set(theme_minimal())

df_full <- clean_names(read_excel("../../data/Telco_customer_churn.xlsx")) %>%
     mutate(churn_label = as.factor(churn_label),
            gender = as.factor(gender),
            senior_citizen = as.factor(senior_citizen),
            partner = as.factor(partner),
            dependents = as.factor(dependents),
            phone_service = as.factor(phone_service),
            multiple_lines = as.factor(multiple_lines),
            internet_service = as.factor(internet_service),
            online_security = as.factor(online_security),
            online_backup = as.factor(online_backup),
            device_protection = as.factor(device_protection),
            tech_support = as.factor(tech_support),
            streaming_tv = as.factor(streaming_tv),
            streaming_movies = as.factor(streaming_movies),
            contract = as.factor(contract), 
            paperless_billing = as.factor(paperless_billing),
            payment_method = as.factor(payment_method)) %>%
     rename(churn = churn_label)


# Remove unnecessary columns and rows with NA
df <- df_full[,c(10:29)] %>%
  na.omit()

The Data

str(df)
tibble [7,032 × 20] (S3: tbl_df/tbl/data.frame)
 $ gender           : Factor w/ 2 levels "Female","Male": 2 1 1 1 2 1 2 2 2 2 ...
 $ senior_citizen   : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 1 ...
 $ partner          : Factor w/ 2 levels "No","Yes": 1 1 1 2 1 2 1 1 2 2 ...
 $ dependents       : Factor w/ 2 levels "No","Yes": 1 2 2 2 2 1 1 1 2 1 ...
 $ tenure_months    : num [1:7032] 2 2 8 28 49 10 1 1 47 1 ...
 $ phone_service    : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 1 2 2 1 ...
 $ multiple_lines   : Factor w/ 3 levels "No","No phone service",..: 1 1 3 3 3 1 2 1 3 2 ...
 $ internet_service : Factor w/ 3 levels "DSL","Fiber optic",..: 1 2 2 2 2 1 1 3 2 1 ...
 $ online_security  : Factor w/ 3 levels "No","No internet service",..: 3 1 1 1 1 1 1 2 1 1 ...
 $ online_backup    : Factor w/ 3 levels "No","No internet service",..: 3 1 1 1 3 1 1 2 3 3 ...
 $ device_protection: Factor w/ 3 levels "No","No internet service",..: 1 1 3 3 3 3 3 2 1 1 ...
 $ tech_support     : Factor w/ 3 levels "No","No internet service",..: 1 1 1 3 1 3 1 2 1 1 ...
 $ streaming_tv     : Factor w/ 3 levels "No","No internet service",..: 1 1 3 3 3 1 1 2 3 1 ...
 $ streaming_movies : Factor w/ 3 levels "No","No internet service",..: 1 1 3 3 3 1 3 2 3 1 ...
 $ contract         : Factor w/ 3 levels "Month-to-month",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ paperless_billing: Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 1 2 1 ...
 $ payment_method   : Factor w/ 4 levels "Bank transfer (automatic)",..: 4 3 3 3 1 2 3 4 3 3 ...
 $ monthly_charges  : num [1:7032] 53.9 70.7 99.7 104.8 103.7 ...
 $ total_charges    : num [1:7032] 108 152 820 3046 5036 ...
 $ churn            : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
 - attr(*, "na.action")= 'omit' Named int [1:11] 2235 2439 2569 2668 2857 4332 4688 5105 5720 6773 ...
  ..- attr(*, "names")= chr [1:11] "2235" "2439" "2569" "2668" ...

Again, we’re looking at a data frame with 7,032 rows and 20 columns. The key variable, at the bottom, is named ‘churn’, and it’s a Yes/No factor. Other than churn, we have 16 categorical variables (factors), like gender, senior_citizen, etc. And we have three quantitative variables: tenure in months, monthly charges, and total charges.

Since we explored the data in the prior post, I’m going to skip that here.

Separate Training and Testing Data

# Sample the row indices for the training set (80%, rounded down)
training_indices <- sample(seq_len(nrow(df)), size = floor(0.8 * nrow(df)))

# Create the training set (80% of the data)
df_train <- df %>%
  slice(training_indices)

# Create the testing set (remaining 20% of the data)
df_test <- df %>%
  slice(-training_indices)
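The split above is a simple random sample. An alternative I did not use here is a stratified split, which keeps the churn rate roughly equal in both sets. The sketch below shows the idea with caret's createDataPartition() on simulated stand-in data; the `toy` data frame and its roughly 25% churn rate are assumptions for illustration, not the Telco data.

```r
library(caret)

set.seed(42)
# Simulated stand-in for the churn factor (about 25% "Yes"); not the real data
toy <- data.frame(
  id    = 1:1000,
  churn = factor(sample(c("No", "Yes"), 1000, replace = TRUE,
                        prob = c(0.75, 0.25)))
)

# createDataPartition samples within each class, so both sets keep
# roughly the same churn rate as the full data
idx <- createDataPartition(toy$churn, p = 0.8, list = FALSE)
toy_train <- toy[idx, ]
toy_test  <- toy[-idx, ]

# Compare the class shares in the two sets
prop.table(table(toy_train$churn))
prop.table(table(toy_test$churn))
```

With an imbalanced outcome like churn, this guards against a test set that happens to contain unusually few (or many) churners.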

Running the Algorithm

Below I have set up the Random Forest model in R. With the model built, the code then predicts churn for the 20% test group and outputs data on how well the model predicted churn for that test set.

Note: like last time, I’m using ‘generic’ settings to get a result. In fact, I’m simply throwing all of the variables used in the logistic regression into this algorithm. My aim here is to illustrate the approach rather than to get the highest accuracy.

# Train control with 5-fold CV
train_ctrl <- trainControl(method = "cv", number = 5)

# Random Forest training

rf_model <- train(
  churn ~ senior_citizen + partner + dependents + tenure_months + 
             phone_service + multiple_lines + internet_service + online_security + 
             online_backup + device_protection + tech_support + streaming_tv + 
             streaming_movies + contract + paperless_billing + payment_method + 
             monthly_charges + total_charges,
  
  data = df_train,
  method = "rf",
  trControl = train_ctrl,
  tuneLength = 5
)

# Check results
print(rf_model)
Random Forest 

5625 samples
  18 predictor
   2 classes: 'No', 'Yes' 

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 4500, 4501, 4500, 4499, 4500 
Resampling results across tuning parameters:

  mtry  Accuracy   Kappa    
   2    0.7969817  0.4004949
   8    0.7912928  0.4363956
  15    0.7880948  0.4297489
  22    0.7861388  0.4266362
  29    0.7854274  0.4251485

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 2.
# Predict and evaluate
pred_rf <- predict(rf_model, newdata = df_test)
confusionMatrix(pred_rf, df_test$churn, positive = "Yes")
Confusion Matrix and Statistics

          Reference
Prediction   No  Yes
       No  1009  240
       Yes   43  115
                                          
               Accuracy : 0.7989          
                 95% CI : (0.7769, 0.8195)
    No Information Rate : 0.7477          
    P-Value [Acc > NIR] : 3.52e-06        
                                          
                  Kappa : 0.3468          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.32394         
            Specificity : 0.95913         
         Pos Pred Value : 0.72785         
         Neg Pred Value : 0.80785         
             Prevalence : 0.25231         
         Detection Rate : 0.08173         
   Detection Prevalence : 0.11230         
      Balanced Accuracy : 0.64153         
                                          
       'Positive' Class : Yes             
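For readers who want to see where the headline numbers come from, the short calculation below rebuilds them from the four cell counts in the confusion matrix above. The variable names (tp, fn, fp, tn) are my own labels; the counts are copied straight from the output.

```r
# Counts copied from the confusion matrix above (positive class = "Yes")
tp <- 115   # predicted Yes, actually Yes
fn <- 240   # predicted No,  actually Yes
fp <- 43    # predicted Yes, actually No
tn <- 1009  # predicted No,  actually No

accuracy    <- (tp + tn) / (tp + tn + fp + fn)
sensitivity <- tp / (tp + fn)   # share of actual churners caught
specificity <- tn / (tn + fp)   # share of non-churners correctly left alone
precision   <- tp / (tp + fp)   # "Pos Pred Value" in caret's output

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, precision = precision), 4)
```

The results line up with the Accuracy, Sensitivity, Specificity, and Pos Pred Value rows that caret reports.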
                                          

Understanding the Performance Gap

The Random Forest model achieved an accuracy of 79.89%, which is lower than the 81.52% produced by logistic regression. At first glance, we might wonder: “Isn’t Random Forest supposed to be more powerful?”

The truth is, model performance is highly dependent on the dataset—its size, the balance of classes (churn vs. non-churn), the presence of outliers or missing data, the number and nature of predictors, and so forth. Here are a few factors that can explain why Random Forest might lag slightly:

  1. Hyperparameter Tuning
    Random Forest has multiple hyperparameters (e.g., number of trees, mtry controlling how many features are considered per split, the depth of each tree, etc.). Fine-tuning these can significantly improve results, but it also requires extra time and effort.

  2. Data Characteristics
    If much of your data’s predictive power lies in linear relationships—for example, higher monthly charges correlating with higher churn rates—a well-specified logistic regression might do quite well out of the box.

  3. Over- vs. Under-Fitting
    While random forests are generally robust, if the data has very clear linear separation, a simpler model (like logistic regression) could capture that more directly. Meanwhile, the ensemble approach might be effectively “overkill.”

  4. Different Errors, Different Costs
    Accuracy is not the only metric to consider. Sometimes, you care more about sensitivity (catching as many churners as possible) or precision (ensuring you don’t waste retention efforts on customers likely to stay). A model with slightly lower overall accuracy can still be the better choice if it does better on the metric that matters to you. That is not automatic, though: in the run above, sensitivity is only 32.4% (115 of 355 actual churners flagged) against a specificity of 95.9%, so this untuned forest is conservative about predicting churn.
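One lever for trading accuracy against sensitivity is the classification cutoff. The sketch below is illustrative only: it uses simulated two-class data (the `toy` data frame, its columns, and the 0.3 cutoff are all my assumptions) and shows that flagging a customer when the forest's "Yes" probability exceeds 0.3, rather than the default 0.5, can only catch the same churners or more.

```r
library(randomForest)

set.seed(42)
# Simulated two-class data for illustration only (not the Telco data)
n <- 600
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$churn <- factor(ifelse(toy$x1 + rnorm(n) > 0.8, "Yes", "No"))

train_idx <- sample(seq_len(n), 400)
rf_toy <- randomForest(churn ~ ., data = toy[train_idx, ], ntree = 200)

# Class probabilities (share of trees voting "Yes") for the held-out rows
test  <- toy[-train_idx, ]
p_yes <- predict(rf_toy, newdata = test, type = "prob")[, "Yes"]

# Sensitivity = share of actual churners flagged at a given cutoff
sens_at <- function(cutoff) mean(p_yes[test$churn == "Yes"] >= cutoff)

c(cutoff_0.5 = sens_at(0.5), cutoff_0.3 = sens_at(0.3))
```

The price of the lower cutoff is more false alarms, so the right value depends on what a missed churner costs relative to an unnecessary retention offer.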


Beyond Accuracy: Why We Still Might Choose Random Forest

Even with a lower accuracy, Random Forest could still be beneficial. Here’s why:

  1. Robustness
    If your data grows in complexity (e.g., you add more features), Random Forest can often absorb the new variables and their interactions without manual feature engineering, although training time grows along with it. Over time, it may adapt more readily to new data patterns.

  2. Handling Nonlinear Relationships
    Without hand-crafted interaction or polynomial terms, logistic regression fits a linear decision boundary. Random Forest can capture more nuanced relationships, like how very short tenure or very long tenure might both correspond to different churn behaviors, while the middle range is stable.

  3. Feature Importance Insights
    Businesses often appreciate how random forests rank features by importance. It’s a quick way to see which attributes drive churn the most in a nonlinear setting—something logistic regression’s coefficients may not always communicate effectively, especially if there are interactions or polynomial terms in play.

  4. Versatility
    If you expand from churn to other predictive tasks (e.g., up-sell, cross-sell, or next best action), you might find the flexibility of a random forest approach more readily adaptable to different classification problems.
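As a rough illustration of the feature-importance point, the sketch below fits a small forest on R's built-in iris data (a stand-in, not the churn data) and ranks the predictors. The ntree value and the choice to sort by the Gini-based score are my own assumptions.

```r
library(randomForest)

set.seed(42)
# Built-in iris data stands in for the churn data here
rf_toy <- randomForest(Species ~ ., data = iris, ntree = 200,
                       importance = TRUE)

# One row per predictor; MeanDecreaseGini is the impurity-based score,
# MeanDecreaseAccuracy the permutation-based one
imp <- importance(rf_toy)
imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE),
    c("MeanDecreaseAccuracy", "MeanDecreaseGini")]
```

For a model trained through caret, as in this post, caret's varImp() plays a similar role. Either way, the scores rank predictive usefulness and say nothing about the direction of an effect, so they complement rather than replace logistic regression coefficients.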


Conclusion: Balancing Interpretability and Flexibility

Random Forest and logistic regression each offer distinct advantages:

  • Logistic Regression provides a straightforward, interpretable model that can be trained quickly. Its coefficients offer direct insight into how each variable affects the odds of churn.

  • Random Forest excels at capturing complex interactions and delivers robust performance across a wide range of datasets—especially when well-tuned. It also furnishes useful metrics like variable importance, which can complement or enhance business decision-making.

In this dataset, Random Forest gave an accuracy of 79.89%, slightly below the 81.52% from logistic regression. Rather than dismissing Random Forest outright, it’s worth reflecting on how accuracy fits into your broader retention strategy. You might discover that Random Forest picks up on different churn signals or offers more flexible predictions that can be refined over time.

Ultimately, the choice between these algorithms shouldn’t hinge on a single statistic. For many businesses, improving the odds of catching at-risk customers (even at the expense of some false alarms) is more crucial than maximizing overall accuracy. By weighing the unique benefits of each approach, you can incorporate the best of both worlds—building a comprehensive, data-driven churn mitigation strategy that suits your organization’s goals and resources.


Next Steps

  1. Hyperparameter Tuning: Explore deeper hyperparameter tuning for Random Forest to see if the accuracy can be pushed closer to, or beyond, what logistic regression delivers.

  2. Metric Comparison: Look beyond accuracy—focus on sensitivity, precision, AUC, or even cost-based metrics to measure how effectively each model supports retention strategies.

  3. Integrate Insights: Use logistic regression for its interpretability and quick onboarding, and leverage Random Forest for uncovering complex patterns. Both can work in tandem in a real-world churn prevention system.
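As a starting point for the first item, here is a minimal sketch of a more deliberate tuning run with caret. It again uses the built-in iris data so it can run on its own; the grid of mtry values and ntree = 300 are assumptions for illustration, not recommended settings for the churn data.

```r
library(caret)

set.seed(42)
ctrl <- trainControl(method = "cv", number = 5)

# caret's "rf" method tunes only mtry; other randomForest arguments such as
# ntree are passed through train()'s ... and held fixed
grid <- expand.grid(mtry = 1:4)

rf_tuned <- train(
  Species ~ ., data = iris,
  method    = "rf",
  trControl = ctrl,
  tuneGrid  = grid,
  ntree     = 300
)

# The mtry value with the best cross-validated accuracy
rf_tuned$bestTune
```

An explicit tuneGrid lets you choose which mtry values are tried instead of relying on the spacing that tuneLength picks for you.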

By understanding these nuances, you’ll be better prepared to harness advanced machine-learning techniques for deeper insight into your subscription data, ultimately giving you more tools to keep customers happy and reduce churn.