The Perils of Stepwise Regression - And what to do instead

code

concept

Data-driven variable selection does not represent best practice in applied statistics.

Published

May 29, 2026

1 Introduction

As a clinician/researcher you might not be aware that data-driven variable selection methods are widely regarded within the statistical community as the the equivalent of Facebook’s ‘Crazy Uncle’ - an individual who is misinformed, opinionated and someone we just can’t easily shed from our lives.

So what’s the parallel I’m trying to draw, you might ask?

Well, despite decades of criticism in the statistical literature, stepwise regression (as a form of data-driven variable selection) continues to appear in academic papers, regulatory submissions, and industry analyses, when it is universally recognised as plain wrong. Despite that, stepwise regression remains one of the most commonly taught and used model-building techniques in applied statistics. Indeed, I was taught these methods during my Masters of Biostatistics 10 years ago. The appeal is obvious: when faced with many candidate predictors and limited sample size, stepwise procedures promise an automated, objective way to “let the data decide” which variables matter.

Note

When I refer to “data-driven” or “stepwise” I am meaning any decision process leading to the addition or removal of candidate variables from a regression model that is based on data-derived statistics such as significance, R-squared, Akaike Information Criteria (AIC), etc…

So then, what’s the problem?

Well, there are many. The most notable I will summarise from page 68 of Frank Harrell’s modern statistical classic - Regression Modelling Strategies:

The R-squared or even adjusted R-squared values of the end model are biased high.
The F and Chi-square test statistics of the final model do not have the claimed distribution.
The standard errors of coefficient estimates are biased low and confidence intervals for effects and predictions are falsely narrow.
The p values are too small (there are severe multiple comparison problems in addition to problems 2. and 3.) and do not have the proper meaning, and it is difficult to correct for this.
The regression coefficients are biased high in absolute value and need shrinkage but this is rarely done.
Variable selection is made arbitrary by collinearity.
It allows us to not think about the problem.

The net effect of the above is that we may end up making claims, assertions or clinical decisions based on incorrect evidence.

In this post I will attempt to make the case that stepwise regression is not just suboptimal, but seriously flawed when used for inference, explanation, or decision-making. The problems are not subtle or academic; they directly undermine the validity, reproducibility, and interpretability of results. I will outline why stepwise regression fails, why these failures are amplified in applied research (particularly clinical and epidemiological settings), and what modern, defensible alternatives look like in practice.

The goal is not to shame analysts who use stepwise regression — I too am guilty of the multitude of sins I am about to discuss — but to provide a clear roadmap towards best practice.

2 Why Stepwise Regression Persists

Let’s get the elephant in the room out of the way. If there are so many issues with stepwise regression why is it still so commonly used in the analysis of clinical data? Well, I don’t think there is a single answer to this question, but rather several potential factors that probably interplay in maintaining its veneer of contemporary methodological relevance:

Historical inertia: it has been taught for decades and appears in older textbooks.
Fear of change: researchers may worry that journal editors/reviewers do not appreciate new approaches.
Cognitive appeal: produces a single, seemingly parsimonious model.
Ease of implementation: many statistical packages make stepwise selection easy and prominent.
Misplaced objectivity: automated procedures appear neutral, even when they enforce arbitrary thresholds on our data.

These justifications, when considered on an individual basis, become hard to defend. But before we look at what we can do instead, let’s first delve a little deeper into the specific problems associated with stepwise regression.

3 What is Stepwise Regression?

Most of you already know what I’m talking about, but let’s recap the basics for those who don’t.

Stepwise regression refers to a family of automated variable selection procedures applied to regression models. The most common variants are:

Forward selection: start with no predictors, then add variables one at a time based on some criterion (often the smallest p-value or largest improvement in AIC).
Backward elimination: start with all candidate predictors, then remove variables that fail to meet a significance threshold.
Bidirectional (stepwise) selection: alternate between adding and removing variables at each step.

These procedures are typically driven by hypothesis tests (e.g. p < 0.05) or information criteria (AIC, BIC), and they terminate once no further steps improve the chosen criterion. The end result is a single model containing a data-driven subset of the original pool of available candidate predictor variables.

4 The Core Statistical Problems

Although I mentioned these at the outset of this post (based on Harrell’s text), let’s flesh some of them out in more detail now.

4.1 Invalid Inference After Selection

Perhaps the most fundamental problem is that standard inferential quantities are wrong after stepwise selection. When using data-driven variable selection:

P-values, confidence intervals, and standard errors are computed as if the model had been specified in advance.

But in reality, the data were used at least twice: once to select variables, and again to estimate their effects.

This reuse of data leads to:

underestimated standard errors,
overly narrow confidence intervals,
and inflated statistical significance.

You may not be aware that statistical inference methods using p values, standard errors and confidence intervals were designed to be valid when applied once to a pre-specified model, not iteratively reapplied to a model where chance partly informs the decision at each step. This invalidates the nominal properties of a statistical test, giving us as researchers, a false sense of certainty.

4.2 Multiple Testing in Disguise

Stepwise regression performs a large number of implicit hypothesis tests while pretending to conduct only a few. Each step involves testing multiple candidate variables, but no adjustment is made for multiplicity.

Viewed correctly, stepwise selection is a form of uncorrected multiple testing with a stopping rule. The resulting family-wise Type I error rate can be dramatically higher than the nominal level, especially when predictors are correlated.

This means that “statistically significant” predictors selected by stepwise methods are often artifacts of chance rather than real signals.

4.3 Overfitting and Optimism

Stepwise procedures are designed to optimise in-sample fit. As a result, they tend to overfit, especially when:

the number of predictors is large relative to the sample size,
predictors are correlated,
or the true signal-to-noise ratio is modest.

Overfitted models appear impressive in the training data but perform poorly on new data. This optimism is rarely quantified or acknowledged when stepwise regression is used.

4.4 Model Instability

One of the most damaging properties of stepwise regression is instability. Small perturbations in the data - e.g. removing a few observations, changing a significance threshold, or using a different random split - can produce entirely different selected models.

This instability arises because stepwise selection operates near decision boundaries. When predictors are correlated or effects are weak, tiny changes can flip inclusion decisions.

An unstable model cannot be trusted for interpretation, explanation, or clinical decision-making.

4.5 Biased Coefficient Estimates

Even when a predictor truly has an effect, stepwise regression tends to overestimate its magnitude if it is selected. This phenomenon - sometimes called the “winner’s curse” - occurs because variables are selected given they look large in the sample. The result is that reported effect sizes tend to be exaggerated, with subsequent studies often failing to replicate the findings.

5 Why These Problems Matter in Applied Research

In some purely predictive settings, poor inference may be tolerable. In most applied research, however, regression models are used to:

draw causal conclusions,
inform clinical or policy decisions,
support regulatory submissions,
or generate scientific knowledge.

In these contexts, stepwise regression is especially problematic.

5.1 Clinical and Epidemiological Studies

In medical research, variable inclusion is often interpreted causally - even when analysts disclaim causal intent. A covariate that survives stepwise selection may be described as a “risk factor” or “independent predictor”, while excluded variables are implicitly treated as unimportant.

But this is deeply misleading. Stepwise regression does not distinguish between confounders, mediators, and colliders, and it frequently excludes clinically important variables simply because their effects are imprecisely estimated.

5.2 Regulatory and Reporting Contexts

In regulatory settings, such as clinical trial reporting, transparency and reproducibility are paramount. Stepwise regression undermines both, because:

the analysis path is data-driven and difficult to justify prospectively,
results may not reproduce across datasets or populations,
and inferential quantities lack a clear interpretation.

For these reasons, many regulatory guidelines explicitly discourage data-driven variable selection.

6 What to Do Instead

Ok, we’ve spent a fair amount of time talking about the problem. Let’s now discuss what we can actually do about it. The good news is that we are not short of alternatives. In fact, modern approaches are often simpler, more transparent, and more defensible.

I am going to discuss these approaches in the context of “model intent” - i.e. what is the purpose for your regression model? This ultimately boils down to one of two options - explanation vs prediction. Model explanation and model prediction answer fundamentally different questions. Explanatory models aim to estimate and interpret the effect of variables, often with a causal lens, which makes pre-specification, subject-matter knowledge, and valid inference essential. An example research question that such a model could answer might be:

“What is the effect of DMT x on the risk of disease relapse, after adjusting for age, sex, smoking status, and baseline disease severity?”

i.e. we are interested primarily in the explanatory ‘effect’ of a new DMT on disease relapse.

Predictive models, by contrast, are judged by how well they perform on new data, not by whether their coefficients are interpretable or causally meaningful - in other words, we don’t really care about individual variable ‘effects’ but rather just the overall predictive power of the model. Here, the commensurate research question takes a different form:

“Using age, sex, smoking status, current therapy and disease severity data at baseline, what is an individual’s 5-year risk of disease relapse?”

i.e. we want to predict relapse risk but don’t really care about any individual predictor.

6.1 Explanation

6.1.1 Pre-Specification and Subject-Matter Knowledge

When we are interested in model explanation, the most robust approach is also the least glamorous: decide in advance (i.e. prior to seeing the data) which variables belong in the model, and make that our one and only model.

Here, variable inclusion should be guided by:

scientific understanding,
causal reasoning,
and study design.

Consideration should be given to sythesising these justifications in a supporting causal diagram (DAG - see the previous post).

Note that if a variable is a known confounder, it should be included regardless of its p-value. In contrast, if it is irrelevant to the scientific question, it should not be tested for inclusion.

This approach prioritises validity over convenience. We are very interested in understanding the characteristics of each individual predictor variable (i.e. is it a potential confounder, collider, mediator, etc in the exposure -> outcome relationship), and the interplay between all such variables. Again, this is where a causal diagram, even if we are not planning an explicit causal analysis, can be very helpful. But this is where the hard work ends in explanatory modelling. From a coding perspective, things are easier than ever - we just run whatever model we have decided on (and nothing more), comfortable in the fact that our p-values and confidence intervals are correct.

6.1.2 Transparent Sensitivity Analyses

If, for whatever reason, pre-specification is not possible and variable selection cannot be avoided, resultant models should be treated as exploratory and accompanied by:

sensitivity analyses,
stability assessments,
and clear disclaimers about inferential limitations.

This is still inferior to pre-specification, but far better than unqualified stepwise regression.

6.2 Prediction

6.2.1 Full Models with Shrinkage

In contrast, when we are interested in model prediction our goals change. We are simply after raw predictive performance and it matters less about the individual variables that comprise the model or how they were chosen. There is actually a fairly nice solution to the problem of variable selection in the context of prediction modelling, and that is “regularisation” or “penalised regression” methods. These methods acknowledge uncertainty more honestly than stepwise procedures.

Regularisation is central to prediction because it controls overfitting by shrinking model coefficients, trading a small amount of bias for a large reduction in variance and thus improving performance on new data. Ridge regression shrinks all coefficients toward zero and is particularly effective when many predictors have small, correlated effects. Lasso applies stronger shrinkage that can set some coefficients exactly to zero, producing sparse models that are easier to deploy but can be unstable when predictors are highly correlated. Elastic net combines ridge and lasso penalties, often giving the best predictive performance in practice by encouraging sparsity while retaining groups of correlated predictors.

Keep reading to the end of this section to see an example of how penalised regression with Elastic net is used.

6.2.2 Internal vs External Validation

Because we are ultimately interested in predictive power, once we have established a prediction model we should evaluate its performance. There are multiple ways to do this (see here for a good primer), with the main techniques being:

cross-validation,
bootstrap resampling,
or, ideally, external validation datasets.

The general idea in each case is that we “train” our model on a dataset and then “test” its predictive power on new data. The reason for this is that it is not uncommon when we create a model from a single dataset that the model usually fits the peculiarities of that dataset much better than one it hasn’t seen before (i.e. it overfits). Thus, predictive performance will often be less on a “test” compared to a “train” dataset. In cross-validation and bootstrapping we are using sleight of hand to create “train” and “test” data from the only dataset we have - these are referred to as internal validation methods. While external validation is better because we can train the model on all of our data and then test on a completely unseen dataset, it is harder to achieve in practice for obvious reasons.

Whatever the method of model performance evaluation, the focus now changes from p-values and confidence intervals to predictive accuracy and generalisability.

6.2.3 Model Averaging

Sometimes we may end up with more than one candidate prediction model that have similar predictive power but are based on different modelling strategies (e.g. lasso vs random forest vs GAM). Model averaging is an approach that accounts for uncertainty about model choice by combining predictions or estimates from multiple plausible prediction models rather than relying on a single selected model. Instead of treating one model as “the truth,” it weights each model according to a measure of support - such as predictive performance, information criteria, or posterior probability - and averages across them. This can improve predictive accuracy and reduce overconfidence, particularly when several models perform similarly well. Model averaging is most natural when there is genuine uncertainty among a discrete set of competing models, and it contrasts with single-model selection approaches that ignore this uncertainty.

6.2.4 Elastic Net Example

Alright, let’s put some of what we’ve learned into practice. Let’s simulate some data and run both a stepwise regression and an elastic net, comparing the performance of each. For the simulation I will create a dataset with 500 observations and 20 predictors. I’ll make the regression coefficients for the first 5 predictors non-zero (representing real effects) and set the coefficients for the last 15 to zero (representing noise). I’ll also specify that all predictors are moderately correlated with each other (r = 0.4). With this data-generating process we can then meaningfully compare how stepwise regression and elastic net behave when only a few predictors truly matter (and hopefully find that the latter approach performs better).

Code

set.seed(123)

n <- 500
p <- 20

Sigma <- matrix(0.4, p, p)
diag(Sigma) <- 1

library(MASS)
X <- mvrnorm(n, mu = rep(0, p), Sigma = Sigma)

beta <- c(2, 2, 2, -1.5, -1.5, rep(0, p - 5))
y <- X %*% beta + rnorm(n, sd = 2)

dat <- data.frame(y, X)
colnames(dat) <- c("y", paste0("X", 1:p))

Now let’s create train and test datasets from the simulated data. To do this we’ll randomly choose 70% of the observations for the train dataset and allocate the remaining 30% to the test dataset.

Code

set.seed(456)
idx <- sample(seq_len(n), size = 0.7 * n)

train <- dat[idx, ]
test  <- dat[-idx, ]

We will now perform a classic stepwise regression using step() to automate the process. step() fits multiple models moving back and forth between a model with no predictors (intercept only model) and one with all 20 predictors (full model). At each stage it computes the AIC for each candidate model, moves to the model with the lowest AIC and then stops when no further addition or deletion of variables improves the AIC.

We can see that the first five predictors are correctly selected, but the algorithm also selects four noise predictors, two of which are deemed statistically significant.

Code

full_mod <- lm(y ~ ., data = train)
null_mod <- lm(y ~ 1, data = train)
  
step_mod <- step(
  null_mod,
  scope = list(lower = null_mod, upper = full_mod),
  direction = "both",
  trace = FALSE
)

summary(step_mod)


Call:
lm(formula = y ~ X1 + X3 + X5 + X2 + X4 + X15 + X10 + X14 + X18, 
    data = train)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.6759 -1.2016  0.0696  1.2556  5.7055 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.07313    0.10297   0.710   0.4781    
X1           2.08653    0.11979  17.418   <2e-16 ***
X3           2.12815    0.13199  16.123   <2e-16 ***
X5          -1.61482    0.12589 -12.827   <2e-16 ***
X2           2.02325    0.12951  15.622   <2e-16 ***
X4          -1.58574    0.12656 -12.530   <2e-16 ***
X15          0.30294    0.13056   2.320   0.0209 *  
X10         -0.32018    0.12951  -2.472   0.0139 *  
X14          0.22121    0.13102   1.688   0.0922 .  
X18         -0.20767    0.13261  -1.566   0.1183    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.907 on 340 degrees of freedom
Multiple R-squared:  0.7858,    Adjusted R-squared:  0.7802 
F-statistic: 138.6 on 9 and 340 DF,  p-value: < 2.2e-16

Let’s check the performance by using the model to predict on the test data, then calculating a common regression performance measure - the root mean square error (RMSE). The lower the RMSE, the better.

Code

y_pred_step <- predict(step_mod, newdata = test)
rmse_step <- sqrt(mean((test$y - y_pred_step)^2))
rmse_step

[1] 2.018798

Now let’s compare this to an elastic net, using the glmnet package, which fits penalised regression models (ridge, lasso, elastic net). We can break this code block down into a series of steps as follows:

glmnet fits many elastic net models, each with:
- the same predictors,
- the same alpha = 0.5 (mix of ridge and lasso),
- different values of lambda (penalty strength).
It then performs 10-fold cross-validation:
- the training data are split into 10 folds,
- each fold is held out in turn,
- prediction error is estimated for each lambda.
And finally, chooses the optimal penalty:
- lambda.min → lowest cross-validated error,
- lambda.1se → largest λ within 1 SE of the minimum (simpler model).

This step answers: “How much regularisation gives the best out-of-sample performance?”

Code

library(glmnet)

# glmnet requires predictors as a numeric matrix, not a data frame.
X_train <- as.matrix(train[, -1]) # contains all predictors but remove the outcome variable (y)
y_train <- train$y                # response
X_test  <- as.matrix(test[, -1])
y_test  <- test$y

cv_enet <- cv.glmnet(
  X_train, y_train,
  alpha = 0.5,                    # elastic net
  nfolds = 10
)

# Extract the coefficients from the single elastic net model corresponding to the optimal λ.
coef(cv_enet, s = "lambda.min")

21 x 1 sparse Matrix of class "dgCMatrix"
              lambda.min
(Intercept)  0.066848837
X1           2.012777264
X2           1.921988809
X3           2.043667440
X4          -1.469708904
X5          -1.503814250
X6           0.010503422
X7          -0.111528143
X8          -0.065850506
X9          -0.080697600
X10         -0.232749640
X11          0.010956657
X12          .          
X13          0.120277174
X14          0.166942224
X15          0.237722152
X16          0.030739719
X17          0.058957699
X18         -0.115478884
X19         -0.046143403
X20          0.004132131

Note that even when the true data-generating process is sparse, elastic net does not necessarily recover the true set of predictors, particularly when predictors are correlated. This is not a failure: the elastic net objective is to minimise prediction error, not to identify the true model. At the penalty that optimises cross-validated performance, elastic net may retain many small coefficients because correlated noise variables still carry predictive information. In other words, if allowing small non-zero coefficients on “noise” predictors improves prediction even slightly, elastic net will do that.

Now let’s compare the prediction error from both modelling approaches.

Code

y_pred_enet <- predict(cv_enet, X_test, s = "lambda.min")
rmse_enet <- sqrt(mean((y_test - y_pred_enet)^2))
rmse_enet

[1] 1.999773

Code

c(
  Stepwise_RMSE   = rmse_step,
  ElasticNet_RMSE = rmse_enet
)

  Stepwise_RMSE ElasticNet_RMSE 
       2.018798        1.999773

The elastic net is slightly better, but ultimately they’re pretty similar. But that doesn’t mean that this exercise has been a waste of time. Even when stepwise regression and elastic net achieve similar RMSE on a given test set, elastic net is preferable because it achieves this performance through explicit regularisation rather than discrete variable selection.

Penalised models are:

more stable to sampling variation,
handle correlated predictors more gracefully,
and control overfitting directly.

The apparent equivalence in point performance masks important differences in reliability and robustness, which become evident under resampling or repeated evaluation. Penalised regression does not eliminate sampling variability, but it responds to it smoothly. Unlike stepwise regression, which makes brittle include–exclude decisions, elastic net adjusts coefficients continuously, allowing predictive performance to remain stable even when individual coefficients fluctuate.

All of this is to say that if we ran these models many times across different datasets, penalised regression will still give us a relatively stable model and predictive performance - stepwise methods won’t.

7 A Pragmatic Takeaway

Stepwise regression persists because it is easy, familiar, and superficially tidy. But its apparent simplicity masks deep statistical flaws that undermine inference, reproducibility, and scientific credibility.

Modern alternatives - pre-specification for explanatory modelling; and shrinkage, validation ± model-averaging for prediction modelling - are not only more principled, they are often easier to explain and defend.

The question is no longer whether stepwise regression is flawed. The question is why we continue to use it when better tools are readily available.

If you want to read more about this topic I’d suggest having a look at the following papers:

Well that’s pretty much it for this post folks. Hopefully you can start to incorporate some of these techniques into your next regression modelling endeavour. See you next month.