Model Prediction/Visualisation
1. The Basics

Categories: code, concept, modelling, visualisation

How to predict from your model.

Published: March 21, 2025

Most of you will be familiar with how to create and run a statistical model in R. But what do you then do with your model results? I bet you take R’s rather bland console output, convert it to a formatted table in your manuscript, and then, based on the regression coefficient values, go on to talk about associations of your exposure/s with your outcome.

And leave it at that.

Well, the old adage of “a picture is worth a thousand words” has never been truer than in the context of model prediction, given how much additional information you can obtain if you take your analyses a step further. Relying on the numbers alone (i.e. the regression coefficients) is sufficient a lot of the time in simple cases, but when models become more complex - for example, through the specification of non-linear covariate functional forms via splines - regression coefficients on their own are essentially useless in aiding our interpretation. Plotting the model predictions, however, can put us back on track for understanding the important relationships in our data. Even a complex model can make more sense when we can visualise its predictions. These are skills worth learning and adding to your data analyst toolkit.

So, I thought I would put together a series of posts on predicting from and visualising your model, where each post covers a different area of the subject. This week it’s just the basics of model prediction and visualisation, but in future posts I hope to extend those ideas to more advanced topics including non-linear functional forms, GLMs, mixed models, etc. (let’s see how we go). But first, let’s start at the start…

1 What is Model Prediction?

When we specify a model using traditional lm() syntax in R, remember that we are just re-expressing the fundamental regression equation you will find in any statistical textbook:

$$\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots$$

Where:

- $\beta_0$ is the intercept (the expected value of $y$ when $x_1$ and $x_2$ are both equal to 0);
- $\beta_1$ and $\beta_2$ are the regression coefficients for $x_1$ and $x_2$, respectively;
- $\beta_1$ is the expected change in $y$ for a one-unit change in $x_1$ (when $x_2$ is held at a constant value); and
- $\beta_2$ is the expected change in $y$ for a one-unit change in $x_2$ (when $x_1$ is held at a constant value).

The coefficients in the regression equation define the relationship between each independent variable (exposure/predictor) and the dependent variable (outcome). But you don’t need to stop there - it’s very easy to enter values for the independent variables into the equation to also predict the expected (mean) value of the dependent variable. Let’s take a look at a very simple example using the built-in mtcars dataset in R. Say we want to examine the relationship between fuel economy (miles/gallon [mpg]) and both weight and transmission type (wt + am). A standard linear regression gives:

Code
library(tidyverse)
library(kableExtra)
library(gtsummary)
data(mtcars)
tbl_regression(lm(mpg ~ wt + am, data = mtcars), intercept = T)
| Characteristic | Beta | 95% CI¹ | p-value |
|---|---|---|---|
| (Intercept) | 37 | 31, 44 | <0.001 |
| wt | -5.4 | -7.0, -3.7 | <0.001 |
| am | -0.02 | -3.2, 3.1 | >0.9 |

¹ CI = Confidence Interval

Interpreting just the regression coefficients, we could say that on average, a one-unit increase in weight (which corresponds here to 1000 pounds) is associated with a reduction in fuel economy of 5.4 mpg when transmission type is held constant. Similarly, we can say that on average, a one-unit increase in transmission type (which corresponds here to simply changing from an automatic to a manual car) is associated with a barely notable reduction in fuel economy of 0.02 mpg, when weight is held constant.

But we can use this model to make some simple predictions as well. You might want to know the expected fuel economy of a manual, 5000 pound car (not very good, I would think). Well, it’s a simple exercise in substitution (recall that wt is expressed in units of 1000 pounds, and am = 1 denotes a manual transmission):

$$\hat{y} = 37.3 - (5.4 \times 5) - (0.02 \times 1) = 10.3$$

Ok, what about the economy of a much lighter, automatic, 2000 pound vehicle?

$$\hat{y} = 37.3 - (5.4 \times 2) - (0.02 \times 0) = 26.5$$

That’s all that prediction is, really - substitution into your regression equation. Prediction in this context isn’t about predicting the future - in fact for most simple models, time plays no role (although it may, depending on the research question and the model we are using). We are typically more interested in predicting the mean value of the dependent variable given specific values of the independent variable/s.
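In fact, R will do the substitution for you via the predict() function, which we will lean on for the rest of this post. Here is a minimal sketch (fit is just an assumed name for the mtcars model fitted above); note that predict() uses the unrounded coefficients, so expect its output to differ slightly from the rounded arithmetic above:

Code
# Check the hand calculations above: a manual (am = 1), 5000 lb (wt = 5)
# car and an automatic (am = 0), 2000 lb (wt = 2) car
fit <- lm(mpg ~ wt + am, data = mtcars)
predict(fit, newdata = data.frame(wt = c(5, 2), am = c(1, 0)))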

While you can predict for any combination of covariate values, the visualisation of model predictions comes into its own when we focus on the relationship between some exposure and an outcome, holding the values of all other covariates in the model fixed. These are, in other words, adjusted predictions, and can help you to answer the question: “What is the expected outcome for certain values or levels of my exposure, when I have removed any other potential confounding effects?”

Alright, I think you get the idea - let’s now take a look at a more concrete example.

2 birthwt Dataset

For our very first example of getting up and running with predicting and visualising from your model, we will use the birthwt dataset that comes with the MASS package. A listing of the variables can be found here, and the first 10 rows of the dataframe look like:

Code
data(birthwt, package = "MASS")
head(birthwt, 10) |> 
  kable(align = "c", digits = 2)
| | low | age | lwt | race | smoke | ptl | ht | ui | ftv | bwt |
|---|---|---|---|---|---|---|---|---|---|---|
| 85 | 0 | 19 | 182 | 2 | 0 | 0 | 0 | 1 | 0 | 2523 |
| 86 | 0 | 33 | 155 | 3 | 0 | 0 | 0 | 0 | 3 | 2551 |
| 87 | 0 | 20 | 105 | 1 | 1 | 0 | 0 | 0 | 1 | 2557 |
| 88 | 0 | 21 | 108 | 1 | 1 | 0 | 0 | 1 | 2 | 2594 |
| 89 | 0 | 18 | 107 | 1 | 1 | 0 | 0 | 1 | 0 | 2600 |
| 91 | 0 | 21 | 124 | 3 | 0 | 0 | 0 | 0 | 0 | 2622 |
| 92 | 0 | 22 | 118 | 1 | 0 | 0 | 0 | 0 | 1 | 2637 |
| 93 | 0 | 17 | 103 | 3 | 0 | 0 | 0 | 0 | 1 | 2637 |
| 94 | 0 | 29 | 123 | 1 | 1 | 0 | 0 | 0 | 1 | 2663 |
| 95 | 0 | 26 | 113 | 1 | 1 | 0 | 0 | 0 | 0 | 2665 |

2.1 Birthweight as a Function of Maternal Age

Let’s initially consider the possible association between baby birthweight (bwt - outcome) and mother’s age (exposure). We will run the model as follows:

Code
mod <-  lm(bwt ~ age, data = birthwt)
tbl_regression(mod, intercept = T)
| Characteristic | Beta | 95% CI¹ | p-value |
|---|---|---|---|
| (Intercept) | 2,656 | 2,185, 3,127 | <0.001 |
| age | 12 | -7.3, 32 | 0.2 |

¹ CI = Confidence Interval

The model seems to suggest that increasing maternal age is associated with increasing birthweight (about 12 g per year of age, although note the wide confidence interval that spans zero). Keep in mind that this is unlikely to represent reality - I would guess that the relationship between maternal age and birthweight is non-linear (i.e. at some maternal age, birthweight is more likely to start decreasing with increasing maternal age). But for the sake of today’s exercise let’s just assume the relationship is linear.

2.1.1 Predict

Calculating predictions from the model is as simple as using the predict() function following your model call, adding the predictions as a new column to the original dataframe (or creating a new dataframe if you desire). Here I will just add the new column containing the predictions to a new dataframe containing only the relevant variables, and print the first 10 rows:

Code
birthwt2 <-  birthwt |> 
  select(bwt, age) |> 
  mutate(pred = predict(mod))
head(birthwt2, 10) |> 
  kable(align = "c", digits = 1)
| | bwt | age | pred |
|---|---|---|---|
| 85 | 2523 | 19 | 2891.9 |
| 86 | 2551 | 33 | 3065.9 |
| 87 | 2557 | 20 | 2904.3 |
| 88 | 2594 | 21 | 2916.8 |
| 89 | 2600 | 18 | 2879.5 |
| 91 | 2622 | 21 | 2916.8 |
| 92 | 2637 | 22 | 2929.2 |
| 93 | 2637 | 17 | 2867.0 |
| 94 | 2663 | 29 | 3016.2 |
| 95 | 2665 | 26 | 2978.9 |
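Under the hood, this is just the substitution we did by hand in the mtcars example. A quick sketch to convince yourself, extracting the fitted (unrounded) coefficients with coef():

Code
# The predictions are simply intercept + slope * age
coef(mod)
head(coef(mod)[1] + coef(mod)[2] * birthwt2$age)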

2.1.2 Visualise

Plotting the model predictions is simple using ggplot2.

Code
ggplot(birthwt2, aes(x = age, y = pred)) +
  geom_line(linewidth = 1)

It can be useful to add in the raw datapoints, so let’s do that, and I’ll add a couple of plotting embellishments as well.

Code
ggplot(data = birthwt2, aes(x = age, y = pred)) +
  geom_line(linewidth = 1) +
  geom_point(data = birthwt2, aes(x = age, y = bwt)) +
  xlab("Maternal Age") + ylab("Predicted Birthweight") +
  theme_bw(base_size = 20)

Great! We’ve generated predictions from our model and visualised the results. That wasn’t too hard.
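As an aside, for a simple linear model with a single continuous predictor, ggplot2 can fit and draw that very same line for you via geom_smooth(). A sketch of the shortcut is below, although it won’t get us far once the model grows beyond one predictor, as we are about to see:

Code
# Equivalent shortcut for the one-predictor case: let ggplot2 fit the
# lm itself and draw the fitted line (se = FALSE hides the confidence band)
ggplot(birthwt2, aes(x = age, y = bwt)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, linewidth = 1) +
  xlab("Maternal Age") + ylab("Birthweight") +
  theme_bw(base_size = 20)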

2.2 Birthweight as a Function of Maternal Age and Smoking

Let’s take it up a step by now including maternal smoking (smoke) as another exposure in our model:

Code
mod2 <-  lm(bwt ~ age + smoke, data = birthwt)
tbl_regression(mod2, intercept = T)
| Characteristic | Beta | 95% CI¹ | p-value |
|---|---|---|---|
| (Intercept) | 2,791 | 2,316, 3,267 | <0.001 |
| age | 11 | -8.2, 31 | 0.3 |
| smoke | -278 | -489, -67 | 0.010 |

¹ CI = Confidence Interval

So, the coefficient for age hasn’t really changed much, but we can see that smoking seems to have a significant effect, decreasing birthweight on average by 278 grams when maternal age is held constant.

Let’s predict birthweight from this model as we did before:

Code
birthwt2 <-  birthwt |> 
  select(bwt, age, smoke) |> 
  mutate(pred = predict(mod2))
head(birthwt2, 10) |> 
  kable(align = "c", digits = 1)
| | bwt | age | smoke | pred |
|---|---|---|---|---|
| 85 | 2523 | 19 | 0 | 3005.7 |
| 86 | 2551 | 33 | 0 | 3163.8 |
| 87 | 2557 | 20 | 1 | 2738.7 |
| 88 | 2594 | 21 | 1 | 2749.9 |
| 89 | 2600 | 18 | 1 | 2716.1 |
| 91 | 2622 | 21 | 0 | 3028.3 |
| 92 | 2637 | 22 | 0 | 3039.6 |
| 93 | 2637 | 17 | 0 | 2983.1 |
| 94 | 2663 | 29 | 1 | 2840.3 |
| 95 | 2665 | 26 | 1 | 2806.4 |

And plot the predictions as we did before:

Code
ggplot(data = birthwt2, aes(x = age, y = pred)) +
  geom_line(linewidth = 1) +
  geom_point(data = birthwt2, aes(x = age, y = bwt)) +
  xlab("Maternal Age") + ylab("Predicted Birthweight") +
  theme_bw(base_size = 20)

Uh-oh! Something has obviously gone badly wrong.

The problem is that we have predicted on the original data and the predictions are the result of the combination of each woman’s unique values of both maternal age and smoking status. In other words, we are not just plotting the predicted value of birthweight as a function of maternal age alone, but rather as a function of both maternal age and smoking status. We are then trying to visualise those predictions as if they were generated from a function of maternal age alone.

As you can then see - “Computer says no”.

Properly visualising predictions when more than one predictor is involved requires that we find a way to generate the adjusted predictions I mentioned earlier. And, of course, there’s a fairly easy way to do that in R.

2.2.1 Predict

When you have more than one covariate in your model, predicting on new data will allow you to generate the adjusted predictions that facilitate sensible visualisation. The predict() function actually contains a newdata argument that you can point towards a ‘grid’ of specified covariate values to predict on. This ‘grid’ is essentially a new dataframe that contains all possible combinations of covariates at salient values (typically a range of values at set intervals for numeric variables and every category/level for factor variables). In this way (adjusted) predictions can be made for one variable while keeping the other variable/s at fixed values.

To construct the grid of new data, we will use the expand.grid() function. I will specify a maternal age range from 15 to 45 in 5 year intervals and a smoking status variable equal to either 0 or 1. This grid of new data (which I have called newdf) looks like:

Code
newdf <-  expand.grid(age = seq(15, 45, by = 5),
                      smoke = c(0,1))
newdf |> 
  kable(align = "c", digits = 1)
| age | smoke |
|---|---|
| 15 | 0 |
| 20 | 0 |
| 25 | 0 |
| 30 | 0 |
| 35 | 0 |
| 40 | 0 |
| 45 | 0 |
| 15 | 1 |
| 20 | 1 |
| 25 | 1 |
| 30 | 1 |
| 35 | 1 |
| 40 | 1 |
| 45 | 1 |
Important

Note two conditions when generating your grid of new data:

1. You MUST include every variable that is specified in your model;
2. The variable names MUST match the names of the corresponding variables in your original dataset.

Failing these, predict() will become upset with you and refuse to do any more work.

Now we are in a position to generate the predictions that we need. Nothing changes compared to before, except we now specify a new dataframe to predict upon using the newdata argument:

Code
newdf <-  newdf |> 
  mutate(pred = predict(mod2, newdata = newdf))
newdf |> 
  kable(align = "c", digits = 1)
| age | smoke | pred |
|---|---|---|
| 15 | 0 | 2960.6 |
| 20 | 0 | 3017.0 |
| 25 | 0 | 3073.5 |
| 30 | 0 | 3129.9 |
| 35 | 0 | 3186.4 |
| 40 | 0 | 3242.8 |
| 45 | 0 | 3299.3 |
| 15 | 1 | 2682.2 |
| 20 | 1 | 2738.7 |
| 25 | 1 | 2795.1 |
| 30 | 1 | 2851.6 |
| 35 | 1 | 2908.0 |
| 40 | 1 | 2964.5 |
| 45 | 1 | 3020.9 |
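As a brief aside, predict() for lm objects can also return a confidence interval around each of these adjusted predictions via its interval argument - something that will come in handy when we eventually want to add uncertainty bands to our plots. A minimal sketch:

Code
# 95% confidence intervals around the adjusted predictions;
# predict.lm() returns a matrix with columns fit, lwr and upr
ci <- predict(mod2, newdata = newdf, interval = "confidence")
head(cbind(newdf[, c("age", "smoke")], round(ci, 1)))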

2.2.2 Visualise

The key when we come to plotting the predictions is to realise that because age is represented on the x-axis, we need some other means of representing smoking status. In this case we can do that quite easily by using the colour aesthetic in our ggplot call:

Code
ggplot(data = newdf, aes(x = age, y = pred, color = factor(smoke))) +
  geom_line(linewidth = 1) +
  geom_point(data = birthwt2, aes(x = age, y = bwt), size = 2) +
  xlab("Maternal Age") + ylab("Predicted Birthweight") +
  labs(color = "Smoking Status") +
  scale_y_continuous(limits = c(0, 5000), breaks = seq(0, 5000, by = 1000)) +
  theme_bw(base_size = 20) +
  theme(legend.position = "bottom")

Note that I converted smoking status to a factor variable on the fly so that ggplot knows that it only has two levels (otherwise it assumes it’s a continuous variable).

So, what do we actually have here, when I call this a plot of adjusted predictions? We have two parallel regression lines - one for smokers (blue) and one for non-smokers (red). They are parallel because the ‘effects’ are additive - we did not specify an interaction between maternal age and smoking status. The red line is the predicted birthweight as a function of maternal age for non-smoking mothers, and the blue line is the predicted birthweight as a function of maternal age for smoking mothers.

In other words, these are predictions of the association between maternal age and newborn birthweight, adjusted for maternal smoking status (we can flip this around and also say that these are predictions of the association between smoking status and newborn birthweight, adjusted for maternal age). The regression lines no longer zig-zag like the line in the earlier plot, because we have separated out the ‘effect’ of the second exposure from the first, allowing a ‘pure’ association between maternal age and newborn birthweight to be observed at each level of smoking status.

These same principles of prediction and visualisation can be applied in nearly all model plotting endeavours. There is, however, a practical limit to the number of covariates that can be visualised at once, and that is something we will look at in more detail in the next post.

I will conclude by saying that for this series I am going to illustrate all concepts using what I consider to be ‘first principles’ - that is, applying the predict() and expand.grid() functions for predicting, and ggplot2 for visualising. But there may be better/faster ways to do this, and a package I came across recently that I think would be worth exploring is ggeffects. I haven’t used it myself, but a quick scan of the package website gives me the impression it will make even shorter work of my ‘first principles’ approach. It is a package that I will keep in mind as we go forward on this topic.
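For the curious, a quick read of the package documentation suggests the whole predict-then-plot workflow above condenses to something along these lines (an untested sketch, per my caveat above; the bracket syntax pins the values of smoke to predict at):

Code
# Untested sketch based on the ggeffects documentation: adjusted
# predictions of bwt across age at smoke = 0 and 1, with a built-in
# plot() method for the result
library(ggeffects)
ggpredict(mod2, terms = c("age", "smoke [0,1]")) |> plot()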

Catch you in a couple of weeks…