Model Prediction/Visualisation
3. Non-Linear Functional Forms

code

concept

modelling

visualisation

Relaxing the linearity assumption does not make prediction/visualisation any more difficult.

Published

May 2, 2025

You’ll be glad to know this is going to be a relatively short post. In fact, in discussing aspects of model prediction and visualisation, the non-linear functional form is almost not worthy of a separate post in itself. BUT, I’m going to walk you through an example anyway - mainly to highlight that the process is no more difficult, nor different, to that of predicting and visualising the linear relationship.

To illustrate this, we’ll again use the birthwt dataset. If you recall from the first post in this series, when we considered the modelling of birthweight as a function of maternal age (and later, sex), we made the somewhat unrealistic assumption that maternal age was linearly associated with birthweight. With this constraint in place, the model suggested to us that increasing maternal age was associated with increasing birthweight. At the time I suggested this was biologically implausible and guessed that the association would be more of an inverted U-shape - i.e. birthweight increasing with maternal age to a certain point and then decreasing thereafter.

So, let’s relax the assumption of linearity by including a restricted cubic spline term on maternal age and see what the simplest model with that as the sole predictor shows.

Note

Note that to better capture potential non-linearity, I have increased the ‘resolution’ of maternal age when creating the new data grid. I originally used 5 year intervals which was more than enough when we assumed the relationship was linear, but have now amended that to 1 year intervals.

Our model then becomes:

mod <- lm(bwt ~ ns(age, 3), data = birthwt)

And the visualisation of the model prediction is:

Code

library(tidyverse)
library(splines)
data(birthwt, package = "MASS")
# Model
mod <-  lm(bwt ~ ns(age, 3), data = birthwt)
# Create new data
newdf <-  expand.grid(age = seq(15, 45, by = 1))
# Predict
newdf <-  newdf |> 
  mutate(pred = predict(mod, newdata = newdf))
# Visualise
ggplot(data = newdf, aes(x = age, y = pred)) +
  geom_line(linewidth = 1) +
  geom_point(data = birthwt, aes(x = age, y = bwt)) +
  xlab("Maternal Age") + ylab("Predicted Birthweight") +
  scale_y_continuous(limits = c(0, 5000), breaks = seq(0, 5000, by = 1000)) +
  theme_bw(base_size = 20) +
  theme(legend.position = "bottom")

Ok, so clearly there’s more to this relationship than I had anticipated, but we already have more information than when we modelled the association assuming it was linear. Based on this prediction, birthweight remains relatively stable until the mother’s age approaches 30 years, then increases thereafter. This still seems somewhat counter-intuitive to my mind, and in fact we may have good reason for not paying too much attention to the model prediction at older maternal ages. It turns out that only 20 out of 189 (i.e ~ 10% of mothers) are older than 30 years - and only 1 mother is older than 40 years. In addition, that single data point appears, on cursory examination, to exert considerable influence on the model fit.

What happens if we refit the model without that observation?

Code

library(tidyverse)
library(splines)
data(birthwt, package = "MASS")
# Model
mod <-  lm(bwt ~ ns(age, 3), data = subset(birthwt, age < 40))
# Create new data
newdf <-  expand.grid(age = seq(15, 40, by = 1))
# Predict
newdf <-  newdf |> 
  mutate(pred = predict(mod, newdata = newdf))
# Visualise
ggplot(data = newdf, aes(x = age, y = pred)) +
  geom_line(linewidth = 1) +
  geom_point(data = birthwt, aes(x = age, y = bwt)) +
  xlab("Maternal Age") + ylab("Predicted Birthweight") +
  scale_y_continuous(limits = c(0, 5000), breaks = seq(0, 5000, by = 1000)) +
  theme_bw(base_size = 20) +
  theme(legend.position = "bottom")

This appears a little more sensible than with the inclusion of the last data point, so we’ll leave that observation out of all further calculations. Now let’s also consider adjusting the prediction for potentially confounding factors that could be distorting the relationship as it currently stands. We’ll start by first considering maternal smoking.

The association between maternal age and birthweight adjusted for maternal smoking can be visualised as:

Code

# Relabel smoking
birthwt$smoke <-  factor(birthwt$smoke, levels = c(0,1), labels = c("No", "Yes"))
# Model
mod2 <-  lm(bwt ~ ns(age, 3) * smoke, data = subset(birthwt, age < 40))
# Create new data
newdf <-  expand.grid(age = seq(15, 40, by = 1),
                      smoke = levels(birthwt$smoke))
# Predict
newdf <-  newdf |>
  mutate(pred = predict(mod2, newdata = newdf))
# Visualise
ggplot(data = newdf, aes(x = age, y = pred, color = smoke)) +
  geom_line(linewidth = 1) +
  geom_point(data = birthwt, aes(x = age, y = bwt)) +
  xlab("Maternal Age") + ylab("Predicted Birthweight") + labs(color = "Smoking Status") +
  scale_y_continuous(limits = c(0, 5000), breaks = seq(0, 5000, by = 1000)) +
  theme_bw(base_size = 20) +
  theme(legend.position = "bottom")

Now we can see that amongst smokers, birthweight declines with maternal age. However, we still observe that same oddly increasing profile of birthweight with maternal age amongst the non-smokers. Perhaps other factors are also at play? Let’s just adjust for the full complement as we did in the last post (at least that we can visualise).

Thus, the association between maternal age and birthweight adjusted for maternal smoking, history of hypertension, race and uterine irritability, can be visualised as:

Code

# Relabel history of hypertension
birthwt$ht <-  factor(birthwt$ht, levels = c(0, 1), labels = c("No", "Yes"))
birthwt$race <-  factor(birthwt$race, levels = c(1,2,3), labels = c("White", "Black", "Other"))
birthwt$ui <-  factor(birthwt$ui, levels = c(0, 1), labels = c("No", "Yes"))
# Model
mod3 <-  lm(bwt ~ ns(age,3) * smoke * race * ui * ht, data = subset(birthwt, age < 40))
# Create new data
newdf <-  expand.grid(age = seq(15, 40, by = 1),
                      smoke = levels(birthwt$smoke),
                      race = levels(birthwt$race),
                      ui = levels(birthwt$ui),
                      ht = levels(birthwt$ht))
# Create labels for plotting race
label_race <- c("Race: White", "Race: Black", "Race: Other")
names(label_race) <- c("White", "Black", "Other")
# Create labels for plotting ui
label_ui <- c("Uterine Irritability: No", "Uterine Irritability: Yes")
names(label_ui) <- c("No", "Yes")
# Predict
newdf <-  newdf |> 
  mutate(pred = predict(mod3, newdata = newdf))
# Visualise
ggplot(data = newdf, aes(x = age, y = pred, group = interaction(smoke, ht), color = smoke, linetype = ht)) +
  geom_line(linewidth = 1) +
  geom_point(data = birthwt, aes(x = age, y = bwt)) +
  xlab("Maternal Age") + ylab("Predicted Birthweight") + labs(color = "Smoking Status", linetype = "History of Hypertension") +
  scale_y_continuous(limits = c(0, 5000), breaks = seq(0, 5000, by = 1000)) +
  facet_grid(ui ~ race, labeller = labeller(race = label_race, ui = label_ui)) +
  theme_bw(base_size = 20) +
  theme(legend.position = "bottom")

We’re now seeing that the predictions further depend on the factors adjusted for. I am not going to attempt to explain any of this - my feeling is that there probably isn’t enough data at older maternal ages to come to any firm conclusions about the nature of the true relationship. Furthermore, this is NOT a good model - there is not enough data to support the estimation of all the parameters involved when you interact a spline term with multiple covariates. But that’s ok - ensuring model validity wasn’t really the point of this post. Instead it was to show you the simplicity of predicting and visualising non-linear functional forms. And I hope that I have managed to do that.

Until next time…

--- title: "Model Prediction/Visualisation <br> 3. Non-Linear Functional Forms" date: 2025-05-02 categories: [code, concept, modelling, visualisation] #image: "images/plot.png" description: "Relaxing the linearity assumption does not make prediction/visualisation any more difficult." --- You'll be glad to know this is going to be a relatively short post. In fact, in discussing aspects of model prediction and visualisation, the non-linear functional form is almost not worthy of a separate post in itself. BUT, I'm going to walk you through an example anyway - mainly to highlight that the process is no more difficult, nor different, to that of predicting and visualising the linear relationship. To illustrate this, we'll again use the `birthwt` dataset. If you recall from the first post in this series, when we considered the modelling of birthweight as a function of maternal age (and later, sex), we made the somewhat unrealistic assumption that maternal age was linearly associated with birthweight. With this constraint in place, the model suggested to us that increasing maternal age was associated with increasing birthweight. At the time I suggested this was biologically implausible and guessed that the association would be more of an inverted U-shape - i.e. birthweight increasing with maternal age to a certain point and then decreasing thereafter. So, let's relax the assumption of linearity by including a restricted cubic spline term on maternal age and see what the simplest model with that as the sole predictor shows. ::: callout-note Note that to better capture potential non-linearity, I have increased the 'resolution' of maternal age when creating the new data grid. I originally used `5` year intervals which was more than enough when we assumed the relationship was linear, but have now amended that to `1` year intervals. ::: Our model then becomes: `mod <- lm(bwt ~ ns(age, 3), data = birthwt)` And the visualisation of the model prediction is: ```{r} #| message: false #| warning: false library(tidyverse) library(splines) data(birthwt, package = "MASS") # Model mod <- lm(bwt ~ ns(age, 3), data = birthwt) # Create new data newdf <- expand.grid(age = seq(15, 45, by = 1)) # Predict newdf <- newdf |> mutate(pred = predict(mod, newdata = newdf)) # Visualise ggplot(data = newdf, aes(x = age, y = pred)) + geom_line(linewidth = 1) + geom_point(data = birthwt, aes(x = age, y = bwt)) + xlab("Maternal Age") + ylab("Predicted Birthweight") + scale_y_continuous(limits = c(0, 5000), breaks = seq(0, 5000, by = 1000)) + theme_bw(base_size = 20) + theme(legend.position = "bottom") ``` Ok, so clearly there's more to this relationship than I had anticipated, but we already have more information than when we modelled the association assuming it was linear. Based on this prediction, birthweight remains relatively stable until the mother's age approaches `30` years, then increases thereafter. This still seems somewhat counter-intuitive to my mind, and in fact we may have good reason for not paying too much attention to the model prediction at older maternal ages. It turns out that only `20` out of `189` (i.e \~ `10%` of mothers) are older than `30` years - and only `1` mother is older than `40` years. In addition, that single data point appears, on cursory examination, to exert considerable influence on the model fit. What happens if we refit the model without that observation? ```{r} #| message: false #| warning: false library(tidyverse) library(splines) data(birthwt, package = "MASS") # Model mod <- lm(bwt ~ ns(age, 3), data = subset(birthwt, age < 40)) # Create new data newdf <- expand.grid(age = seq(15, 40, by = 1)) # Predict newdf <- newdf |> mutate(pred = predict(mod, newdata = newdf)) # Visualise ggplot(data = newdf, aes(x = age, y = pred)) + geom_line(linewidth = 1) + geom_point(data = birthwt, aes(x = age, y = bwt)) + xlab("Maternal Age") + ylab("Predicted Birthweight") + scale_y_continuous(limits = c(0, 5000), breaks = seq(0, 5000, by = 1000)) + theme_bw(base_size = 20) + theme(legend.position = "bottom") ``` This appears a little more sensible than with the inclusion of the last data point, so we'll leave that observation out of all further calculations. Now let's also consider adjusting the prediction for potentially confounding factors that could be distorting the relationship as it currently stands. We'll start by first considering maternal smoking. The association between maternal age and birthweight adjusted for maternal smoking can be visualised as: ```{r} #| message: false #| warning: false # Relabel smoking birthwt$smoke <- factor(birthwt$smoke, levels = c(0,1), labels = c("No", "Yes")) # Model mod2 <- lm(bwt ~ ns(age, 3) * smoke, data = subset(birthwt, age < 40)) # Create new data newdf <- expand.grid(age = seq(15, 40, by = 1), smoke = levels(birthwt$smoke)) # Predict newdf <- newdf |> mutate(pred = predict(mod2, newdata = newdf)) # Visualise ggplot(data = newdf, aes(x = age, y = pred, color = smoke)) + geom_line(linewidth = 1) + geom_point(data = birthwt, aes(x = age, y = bwt)) + xlab("Maternal Age") + ylab("Predicted Birthweight") + labs(color = "Smoking Status") + scale_y_continuous(limits = c(0, 5000), breaks = seq(0, 5000, by = 1000)) + theme_bw(base_size = 20) + theme(legend.position = "bottom") ``` Now we can see that amongst smokers, birthweight declines with maternal age. However, we still observe that same oddly increasing profile of birthweight with maternal age amongst the non-smokers. Perhaps other factors are also at play? Let's just adjust for the full complement as we did in the last post (at least that we can visualise). Thus, the association between maternal age and birthweight adjusted for maternal smoking, history of hypertension, race and uterine irritability, can be visualised as: ```{r} #| message: false #| warning: false # Relabel history of hypertension birthwt$ht <- factor(birthwt$ht, levels = c(0, 1), labels = c("No", "Yes")) birthwt$race <- factor(birthwt$race, levels = c(1,2,3), labels = c("White", "Black", "Other")) birthwt$ui <- factor(birthwt$ui, levels = c(0, 1), labels = c("No", "Yes")) # Model mod3 <- lm(bwt ~ ns(age,3) * smoke * race * ui * ht, data = subset(birthwt, age < 40)) # Create new data newdf <- expand.grid(age = seq(15, 40, by = 1), smoke = levels(birthwt$smoke), race = levels(birthwt$race), ui = levels(birthwt$ui), ht = levels(birthwt$ht)) # Create labels for plotting race label_race <- c("Race: White", "Race: Black", "Race: Other") names(label_race) <- c("White", "Black", "Other") # Create labels for plotting ui label_ui <- c("Uterine Irritability: No", "Uterine Irritability: Yes") names(label_ui) <- c("No", "Yes") # Predict newdf <- newdf |> mutate(pred = predict(mod3, newdata = newdf)) # Visualise ggplot(data = newdf, aes(x = age, y = pred, group = interaction(smoke, ht), color = smoke, linetype = ht)) + geom_line(linewidth = 1) + geom_point(data = birthwt, aes(x = age, y = bwt)) + xlab("Maternal Age") + ylab("Predicted Birthweight") + labs(color = "Smoking Status", linetype = "History of Hypertension") + scale_y_continuous(limits = c(0, 5000), breaks = seq(0, 5000, by = 1000)) + facet_grid(ui ~ race, labeller = labeller(race = label_race, ui = label_ui)) + theme_bw(base_size = 20) + theme(legend.position = "bottom") ``` We're now seeing that the predictions further depend on the factors adjusted for. I am not going to attempt to explain any of this - my feeling is that there probably isn't enough data at older maternal ages to come to any firm conclusions about the nature of the true relationship. Furthermore, this is NOT a good model - there is not enough data to support the estimation of all the parameters involved when you interact a spline term with multiple covariates. But that's ok - ensuring model validity wasn't really the point of this post. Instead it was to show you the simplicity of predicting and visualising non-linear functional forms. And I hope that I have managed to do that. Until next time...