---
title: "Everything is a Linear Model"
date: 2024-04-19
categories: [analysis, concept, code, modelling]
description: "The t-test and linear model with one grouping variable are two sides of the same coin."
---
I want to share with you a secret - maybe you already know it. It took me a while into my statistical learning to realise this, and since then I've seen people write about it (see [here](https://lindeloev.github.io/tests-as-linear/) and [here](https://danielroelfs.com/blog/everything-is-a-linear-model/) for examples). But the basic idea is that many of the common statistical tests we use (e.g. the t-test, ANOVA, etc.) are really nothing more than variations on the general linear model that we're all accustomed to:
$$ y = ax + b $$
The former are specific-use tests, whereas the latter is an 'umbrella' model that can be broadly adapted to accomplish each of the same tasks - perhaps there's something to be said for learning just one set of syntax. Let me illustrate this with one example using the two-sample t-test. We'll use the `genderweight` dataset from the `datarium` package in R, which consists of the body weights of 40 subjects (20 males, 20 females). We're interested in working out whether there is a gender difference. A look at the data shows:
```{r}
#| label: data
#| message: false
library(ggplot2)
library(kableExtra)
data("genderweight", package = "datarium")
head(genderweight, 10) |>
  kable(align = "c", digits = 2)
```
# Plot the Data
It's always helpful to first plot the data:
```{r}
#| label: plot
#| message: false
ggplot(genderweight, aes(x = group, y = weight)) +
  geom_jitter(size = 3, width = 0.05) +
  scale_y_continuous(limits = c(50, 100), breaks = seq(50, 100, by = 10)) +
  stat_summary(fun = mean,
               geom = "errorbar",
               aes(ymax = after_stat(y), ymin = after_stat(y)),
               width = 0.25) +
  theme_bw(base_size = 20)
```
# Two-Sample t-Test
Now, we can run our standard t-test as follows (by default, computing the Welch version of the test which does not assume the same variances in each group). In words, we are asking to test the difference in weight by group (i.e. males vs females).
```{r}
#| label: ttest
#| message: false
t.test(weight ~ group, data = genderweight)
```
This output tells us that the mean weight of females and males is `63.5 kg` and `85.8 kg`, respectively. Furthermore, the 95% C.I. for the difference in those two means (note that it does not report the actual difference itself) is `(-24.5, -20.1)`, and since the interval does not contain `0` the difference is statistically significant (as also reflected in the p-value).
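As a quick sanity check, we can compute those group means directly from the data (the chunk label here is just an arbitrary name I've added):

```{r}
#| label: means
# Group means computed directly - these should match the
# 'sample estimates' reported by t.test() above
aggregate(weight ~ group, data = genderweight, FUN = mean)
```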
# Linear Model
Now, the equivalent linear model (i.e. linear regression) in `R` is simply:
```{r}
#| label: lm
#| message: false
summary(lm(weight ~ group, data = genderweight))
```
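Before reading the coefficients, it helps to recall how R encodes a factor in a regression: `group` is dummy-coded with females as the reference level, so the model actually being fitted is:

$$ \text{weight} = \beta_0 + \beta_1 x_M + \varepsilon, \qquad x_M = \begin{cases} 0 & \text{female} \\ 1 & \text{male} \end{cases} $$

Setting $x_M = 0$ gives $\beta_0 = \mu_F$, and setting $x_M = 1$ gives $\beta_0 + \beta_1 = \mu_M$, so the coefficient on $x_M$ estimates the difference $\mu_M - \mu_F$.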
The output is slightly different but the information contained is almost the same. `(Intercept)` represents the mean weight in the reference category of the `group` variable (in this case females). `groupM` represents the difference in means between females and males (`22.3 kg`). Note that the 95% C.I.'s aren't presented as part of this standard output, but we can obtain that information easily enough with:
```{r}
#| label: ci
#| message: false
confint(lm(weight ~ group, data = genderweight))
```
Note the slight difference in the 95% C.I.'s compared to those obtained from the t-test. This is because the general linear model assumes homogeneity of variance across the two groups, whereas the Welch t-test does not.
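In fact, if we re-run the t-test with `var.equal = TRUE` (Student's pooled-variance version, which makes the same homogeneity-of-variance assumption as the linear model), the results should agree with the regression output exactly - up to a sign flip on the interval, since `t.test()` reports the F minus M difference:

```{r}
#| label: ttest-pooled
# Pooled-variance (Student's) t-test: same variance assumption as lm(),
# so the t statistic, p-value and C.I. line up with the linear model
t.test(weight ~ group, data = genderweight, var.equal = TRUE)
```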
Finally, if you would prefer to know the actual mean values of each group as well, it's possible to amend the `lm` call slightly by removing the intercept term. This gives:
```{r}
#| label: lm2
#| message: false
summary(lm(weight ~ group - 1, data = genderweight))
```
The two-sample t-test is just one example of a special case of the general linear model. The first link above contains a neat PDF describing many other special cases, and I'd encourage you to have a look at them. While you might still use these specific tests in your day-to-day work, it is nonetheless helpful to broaden your statistical knowledge with the realisation that the general linear model underlies all of them.
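As one more taste of this, a one-way ANOVA on the same data is again just the linear model summarised as an F-test - with a single two-level factor, the F statistic it reports is simply the square of the t value for `groupM` in the regression above:

```{r}
#| label: anova
# One-way ANOVA via the linear model: for a single 1-df term, F = t^2
anova(lm(weight ~ group, data = genderweight))
```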