janitor - Your local R handypackage

code

Work smarter, not harder.

Published

March 7, 2025

A (fairly) short post today. I want to introduce you to a great little R package that I load in every one of my scripts, because I never know when I might need to use it. janitor contains several functions, but there are two primary ones that I want to draw your attention to - clean_names and tabyl.

1 clean_names()

If you have control over the names of the variables in your dataset, can I please implore you to adopt one of the two recommended naming conventions - camelCase or snake_case. I prefer the latter for its readability, so I would suggest that one if you don’t have a preference. A somewhat frustrating (and completely avoidable) issue that I sometimes have to deal with when being sent a dataset, is to spend time renaming variables. White spaces in variable names are by biggest bugbear, but the inclusion of symbols, the capitalisation of some letters, and the use of repeated identical names, follow closely behind. R doesn’t like any of this.

If proactive measures aren’t possible, we can use clean_names in a reactive manner. I will use the same example provided by the package author in their vignette to illustrate how useful clean_names can be. Let’s say we receive a spreadsheet that we import into our R workspace with the following column names:

Code

library(tidyverse)
library(janitor)
# Create a data.frame with dirty names
test_df <- as.data.frame(matrix(ncol = 6))
names(test_df) <- c("firstName", "ábc@!*", "% successful (2009)",
                    "REPEAT VALUE", "REPEAT VALUE", "")
names(test_df)

[1] "firstName"           "ábc@!*"              "% successful (2009)"
[4] "REPEAT VALUE"        "REPEAT VALUE"        ""

Horrible!

Fixing this, however, is as simple as running clean_names (snake_case is the default) on the dataframe:

Code

test_df <-  clean_names(test_df)
names(test_df)

[1] "first_name"              "abc"                    
[3] "percent_successful_2009" "repeat_value"           
[5] "repeat_value_2"          "x"

Beautiful!

There is some customisation available within the function and I would encourage you to check that out too.

2 tabyl()

R’s base table() command gets the job done, but I dare anyone to say it does it nicely. Having some experience with Stata, its equivalent tabulate command provides the readability/presentability luxury of a Rolls-Royce compared to table()’s cramped and bland version of a Datsun Sunny 120Y (the car my good friend drove us around in while we were at Uni BTW - as I said, it gets the job done!)

Let me give you an example. We’ll use the Star Wars dataset that comes with the dplyr package. We’re interested in first tabulating the categories of eye colour in humans, and then after that, cross-tabulating eye-colour with gender.

2.1 One-way Tabulation

Base R’s tabulation of eye-colour gives:

Code

# Load data
humans <- starwars |> 
  filter(species == "Human")
# Base R table
table(humans$eye_color)


     blue blue-gray     brown      dark     hazel   unknown    yellow 
       12         1        16         1         2         1         2

You can imagine that with a factor that contains more categories, the listing starts to spill out over another row (sometimes more) in your console. It’s just not pleasant to quickly look at and interpret. In comparison, in its most basic call, tabyl() gives:

Code

humans |> 
  tabyl(eye_color, show_na = T)

# A tibble: 7 × 3
  eye_color     n percent
  <chr>     <int>   <dbl>
1 blue         12  0.343 
2 blue-gray     1  0.0286
3 brown        16  0.457 
4 dark          1  0.0286
5 hazel         2  0.0571
6 unknown       1  0.0286
7 yellow        2  0.0571

It prints in a much more readable column format, and provides percentages, as well as frequencies, by default.

To save keystrokes, I have written a function for myself with a few further embellishments - marginal totals as well as makeshift borders (note you will need to have installed the knitr package if you want to use this). My aim with this and the subsequent function I am going to share with you was to emulate Stata’s tabulate command as much as possible.

Code

tab1 <- function(df = dat, var, show.miss = T) {
  df |> 
    tabyl({{var}}, show_na = show.miss) |> 
    adorn_totals("row") |> 
    adorn_pct_formatting(rounding = "half up", digits = 2) |> 
    knitr::kable()
}

Then, with a simple tab1(humans, eye_color, show.miss = T) I can get:

Code

tab1(humans, eye_color, T)

The show.miss = T will include any missing data as a separate category. You can set this to F if you prefer.

2.2 Two-way Tabulation

Let’s extend the example further by now cross-tabulating eye-colour with gender. The Base R version gives:

Code

table(humans$eye_color, humans$gender)

           
            feminine masculine
  blue             3         9
  blue-gray        0         1
  brown            4        12
  dark             0         1
  hazel            1         1
  unknown          1         0
  yellow           0         2

This is actually more readable than the one-way tabulation. In comparison, tabyl() produces:

Code

humans |>
  tabyl(eye_color, gender, show_na = T)

# A tibble: 7 × 3
  eye_color feminine masculine
  <chr>        <dbl>     <dbl>
1 blue             3         9
2 blue-gray        0         1
3 brown            4        12
4 dark             0         1
5 hazel            1         1
6 unknown          1         0
7 yellow           0         2

In its default state, not that different I agree. However, to save some time, and also include a couple of useful additional summaries, I have written myself a function for cross-tabulations as well. Again, I have customised this to my liking, by including marginal totals and bordering. Note that as I have specified row percentages, the marginal row totals all add up to 100%.

Code

tab2 <- function(df = dat, var1, var2, show.miss = T) {
  df |>
    tabyl({{var1}}, {{var2}}, show_na = show.miss) |>
    adorn_totals("row") |>
    adorn_totals("col") |>
    adorn_percentages("row") |>
    adorn_pct_formatting(rounding = "half up", digits = 2) |>
    adorn_ns() |>
    knitr::kable()
}

Now I simply type tab2(humans, eye_color, gender, show.miss = T) to get:

Code

tab2(humans, eye_color, gender, show.miss = T)

I better stop there. How is it that my ‘short’ posts even somehow become long? Until next time…