R Programming - Try for DRY

code

concept

A general philosophy on making your code more efficient.

Published

November 15, 2024

Let’s face it - we could all become better R coders. Although I have been using R for quite a few years, there is still so much I could learn to make my code-writing life easier. The problem is that I consider myself a statistician before a programmer, and if I’m going to spend some time in learning something new, I tend to focus more on the statistics side of the job, rather than the syntax. A side-effect of this is that while I can no doubt code, I don’t think I code as efficiently as I could. Rather, I think I have become somewhat of a ‘lazy’ coder. What do I mean by that? Well, I can always write code to get achieve a result - there’s no issue with that - but I find that I can repeat myself a lot. And I tend to do that because, in the moment I think that’s easier than investing a little more time to work out a way to avoid that repetition. Consequently, I find that I may save time initially but this sometimes comes back to bite me later, when I discover either an error or am requested to update the existing code. Doing either of these things means that I have to change every bit of duplicated code, instead of doing it (ideally) just once.

So this is where the idea of DRY (Don’t Repeat Yourself) comes in. DRY is a general principle in software development, but applies equally in coding in R. Undoubtedly the most common realisation of the DRY principle in coding is the creation of a function for re-used logic. This is summed up in the book R for Data Science as:

“A good rule of thumb is to consider writing a function whenever you’ve copied and pasted a block of code more than twice (i.e. you now have three copies of the same code).”

In providing some foundation for how one might become more DRY in their daily work, I’m going to share with you some thoughts for how I approached a recent coding problem I was presented with (indulge my use of the acronym in this way). This will be somewhat of a holistic overview - I am going to focus less on the detail of the actual code and more on the thought process of taking the overall task and formulating a way to break that task down into smaller steps that can then be operated on in a way which avoids all of that nasty repetition.

Note

Of course, there are many, many ways to achieve a coding goal in R and I do not consider what I am presenting to necessarily be the best way. These are, but some approaches of many…

1 The Data

A small subset of the data we are going to work on is shown below.

Code

library(tidyverse)
library(kableExtra)
# Read in data
dat <-  structure(list(
  id = 1:5, 
  var_1 = c("1,1,1,1", "1,1,1,1", "0,0,0,0", "1,1,1,1", "1,1,1,-11000"), 
  var_2 = c("0,8,0,0", "0,0,0,0", "-13000,-13000,-13000,-13000", "0,0,0,0", "0,0,20,-13000"), 
  var_3 = c("0,0,0,0", "0,0,1,2", "-13000,-13000,-13000,-13000", "0,0,0,0", "0,0,0,-13000"), 
  var_4 = c("38,30,38,30", "24,34,28,30", "-13000,-13000,-13000,-13000", "50,60,60,61", "40,16,20,-13000"), 
  var_5 = c("2,4,1,3", "1,1,1,3", "-13000,-13000,-13000,-13000", "0,3,3,0", "4,6,7,-13000"), 
  var_6 = c("4,10,1,1", "1,2,1,3", "-13000,-13000,-13000,-13000", "0,1,0,0", "6,8,4,-13000")), 
  row.names = 1:5, class = "data.frame")
dat |> 
  kable(align = "c", digits = 2)

id	var_1	var_2	var_3	var_4	var_5	var_6
1	1,1,1,1	0,8,0,0	0,0,0,0	38,30,38,30	2,4,1,3	4,10,1,1
2	1,1,1,1	0,0,0,0	0,0,1,2	24,34,28,30	1,1,1,3	1,2,1,3
3	0,0,0,0	-13000,-13000,-13000,-13000	-13000,-13000,-13000,-13000	-13000,-13000,-13000,-13000	-13000,-13000,-13000,-13000	-13000,-13000,-13000,-13000
4	1,1,1,1	0,0,0,0	0,0,0,0	50,60,60,61	0,3,3,0	0,1,0,0
5	1,1,1,-11000	0,0,20,-13000	0,0,0,-13000	40,16,20,-13000	4,6,7,-13000	6,8,4,-13000

Code

glimpse(dat)

Rows: 5
Columns: 7
$ id    <int> 1, 2, 3, 4, 5
$ var_1 <chr> "1,1,1,1", "1,1,1,1", "0,0,0,0", "1,1,1,1", "1,1,1,-11000"
$ var_2 <chr> "0,8,0,0", "0,0,0,0", "-13000,-13000,-13000,-13000", "0,0,0,0", …
$ var_3 <chr> "0,0,0,0", "0,0,1,2", "-13000,-13000,-13000,-13000", "0,0,0,0", …
$ var_4 <chr> "38,30,38,30", "24,34,28,30", "-13000,-13000,-13000,-13000", "50…
$ var_5 <chr> "2,4,1,3", "1,1,1,3", "-13000,-13000,-13000,-13000", "0,3,3,0", …
$ var_6 <chr> "4,10,1,1", "1,2,1,3", "-13000,-13000,-13000,-13000", "0,1,0,0",…

So what do we have here?

There are 7 columns and the first is simply an id variable. The remaining 6 variables are labelled var_1:var_6 and each contain a ‘string’ of numbers in each cell. These represent longitudinal measurements for 6 different quantities at 4 different time points - baseline, and 3, 6 and 12 months. So, Subject 2 has recorded measurements of 24, 34, 28 and 30 at baseline, 3, 6 and 12 months, respectively. Each variable is recorded in character format, so R does not yet recognise these as numbers. You can hopefully appreciate that presenting data in this format is not really ideal (somewhat painful in fact) and that we’ve got a bit of work to do to make this analysis-ready.

2 Breaking The Task Down Into Steps

So what is the actual task that we need to perform on these data, to make them analysis-ready?

Basically, we want to convert the existing dataframe into long format, so that each cell contains only one value and the longitudinal measures appear on successive rows. In other words, we want the following:

id	timepoint	var_1	var_2	var_3	var_4	var_5	var_6
1	base	1	0	0	38	2	4
1	3mo	1	8	0	30	4	10
1	6mo	1	0	0	38	1	1
1	12mo	1	0	0	30	3	1
2	base	1	0	0	24	1	1
2	3mo	1	0	0	34	1	2
2	6mo	1	0	1	28	1	1
2	12mo	1	0	2	30	3	3
3	base	0	-13000	-13000	-13000	-13000	-13000
3	3mo	0	-13000	-13000	-13000	-13000	-13000
3	6mo	0	-13000	-13000	-13000	-13000	-13000
3	12mo	0	-13000	-13000	-13000	-13000	-13000
4	base	1	0	0	50	0	0
4	3mo	1	0	0	60	3	1
4	6mo	1	0	0	60	3	0
4	12mo	1	0	0	61	0	0
5	base	1	0	0	40	4	6
5	3mo	1	0	0	16	6	8
5	6mo	1	20	0	20	7	4
5	12mo	-11000	-13000	-13000	-13000	-13000	-13000

Before rushing head-long into coding up an approach, let’s take a little time to think about what steps we might take to accomplish this goal. What are some self-contained data manipulations we can extract out of the task to make the task seem less imposing, when we subsequently string those manipulations together.

2.1 Step 1

Well the most obvious first thing we could do is to split each column that contains multiple values into multiple columns each containing one value. separate_wider_delim() is a tidyr function that will allow us to do that and the code and output when applied to var_1 is shown below (note that we would need to repeat this for every variable).

temp <-  dat |> 
  select(id, var_1) |> 
  separate_wider_delim(var_1, ",", too_few = "align_start",
                       names = c("var_1_base", "var_1_3mo", "var_1_6mo", "var_1_12mo"))
temp |> 
  kable(align = "c", digits = 2)

id	var_1_base	var_1_3mo	var_1_6mo	var_1_12mo
1	1	1	1	1
2	1	1	1	1
3	0	0	0	0
4	1	1	1	1
5	1	1	1	-11000

That seems to have worked quite well - we now have each value in its own cell!

2.2 Step 2

Now that we’ve found a way to split a column, we can take that output and convert it to long format. We can use pivot_longer to do that very easily (note that we would need to repeat this for every variable).

temp_long <-  temp |> 
  pivot_longer(2:5,
               names_to = "timepoint",
               values_to = "var_1",
               names_prefix = "var_1_")
temp_long |> 
  kable(align = "c", digits = 2)

id	timepoint	var_1
1	base	1
1	3mo	1
1	6mo	1
1	12mo	1
2	base	1
2	3mo	1
2	6mo	1
2	12mo	1
3	base	0
3	3mo	0
3	6mo	0
3	12mo	0
4	base	1
4	3mo	1
4	6mo	1
4	12mo	1
5	base	1
5	3mo	1
5	6mo	1
5	12mo	-11000

2.3 Step 3

Really, we’re nearly there. All that’s left to do at this point is to join the resulting columns together from performing Steps 1 and 2 on each of the remaining variables in the dataset. There are multiple ways you can do that in R.

3 Coding Approaches

Now that we have some basic idea of the steps that we need to take, let’s consider some different coding approaches that we might use to perform the task at hand. All of these will give the same end result.

3.1 Not DRY

In this approach we are going to incorporate the above 3 steps into a single workflow. In doing this we will create 6 temporary dataframes to store the converted contents of each original variable. We will then join the resulting dataframes together. We can actually make this a little more succinct by combining Steps 1 and 2 in a single call, but even so, this approach leads to a LOT of repetition - 55 lines of code to be exact.

# Create individual dataframes with each original column split into multiple columns and converted to long format
temp1 <-  dat |> 
  select(id, var_1) |> 
  separate_wider_delim(var_1, ",", too_few = "align_start",
                       names = c("var_1_base", "var_1_3mo", "var_1_6mo", "var_1_12mo")) |> 
  pivot_longer(2:5,
               names_to = "timepoint",
               values_to = "var_1",
               names_prefix = "var_1_")
temp2 <-  dat |> 
  select(id, var_2) |> 
  separate_wider_delim(var_2, ",", too_few = "align_start",
                       names = c("var_2_base", "var_2_3mo", "var_2_6mo", "var_2_12mo")) |> 
  pivot_longer(2:5,
               names_to = "timepoint",
               values_to = "var_2",
               names_prefix = "var_2_")
temp3 <-  dat |> 
  select(id, var_3) |> 
  separate_wider_delim(var_3, ",", too_few = "align_start",
                       names = c("var_3_base", "var_3_3mo", "var_3_6mo", "var_3_12mo")) |> 
  pivot_longer(2:5,
               names_to = "timepoint",
               values_to = "var_3",
               names_prefix = "var_3_")
temp4 <-  dat |> 
  select(id, var_4) |> 
  separate_wider_delim(var_4, ",", too_few = "align_start",
                       names = c("var_4_base", "var_4_3mo", "var_4_6mo", "var_4_12mo")) |> 
  pivot_longer(2:5,
               names_to = "timepoint",
               values_to = "var_4",
               names_prefix = "var_4_")
temp5 <-  dat |> 
  select(id, var_5) |> 
  separate_wider_delim(var_5, ",", too_few = "align_start",
                       names = c("var_5_base", "var_5_3mo", "var_5_6mo", "var_5_12mo")) |> 
  pivot_longer(2:5,
               names_to = "timepoint",
               values_to = "var_5",
               names_prefix = "var_5_")
temp6 <-  dat |> 
  select(id, var_6) |> 
  separate_wider_delim(var_6, ",", too_few = "align_start",
                       names = c("var_6_base", "var_6_3mo", "var_6_6mo", "var_6_12mo")) |> 
  pivot_longer(2:5,
               names_to = "timepoint",
               values_to = "var_6",
               names_prefix = "var_6_")
# Merge all dataframes together
dat_converted <-  left_join(temp1, temp2)
dat_converted <-  left_join(dat_converted, temp3)
dat_converted <-  left_join(dat_converted, temp4)
dat_converted <-  left_join(dat_converted, temp5)
dat_converted <-  left_join(dat_converted, temp6)
dat_converted |> 
  kable(align = "c", digits = 2)

id	timepoint	var_1	var_2	var_3	var_4	var_5	var_6
1	base	1	0	0	38	2	4
1	3mo	1	8	0	30	4	10
1	6mo	1	0	0	38	1	1
1	12mo	1	0	0	30	3	1
2	base	1	0	0	24	1	1
2	3mo	1	0	0	34	1	2
2	6mo	1	0	1	28	1	1
2	12mo	1	0	2	30	3	3
3	base	0	-13000	-13000	-13000	-13000	-13000
3	3mo	0	-13000	-13000	-13000	-13000	-13000
3	6mo	0	-13000	-13000	-13000	-13000	-13000
3	12mo	0	-13000	-13000	-13000	-13000	-13000
4	base	1	0	0	50	0	0
4	3mo	1	0	0	60	3	1
4	6mo	1	0	0	60	3	0
4	12mo	1	0	0	61	0	0
5	base	1	0	0	40	4	6
5	3mo	1	0	0	16	6	8
5	6mo	1	20	0	20	7	4
5	12mo	-11000	-13000	-13000	-13000	-13000	-13000

What’s wrong with this approach? Nothing, really. The main criticism is that in creating each temporary dataframe we are copying and pasting the relevant code block 5 times. We then need to change the names of variables in the copies - each block has 8 places that a variable name appears, so that means 40 changes across the workflow. That’s a lot of places where we could potentially make an error by simply hitting the wrong key. But even if you don’t make mistakes, consider the all too common scenario where an updated dataset is sent to you at a later date for re-analysis. Sometimes, variable names are inadvertently changed between datasets - that also means many changes in your code.

There must be more robust approaches to this task.

3.2 Becoming DRY

Whatever approach we take from here, there should be one coding concept that we incorporate as a foundational commonality - and that is the function. I’m not going to teach you how to write functions here - there are plenty of resources online and R for Data Science is a good place to start.

3.2.1 Write a function

A function is a good way to encapsulate some code that you might want to use more than once. We know what code we want to re-use - it’s essentially the following block, but we want to generalise it to be usable with any of the 6 variables in the original dataframe, not just var_1.

temp1 <-  dat |> 
  select(id, var_1) |> 
  separate_wider_delim(var_1, ",", too_few = "align_start",
                       names = c("var_1_base", "var_1_3mo", "var_1_6mo", "var_1_12mo")) |> 
  pivot_longer(2:5,
               names_to = "timepoint",
               values_to = "var_1",
               names_prefix = "var_1_")

One version of that code block wrapped up in a function - which I’ve called split_to_long() - is shown below.

split_to_long <-  function(col){
  i <-  substr(col, 5, 5)
  temp <-  dat |>
    select(id, all_of(col)) |>
    separate_wider_delim(all_of(col), 
                         ",", 
                         too_few = "align_start", 
                         names = c(paste0(col,"_", c("base", "3mo", "6mo", "12mo")))) |>
    pivot_longer(2:5,
                 names_to = "timepoint",
                 values_to = paste0("var_", i),
                 names_prefix = paste0("var_", i, "_"))
  if (i > 1){
    temp <-  temp[3]
  }
  temp
}

What does split_to_long() do?

Line 1 declares the function and specifies the single argument it accepts - col (i.e. the name of the column we are interested in splitting and reshaping).
Line 2 gets the variable number belonging to col and assigns it to i. This works here because all variable names follow the same format with the number being at the 5th position in the name.
Lines 3 - 12 take the original dataframe and selects out two variables - the id variable and col. It then splits and reshapes col into long format and saves it to temp - a dataframe.
Lines 13 - 15 tell the function that for any variable other than the first, we don’t need to retain the id and timepoint variables when creating temp.
Line 16 ‘returns’ temp. This is essential to ensure the function captures the result of interest.

Hooray!

We now have a function which is generalisable to any column of the original dataframe. If we call split_to_long("var_1"), we get:

Code

split_to_long("var_1") |> 
  kable(align = "c", digits = 2)

id	timepoint	var_1
1	base	1
1	3mo	1
1	6mo	1
1	12mo	1
2	base	1
2	3mo	1
2	6mo	1
2	12mo	1
3	base	0
3	3mo	0
3	6mo	0
3	12mo	0
4	base	1
4	3mo	1
4	6mo	1
4	12mo	1
5	base	1
5	3mo	1
5	6mo	1
5	12mo	-11000

Similarly, we could call split_to_long("var_6"), and get:

Code

split_to_long("var_6") |> 
  kable(align = "c", digits = 2)

var_6
4
10
1
1
1
2
1
3
-13000
-13000
-13000
-13000
0
1
0
0
6
8
4
-13000

Remember that in this case we asked the function not to keep the id and timeline variables.

The important point to take away from this is that in each case, we have reduced the code from 8 lines to just 1 line, and been able to produce the same result.

3.2.2 Recycle that function

At this juncture, you may be thinking that we’ve done all we can do to be DRY and the only thing left is to call our function on every column in our original dataset. Well, yes, that would work as advertised.

dat_converted <-  cbind(
  split_to_long("var_1"),
  split_to_long("var_2"),
  split_to_long("var_3"),
  split_to_long("var_4"),
  split_to_long("var_5"),
  split_to_long("var_6"))
dat_converted |> 
  kable(align = "c", digits = 2)

id	timepoint	var_1	var_2	var_3	var_4	var_5	var_6
1	base	1	0	0	38	2	4
1	3mo	1	8	0	30	4	10
1	6mo	1	0	0	38	1	1
1	12mo	1	0	0	30	3	1
2	base	1	0	0	24	1	1
2	3mo	1	0	0	34	1	2
2	6mo	1	0	1	28	1	1
2	12mo	1	0	2	30	3	3
3	base	0	-13000	-13000	-13000	-13000	-13000
3	3mo	0	-13000	-13000	-13000	-13000	-13000
3	6mo	0	-13000	-13000	-13000	-13000	-13000
3	12mo	0	-13000	-13000	-13000	-13000	-13000
4	base	1	0	0	50	0	0
4	3mo	1	0	0	60	3	1
4	6mo	1	0	0	60	3	0
4	12mo	1	0	0	61	0	0
5	base	1	0	0	40	4	6
5	3mo	1	0	0	16	6	8
5	6mo	1	20	0	20	7	4
5	12mo	-11000	-13000	-13000	-13000	-13000	-13000

But do you notice that we have inadvertently written code that is again repetitive? It might not seem like much, but we’ve called split_to_long() 6 times - one for each column that we want to convert. Is there a way to be even more DRY than this? Well it turns out that there is, and the general idea is that we will take our function and iterate over the columns in the data frame. In other words, automatically apply the function without having to manually type the function out multiple times. There are two primary ways to do this - let’s consider them now.

3.2.3 Function + for-loop

The for-loop is one of the main control-flow constructs in almost any programming language, including R. It can be used to iterate over a group of objects (e.g. a vector, a list, a dataframe), applying the same manipulations to each object. In our specific case we are going to use a for-loop to iterate over the variable columns of our original dataframe. The for-loop I have written is fairly simple - the basic idea is to cycle over each of the relevant variables, applying our function to each variable in term. A dataframe is created and appended to in each cycle.

dat_converted <-  data.frame(matrix(ncol = 0, nrow = 20))
for (name in names(dat[2:7])){
  dat_converted <-  cbind(dat_converted, split_to_long(name))
}
dat_converted |> 
  kable(align = "c", digits = 2)

id	timepoint	var_1	var_2	var_3	var_4	var_5	var_6
1	base	1	0	0	38	2	4
1	3mo	1	8	0	30	4	10
1	6mo	1	0	0	38	1	1
1	12mo	1	0	0	30	3	1
2	base	1	0	0	24	1	1
2	3mo	1	0	0	34	1	2
2	6mo	1	0	1	28	1	1
2	12mo	1	0	2	30	3	3
3	base	0	-13000	-13000	-13000	-13000	-13000
3	3mo	0	-13000	-13000	-13000	-13000	-13000
3	6mo	0	-13000	-13000	-13000	-13000	-13000
3	12mo	0	-13000	-13000	-13000	-13000	-13000
4	base	1	0	0	50	0	0
4	3mo	1	0	0	60	3	1
4	6mo	1	0	0	60	3	0
4	12mo	1	0	0	61	0	0
5	base	1	0	0	40	4	6
5	3mo	1	0	0	16	6	8
5	6mo	1	20	0	20	7	4
5	12mo	-11000	-13000	-13000	-13000	-13000	-13000

Again, this gives us the desired result and combined with our function, this approach uses 21 lines of code, and importantly there is no repetition.

3.2.4 Function + purrr::map

The map family of functions within the purrr package are the tidyverse answer to base Rs apply family of functions. They essentially do the same thing, but map tend to be more consistent in their application, and are of course, designed for use with other tidyverse functions.

At its core, R is a functional programming language, which essentially means that it lends itself to using functions to solve complex problems. In this way it has the ability “to wrap up for loops in a function, and call that function instead of using the for loop directly”. In the current context, we could then imagine a way to replace our for-loop with another function - i.e. we could feed our function to another function that does the job of the for-loop. Why would you do this? Well, it is usually the case that map (or apply) not only requires less code to achieve the same result, but makes your code more readable, opens your code to parallelisation, and is slightly faster than a comparable for-loop (it’s a misconception that for-loops are slow if you set them up correctly - e.g. you should always pre-allocate a vector length rather than growing the vector in the loop).

Here we use the map function to apply our split_to_long function to each variable in the original dataframe. Because map returns a list, we take one extra step to reformat the list to a dataframe with list_cbind(). This approach uses 19 lines of code, and again, importantly, there is no repetition.

dat_converted <-  map(names(dat)[2:7], split_to_long) |> 
  list_cbind()
dat_converted |> 
  kable(align = "c", digits = 2)

id	timepoint	var_1	var_2	var_3	var_4	var_5	var_6
1	base	1	0	0	38	2	4
1	3mo	1	8	0	30	4	10
1	6mo	1	0	0	38	1	1
1	12mo	1	0	0	30	3	1
2	base	1	0	0	24	1	1
2	3mo	1	0	0	34	1	2
2	6mo	1	0	1	28	1	1
2	12mo	1	0	2	30	3	3
3	base	0	-13000	-13000	-13000	-13000	-13000
3	3mo	0	-13000	-13000	-13000	-13000	-13000
3	6mo	0	-13000	-13000	-13000	-13000	-13000
3	12mo	0	-13000	-13000	-13000	-13000	-13000
4	base	1	0	0	50	0	0
4	3mo	1	0	0	60	3	1
4	6mo	1	0	0	60	3	0
4	12mo	1	0	0	61	0	0
5	base	1	0	0	40	4	6
5	3mo	1	0	0	16	6	8
5	6mo	1	20	0	20	7	4
5	12mo	-11000	-13000	-13000	-13000	-13000	-13000

4 Summary

Before wrapping up, let’s rehash the coding expenditure in each of our three main approaches to solving this problem:

Not DRY - 55 lines of code.

Function + for-loop - 21 lines of code.

Function + purrr::map - 19 lines of code.

I hope this post has given you some ideas about how you might be able to streamline your R code. The main takeaway from this is that when presented with a coding problem, think about the big picture first. Then give some thought to how we can break that larger task into smaller steps - those steps can usually be conceptualised as one or more functions, which we can then use to iterate over some component of our data.

Improving our coding knowledge is something we should be aiming to do on a regular basis and after writing this post my aim is to ‘try for DRY’ as much as possible in all of my future R interactions. Until next time…