Data Table vs Tidyverse

Introduction

Having recently transitioned from a pure academic setting to a proper data analysis team, I was (shock horror) told that all the tidyverse coding skills I had absorbed up until this point would be replaced with a new and mysterious syntax called data.table (cue Weird Al’s Everything You Know is Wrong).

So, in light of having to quickly learn how to efficiently work in this new language, this post will catalogue my journey in translating my existing dataverse knowledge and code into equivalent or better data.table.

Syntax differences

Whereas the Tidyverse ecosystem attempts to focus on the readability and flexibility of code (although some would argue this is entirely untrue), data.dable is more pragmatic: focusing on the speed and conciseness of operations, making code shorter and run-times quicker.

The main way that data.table achieves this syntax-wise is by doing away with the need to perform data filtering and manipulation with separate functions (think filter(), select(), arrange() etc…) piped into one another, instead preferring to keep as much manipulation within the data.table[] expression as possible.

Here are some side-by-side examples of some simple manipulation operations using the mtcars test dataset.

Let’s try filtering for a certain value and changing another variable for all the filtered items.

tidyverse

mtcars %>% 
  filter(cyl == 6) %>% 
  mutate(gear = 4) 

data.table

mtcars_dt[ cyl == 6, gear := 4 ]

Where with tidyverse I would first filter and then pipe the output into a mutate function, with data.table I do all the filtering and mutation within a single command.

It should also be noted that unlike the tidyverse, which runs under the assumption that you will be saving the modified data to a specific data.frame, data tables are modified as-is within memory unless specified otherwise, removing the need to create tons of different versions of the same set of data when manipulating things.

Data manupulation examples

Adding/removing columns

Adding or removing a column to a dataset follows many of the same principles as modifying values.

tidyverse

mtcars %>% 
  add_column(New = 0, New2 = 0)

data.table

mtcars_dt[, c("New", "New2") := 0,]

New columns can also be generated using data from within the table without needing to specify each individual variables’ source dataset, as data.table treats any column as a standard variable, making code substantially shorter.

tidyverse

mtcars %>% 
  add_column(test = mtcars$disp + sum(mtcars$mpg, mtcars$hp))

## Or using the dot-notation to reference the `data` object
mtcars %>%
  tibble::add_column(test = .$disp + sum(.$mpg, .$hp))

data.table

mtcars_dt[, test := disp + sum(mpg, hp)] 

Joining data

Let’s consider how I would perform some basic joining operations

tidyverse

# inner Join
inner_join(df1, df2)

# Left join
left_join(df1, df2)

# Full join
full_join(df1, df2)

data.table

# Dummy data
dt1 <- data.table(x = 1:5, y = 5:1)
dt2 <- data.table(x = 3:7, y = 7:3)

# Inner Join
dt1[
  dt2
  , on = .(x)
  , nomatch = 0  ## indicates you want an inner join
]

# Left Join
dt2[
  dt1
  , on = .(x)
  , nomatch = NA  ## indicates you want a left-join on dt1
]

# Full join (need to specify sort as false to prevent key creation, we don't use keys here)
merge(dt1, dt2, 
      by = c("x","y"), 
      all = TRUE, sort = FALSE
      )

Again, in a more pragmatic way of doing things, data.table prefers to keep everything within the data.table[] statement itself, or alternatively with the merge function, whilst tidyverse opts for a more readable but bloated system of separate named functions.

Pivoting data

Widening

Let’s say I want to turn all cyl points into variables

tidyverse

mtcars <- mtcars %>% 
  pivot_wider(names_from = cyl, names_prefix = "cyl_point_", values_from = cyl)
mtcars

data.table

mtcars_dt[, cyl_point := paste("cyl_point", cyl, sep ="_")]

mtcars_dt<- dcast(
  mtcars_dt
  , carname ~ cyl_point
  , value.var = "cyl")

mtcars_dt

Lengthening

Now let’s lengthen the results of the above pivot back to a single variable

tidyverse

mtcars <- mtcars %>% 
  pivot_longer("cyl_point_6":"cyl_point_8", names_to = "cyl", names_prefix = "cyl_point_") %>%
  mutate(cyl = as.numeric(cyl))

mtcars

data.table

mtcars_dt <- melt(mtcars_dt,
  id_vars = "id",
  measure.vars = c("cyl_point_4", "cyl_point_6", "cyl_point_8"),
  variable.name = "cyl_point"
  )

mtcars_dt[, cyl_point := as.numeric(gsub("cyl_point_", "", cyl_point))]
mtcars_dt

Chaining operations

data.table also allows for multiple operations to be performed in a single command (referred to in the documentation as “chaining”), avoiding the need to store intermediate data in a separate variable, saving even more memory and making any clean-up a lot faster.

data.table

mtcars_dt <- setDT(mtcars)
output <- mtcars_dt[, .(avg_disp=mean(disp), avg_mpg=mean(mpg)), by=cyl][order(cyl), ]
output

Conclusion

Hopefully this blog post has demonstrated some of the advantages data.table has over tidyverse in both its syntax and the efficiency and speed of its handling of data.

For those that still want/need the syntax of dplyr there is a great package called dtplyr, which translates the standard dplyr syntax into operations that data.table can understand. That way you can use the more verbose syntax of tidyverse, but still retain the efficiency and speed advantages that data.table provides.