Having recently transitioned from a pure academic setting to a proper data analysis team, I was (shock horror) told that all the tidyverse coding skills I had absorbed up until this point would be replaced with a new and mysterious syntax called data.table (cue Weird Al’s Everything You Know is Wrong).
So, in light of having to quickly learn how to efficiently work in this new language, this post will catalogue my journey in translating my existing dataverse knowledge and code into equivalent or better data.table.
Whereas the Tidyverse ecosystem attempts to focus on the readability and flexibility
of code (although some would argue this is entirely untrue),
data.dable is more
pragmatic: focusing on the speed and conciseness of operations, making code shorter
and run-times quicker.
The main way that
data.table achieves this syntax-wise is by doing away with the
need to perform data filtering and manipulation with separate functions
arrange() etc…) piped into one another, instead preferring
to keep as much manipulation within the
data.table expression as possible.
Here are some side-by-side examples of some simple manipulation operations using the mtcars test dataset.
Let’s try filtering for a certain value and changing another variable for all the filtered items.
mtcars %>% filter(cyl == 6) %>% mutate(gear = 4)
mtcars_dt[ cyl == 6, gear := 4 ]
tidyverse I would first filter and then pipe the output into
a mutate function, with data.table I do all the filtering and mutation within a
It should also be noted that unlike the
tidyverse, which runs under the assumption that
you will be saving the modified data to a specific data.frame, data tables are modified as-is within
memory unless specified otherwise, removing the need to create tons of different
versions of the same set of data when manipulating things.
Adding or removing a column to a dataset follows many of the same principles as modifying values.
mtcars %>% add_column(New = 0, New2 = 0)
mtcars_dt[, c("New", "New2") := 0,]
New columns can also be generated using data from within the table without needing to specify
each individual variables’ source dataset, as
data.table treats any column as a standard
variable, making code substantially shorter.
mtcars %>% add_column(test = mtcars$disp + sum(mtcars$mpg, mtcars$hp)) ## Or using the dot-notation to reference the `data` object mtcars %>% tibble::add_column(test = .$disp + sum(.$mpg, .$hp))
mtcars_dt[, test := disp + sum(mpg, hp)]
Let’s consider how I would perform some basic joining operations
# inner Join inner_join(df1, df2) # Left join left_join(df1, df2) # Full join full_join(df1, df2)
# Dummy data dt1 <- data.table(x = 1:5, y = 5:1) dt2 <- data.table(x = 3:7, y = 7:3) # Inner Join dt1[ dt2 , on = .(x) , nomatch = 0 ## indicates you want an inner join ] # Left Join dt2[ dt1 , on = .(x) , nomatch = NA ## indicates you want a left-join on dt1 ] # Full join (need to specify sort as false to prevent key creation, we don't use keys here) merge(dt1, dt2, by = c("x","y"), all = TRUE, sort = FALSE )
Again, in a more pragmatic way of doing things,
data.table prefers to keep everything
data.table statement itself, or alternatively with the merge function,
whilst tidyverse opts for a more readable but bloated system of separate named functions.
Let’s say I want to turn all cyl points into variables
mtcars <- mtcars %>% pivot_wider(names_from = cyl, names_prefix = "cyl_point_", values_from = cyl) mtcars
mtcars_dt[, cyl_point := paste("cyl_point", cyl, sep ="_")] mtcars_dt<- dcast( mtcars_dt , carname ~ cyl_point , value.var = "cyl") mtcars_dt
Now let’s lengthen the results of the above pivot back to a single variable
mtcars <- mtcars %>% pivot_longer("cyl_point_6":"cyl_point_8", names_to = "cyl", names_prefix = "cyl_point_") %>% mutate(cyl = as.numeric(cyl)) mtcars
mtcars_dt <- melt(mtcars_dt, id_vars = "id", measure.vars = c("cyl_point_4", "cyl_point_6", "cyl_point_8"), variable.name = "cyl_point" ) mtcars_dt[, cyl_point := as.numeric(gsub("cyl_point_", "", cyl_point))] mtcars_dt
data.table also allows for multiple operations to be performed
in a single command (referred to in the documentation as “chaining”), avoiding
the need to store intermediate data in a separate variable, saving even more
memory and making any clean-up a lot faster.
mtcars_dt <- setDT(mtcars) output <- mtcars_dt[, .(avg_disp=mean(disp), avg_mpg=mean(mpg)), by=cyl][order(cyl), ] output
Hopefully this blog post has demonstrated some of the advantages
data.table has over
tidyverse in both its syntax and the efficiency and speed of its handling of data.
For those that still want/need the syntax of
dplyr there is a great
package called dtplyr, which translates the standard
dplyr syntax into
data.table can understand. That way you can use the more verbose
tidyverse, but still retain the efficiency and speed advantages that