Having recently transitioned from a pure academic setting to a proper data analysis team, I was (shock horror) told that all the tidyverse coding skills I had absorbed up until this point would be replaced with a new and mysterious syntax called data.table (cue Weird Al’s Everything You Know is Wrong).
So, in light of having to quickly learn how to efficiently work in this new language, this post will catalogue my journey in translating my existing dataverse knowledge and code into equivalent or better data.table.
Whereas the Tidyverse ecosystem attempts to focus on the readability and flexibility
of code (although some would argue this is entirely untrue), data.dable
is more
pragmatic: focusing on the speed and conciseness of operations, making code shorter
and run-times quicker.
The main way that data.table
achieves this syntax-wise is by doing away with the
need to perform data filtering and manipulation with separate functions
(think filter()
, select()
, arrange()
etc…) piped into one another, instead preferring
to keep as much manipulation within the data.table[]
expression as possible.
Here are some side-by-side examples of some simple manipulation operations using the mtcars test dataset.
Let’s try filtering for a certain value and changing another variable for all the filtered items.
tidyverse
mtcars %>%
filter(cyl == 6) %>%
mutate(gear = 4)
data.table
mtcars_dt[ cyl == 6, gear := 4 ]
Where with tidyverse
I would first filter and then pipe the output into
a mutate function, with data.table I do all the filtering and mutation within a
single command.
It should also be noted that unlike the tidyverse
, which runs under the assumption that
you will be saving the modified data to a specific data.frame, data tables are modified as-is within
memory unless specified otherwise, removing the need to create tons of different
versions of the same set of data when manipulating things.
Adding or removing a column to a dataset follows many of the same principles as modifying values.
tidyverse
mtcars %>%
add_column(New = 0, New2 = 0)
data.table
mtcars_dt[, c("New", "New2") := 0,]
New columns can also be generated using data from within the table without needing to specify
each individual variables’ source dataset, as data.table
treats any column as a standard
variable, making code substantially shorter.
tidyverse
mtcars %>%
add_column(test = mtcars$disp + sum(mtcars$mpg, mtcars$hp))
## Or using the dot-notation to reference the `data` object
mtcars %>%
tibble::add_column(test = .$disp + sum(.$mpg, .$hp))
data.table
mtcars_dt[, test := disp + sum(mpg, hp)]
Let’s consider how I would perform some basic joining operations
tidyverse
# inner Join
inner_join(df1, df2)
# Left join
left_join(df1, df2)
# Full join
full_join(df1, df2)
data.table
# Dummy data
dt1 <- data.table(x = 1:5, y = 5:1)
dt2 <- data.table(x = 3:7, y = 7:3)
# Inner Join
dt1[
dt2
, on = .(x)
, nomatch = 0 ## indicates you want an inner join
]
# Left Join
dt2[
dt1
, on = .(x)
, nomatch = NA ## indicates you want a left-join on dt1
]
# Full join (need to specify sort as false to prevent key creation, we don't use keys here)
merge(dt1, dt2,
by = c("x","y"),
all = TRUE, sort = FALSE
)
Again, in a more pragmatic way of doing things, data.table
prefers to keep everything
within the data.table[]
statement itself, or alternatively with the merge function,
whilst tidyverse opts for a more readable but bloated system of separate named functions.
Let’s say I want to turn all cyl points into variables
tidyverse
mtcars <- mtcars %>%
pivot_wider(names_from = cyl, names_prefix = "cyl_point_", values_from = cyl)
mtcars
data.table
mtcars_dt[, cyl_point := paste("cyl_point", cyl, sep ="_")]
mtcars_dt<- dcast(
mtcars_dt
, carname ~ cyl_point
, value.var = "cyl")
mtcars_dt
Now let’s lengthen the results of the above pivot back to a single variable
tidyverse
mtcars <- mtcars %>%
pivot_longer("cyl_point_6":"cyl_point_8", names_to = "cyl", names_prefix = "cyl_point_") %>%
mutate(cyl = as.numeric(cyl))
mtcars
data.table
mtcars_dt <- melt(mtcars_dt,
id_vars = "id",
measure.vars = c("cyl_point_4", "cyl_point_6", "cyl_point_8"),
variable.name = "cyl_point"
)
mtcars_dt[, cyl_point := as.numeric(gsub("cyl_point_", "", cyl_point))]
mtcars_dt
data.table
also allows for multiple operations to be performed
in a single command (referred to in the documentation as “chaining”), avoiding
the need to store intermediate data in a separate variable, saving even more
memory and making any clean-up a lot faster.
data.table
mtcars_dt <- setDT(mtcars)
output <- mtcars_dt[, .(avg_disp=mean(disp), avg_mpg=mean(mpg)), by=cyl][order(cyl), ]
output
Hopefully this blog post has demonstrated some of the advantages data.table
has over
tidyverse
in both its syntax and the efficiency and speed of its handling of data.
For those that still want/need the syntax of dplyr
there is a great
package called dtplyr, which translates the standard dplyr
syntax into
operations that data.table
can understand. That way you can use the more verbose
syntax of tidyverse
, but still retain the efficiency and speed advantages that data.table
provides.