
I have this R script:

rm(list = ls())

library(tidyr)
suppressWarnings(library(dplyr))
outFile = "zFinal.lua"

cat("\014\n")

cat(file = outFile, sep = "")

filea <- read.csv("csva.csv", strip.white = TRUE)
fileb <- read.csv("csvb.csv", strip.white = TRUE, sep = ";", header=FALSE)

df <-    
    merge(filea, fileb, by.x = c(3), by.y = c(1)) %>%
    subset(select = c(1, 3, 6, 2)) %>%
    arrange(ColA, ColB, V2) %>%
    group_by(ColA) %>%
    mutate(V2 = paste0('"', V2, "#", ColB, '"')) %>%
    summarise(ID = paste(V2, collapse = ", ", sep=";")) %>%
    mutate(ID = paste0('["', ColA, '"] = {', ID, '},')) %>%
    mutate(ID = paste0('\t\t', ID))

df <- df[c("ID")]

cat("\n\tmyTable = {\n", file = outFile, append = TRUE, sep = "\n")
write.table(df, append = TRUE, file = outFile, sep = ",", quote = FALSE, row.names = FALSE, col.names = FALSE)
cat("\n\t}", file = outFile, append = TRUE, sep = "\n")

# Done
cat("\nDONE.", sep = "\n")

As you can see, this script opens csva.csv and csvb.csv.

This is csva.csv:

ID,ColA,ColB,ColC,ColD
2,3,100,1,1
3,7,300,1,1
5,7,200,1,1
11,22,900,1,1
14,27,500,1,1
16,30,400,1,1
20,36,900,1,1
23,39,800,1,1
24,42,700,1,1
29,49,800,1,1
45,3,200,1,1

And this is csvb.csv:

100;file1
200;file2
300;file3
400;file4

This is the output file that my script and the csv files produce:

myTable = {

    ["3"] = {"file1#100", "file2#200"},
    ["7"] = {"file2#200", "file3#300"},
    ["30"] = {"file4#400"},

}

This output file is exactly what I want. It's perfect.

This is what the script does. I'm not sure I can explain this very well so if I don't do a good job at that, please skip this section.

For each line in csva.csv, if ColB (csva) contains a number that is also in column 1 (csvb), then the output file should contain a line like this:

["3"] = {"file1#100", "file2#200"},

So, in the above example, the first line in csva has ColA = 3 and ColB = 100. In csvb, column 1 contains 100 and column 2 contains file1, which becomes the output entry "file1#100".

Because csva contains another number 3 in ColA (the last line), this is also processed and output to the same line.

OK, so my script runs very well and produces perfect output. The problem is that it takes too long to run. The csva and csvb files in my question here are only a few lines long, so the output is instant.

However, the data I have to work with in the real world is much larger: csva is over 300,000 lines and csvb is over 900,000 lines. So the script takes a long, long time to run (too long to be feasible). It works beautifully, but it takes far too long.

From commenting out lines gradually, it seems that the slowdown is with mutate and summarise. Without those lines, the script runs in about 30 seconds. But with mutate and summarise, it takes hours.

I'm not too advanced with R, so how can I make my script run faster, perhaps by improving my syntax or using faster alternatives to mutate and summarise?

Jstation
  • `rm(list = ls())` — [absolutely do not do this.](https://www.tidyverse.org/blog/2017/12/workflow-vs-script/) Also, instead of `cat` you should be using `message` for status messages. For performance, replace `merge` and `subset` with the corresponding dplyr functions. – Konrad Rudolph Jan 21 '20 at 22:22
  • What are the corresponding dplyr functions for merge and subset? Can you provide examples using my code as a template? What about summarise? – Jstation Jan 21 '20 at 22:29
  • `inner_join` and `filter`. Either way, the slow step here is probably the 300k x 900k join, which potentially creates an absolutely humungous table (potentially hundreds of billions (!) entries). This isn’t viable, the issue needs to be tackled by processing the data in a fundamentally different way, rather than just using more efficient functions. – Konrad Rudolph Jan 21 '20 at 22:29
  • I agree. That's the point of my question. How can I do this differently with my limited knowledge of R? My original question was 'How can I make my R script faster' but SO reported that wasn't specific enough so I had to change it. – Jstation Jan 21 '20 at 22:35
  • is sparklyr an option? – Saleem Khan Jan 21 '20 at 22:39
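
For concreteness, the merge and subset calls that the comments suggest replacing translate roughly as follows. This is only a sketch built from the question's own data (fileb's columns default to V1 and V2 because it is read with header = FALSE); the rest of the pipeline stays the same:

df <- filea %>%
    dplyr::inner_join(fileb, by = c("ColB" = "V1")) %>%  # replaces merge(by.x = 3, by.y = 1)
    dplyr::select(ColB, ColA, V2, ID) %>%                # replaces subset(select = c(1, 3, 6, 2))
    dplyr::arrange(ColA, ColB, V2)                       # then group_by/mutate/summarise as before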

4 Answers


Here's a dplyr approach that closely follows yours. The real differences are that rows and columns are dropped from the object as soon as possible so there's less baggage to move around.

I'm making some guesses about what will actually help with the large datasets. Please report back what the before and after durations are. I like how you said which calls were taking the longest; reporting the new bottlenecks would help too.

If this isn't fast enough, the next easiest move is probably to sqldf (which uses SQLite under the covers) or data.table. Both require learning a different syntax (unless you already know SQL), but either could be worth your time in the long run.
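
For anyone curious about the data.table route before committing to it, a rough sketch of the same join-and-collapse might look like the following. This is an illustration only, assuming the file layout from the question and untested at full scale; the main dplyr approach follows below.

library(data.table)

a <- fread("csva.csv")                                   # ID, ColA, ColB, ColC, ColD
b <- fread("csvb.csv", sep = ";", header = FALSE,
           col.names = c("ColB", "file_name"))

res <- b[a, on = "ColB", nomatch = NULL][                # inner join on ColB
    order(ColB),
    .(entry = paste0('"', file_name, "#", ColB, '"', collapse = ", ")),
    by = ColA]

cat(paste0('\t\t["', res$ColA, '"] = {', res$entry, '},', collapse = "\n"))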

# Pretend this info is being read from a file
str_a <-
"ID,ColA,ColB,ColC,ColD
2,3,100,1,1
3,7,300,1,1
5,7,200,1,1
11,22,900,1,1
14,27,500,1,1
16,30,400,1,1
20,36,900,1,1
23,39,800,1,1
24,42,700,1,1
29,49,800,1,1
45,3,200,1,1"

str_b <-
"100;file1
200;file2
300;file3
400;file4"


# Declare the desired columns and their data types.
#   Include only the columns needed.  Use the smaller 'integer' data type where possible.
col_types_a <- readr::cols_only(
  `ID`      = readr::col_integer(),
  `ColA`    = readr::col_integer(),
  `ColB`    = readr::col_integer(),
  `ColC`    = readr::col_integer()
  # `ColD`    = readr::col_integer() # Exclude columns never used
)
col_types_b <- readr::cols_only(
  `ColB`      = readr::col_integer(),
  `file_name` = readr::col_character()
)

# Read the file into a tibble
ds_a <- readr::read_csv(str_a, col_types = col_types_a)
ds_b <- readr::read_delim(str_b, delim = ";", col_names = c("ColB", "file_name"), col_types = col_types_b)

ds_a %>% 
  dplyr::select( # Quickly drop as many columns as possible; avoid reading if possible
    ID,
    ColB,
    ColA
  ) %>% 
  dplyr::left_join(ds_b, by = "ColB") %>% # Join the two datasets
  tidyr::drop_na(file_name) %>%           # Dump the records you'll never use
  dplyr::mutate(                          # Create the hybrid column
    entry = paste0('"', file_name, "#", ColB, '"')
  ) %>%
  dplyr::select(                          # Dump the unneeded columns
    -ID,
    -file_name
  ) %>% 
  dplyr::group_by(ColA) %>%               # Create a bunch of subdatasets
  dplyr::arrange(ColB, entry) %>%         # Sorting inside the group usually is faster?
  dplyr::summarise(
    entry = paste(entry, collapse = ", ", sep = ";")
  ) %>%
  dplyr::ungroup() %>%                    # Stack all the subsets on top of each other
  dplyr::mutate(                          # Mush the two columns
    entry = paste0('\t\t["', ColA, '"] = {', entry, '},')
  ) %>% 
  dplyr::pull(entry) %>%                  # Isolate the desired vector
  paste(collapse = "\n") %>%              # Combine all the elements into one.
  cat()

Result:

        ["3"] = {"file1#100", "file2#200"},
        ["7"] = {"file2#200", "file3#300"},
        ["30"] = {"file4#400"},
wibeasley
  • This answer is excellent. The script now completes in about 10 seconds (compared to hours with my script). I'm astonished how fast it runs. Absolutely fabulous work, thank you so much. I got it to output to a file using cat(file="output.lua"). You are amazing @wibeasley. – Jstation Jan 21 '20 at 23:42
  • Your question was so clear and reproducible. It made it easy to determine which constraints exist and where flexibility was allowed. I'm astonished too. Are you sure it's the same? I guess there were a lot of columns and rows that could be dumped first. I'm posting another answer with similar syntax but moderately different mechanisms underneath. I'm curious how it performs. – wibeasley Jan 22 '20 at 04:42
  • @H 1 edited his code and now it is the fastest, with an incredible run time of around 4.8 seconds when used with the reference data files (one file is over 300,000 lines and the other is over 900,000 lines). That is remarkable speed, so as my question is about reducing the run time, I have changed the accepted answer. It was a difficult call, though, because his code only works with your column definitions, and your answer contains outstanding code, so I have to say that you and H 1 have been equally helpful. – Jstation Jan 22 '20 at 15:43

Here is a more compact version of your code in base R that should offer something of a performance boost.

(Edited to match the data provided by wibeasley.)
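
Since this version reuses ds_a and ds_b from that answer, a minimal read step in base R might look like this (file names taken from the question; the csvb column names are chosen to match):

ds_a <- read.csv("csva.csv", strip.white = TRUE)
ds_b <- read.csv("csvb.csv", sep = ";", header = FALSE,
                 col.names = c("ColB", "file_name"), strip.white = TRUE)

With the full five-column csva, the -4 drop below simply removes ColC; the rest of the code only references columns by name, so it still works.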

ds_a$file_name <- ds_b$file_name[match(ds_a$ColB, ds_b$ColB)]       # look up each ColB in csvb
ds_a <- ds_a[!is.na(ds_a$file_name), -4]                            # keep matched rows; drop the unused column
ds_a <- ds_a[order(ds_a$ColB),]                                     # sort by ColB so entries collapse in order
ds_a$file_name <- paste0('"', ds_a$file_name, "#", ds_a$ColB, '"')  # build the "file#ColB" strings
res <- tapply(ds_a$file_name, ds_a$ColA, FUN = paste, collapse = ", ", sep = ";")  # collapse the strings within each ColA
res <- paste0("\t\t[\"", names(res), "\"] = {", res, "},", collapse = "\n")        # format one Lua line per ColA
cat("\n\tmyTable = {", res, "\t}", sep = "\n\n")                    # wrap in the myTable block and print

Outputting:

myTable = {

    ["3"] =  {"file1#100", "file2#200"},
    ["7"] =  {"file2#200", "file3#300"},
    ["30"] =  {"file4#400"},

}
Ritchie Sacramento
  • Thanks for your answer, @H 1. It does produce perfect output and I love how short and sweet it is. However, with my data, your script takes about 30 seconds to run, whereas the accepted answer completes in a lightning-fast 10 seconds. Still, your script deserves a thank you for being so concise while producing perfect output. You are awesome. – Jstation Jan 22 '20 at 00:29
  • WOW!!! This edited version of your answer runs in less than 5 seconds!!! How it manages to process so much data at such speed is incredible, and your code is very short and neat. As my question is about reducing the run time and your code is definitely the fastest at less than 5 seconds, I have moved my accepted mark to your answer. I hope @wibeasley doesn't mind, because his code was lightning fast too at just over 8 seconds. I hope both answers will benefit future readers interested in performance coding in R. Your code is definitely the winning formula at around 4.8 seconds!! – Jstation Jan 22 '20 at 15:37
  • @H 1 Do you think you could add comments to show what each line of the code does, and add the column definitions and the commands that open the csv files above the code, to make the answer complete? – Jstation Jan 22 '20 at 15:54

You could try loading your tables as data.tables instead. data.tables are usually faster in their operations than data.frames:

library(data.table)
filea <- fread("csva.csv")

Just check that it is still a data.table before you get to the mutate call (print it and you will see the obvious difference from a data.frame).
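
To cover the second file as well and confirm the class this answer refers to, a small extension of the snippet above might look like this (fread detects the ";" separator automatically, but it is spelled out for clarity):

fileb <- fread("csvb.csv", sep = ";", header = FALSE)  # columns default to V1, V2
class(filea)                                           # "data.table" "data.frame"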

NicolasH2

Here's another solution that leverages data.table's performance while staying within your dplyr knowledge. I'm not sure there's much room to improve on only 10 seconds, but theoretically this could help with larger datasets, where the cost of creating the indexes is amortized over a longer stretch of execution.

The dtplyr package translates the dplyr verbs (which are familiar to you) to data.table syntax under the hood. It leverages the keys, which should improve performance, especially for joining and grouping.

The dtplyr::lazy_dt feature might help optimize the dplyr-to-data.table translation.

Finally, vroom replaces readr, mostly out of curiosity. It's independent of the other changes, and it sounds like reading was never the bottleneck anyway.

col_types_a <- vroom::cols_only(
  `ID`      = vroom::col_integer(),
  `ColA`    = vroom::col_integer(),
  `ColB`    = vroom::col_integer(),
  `ColC`    = vroom::col_integer()
  # `ColD`    = vroom::col_integer() # Leave out this column b/c it's never used
)
col_types_b <- vroom::cols_only(
  `ColB`      = vroom::col_integer(),
  `file_name` = vroom::col_character()
)
ds_a <- vroom::vroom(str_a, col_types = col_types_a)
ds_b <- vroom::vroom(str_b, delim = ";", col_names = c("ColB", "file_name"), col_types = col_types_b)

# ds_a <- data.table::setDT(ds_a, key = c("ColB", "ColA"))
# ds_b <- data.table::setDT(ds_b, key = "ColB")

ds_a <- dtplyr::lazy_dt(ds_a, key_by = c("ColB", "ColA"))    # New line 1
ds_b <- dtplyr::lazy_dt(ds_b, key_by = "ColB")               # New line 2

ds_a %>% 
  dplyr::select( # Quickly drop as many columns as possible; avoid reading if possible
    ID,
    ColB,
    ColA
  ) %>%
  dplyr::inner_join(ds_b, by = "ColB") %>%                   # New line 3 (replaces left join)
  # tidyr::drop_na(file_name) %>%                            # Remove this line
  # dplyr::filter(!is.na(file_name)) %>%                     # Alternative w/ left join
  dplyr::mutate(
    entry = paste0('"', file_name, "#", ColB, '"')
  ) %>%
  dplyr::select( # Dump the unneeded columns
    -ID,
    -file_name
  ) %>% 
  dplyr::group_by(ColA) %>%
  dplyr::arrange(ColB, entry) %>%  # Sort inside the group usually helps
  dplyr::summarise(
    entry = paste(entry, collapse = ", ", sep=";")
  ) %>%
  dplyr::ungroup() %>% 
  dplyr::mutate(
    entry = paste0('\t\t["', ColA, '"] = {', entry, '},')
  ) %>% 
  dplyr::pull(entry) %>% # Isolate the desired vector
  paste(collapse = "\n") %>% 
  cat()
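
One thing worth knowing while experimenting with lazy_dt(): nothing is evaluated until the result is pulled out of the lazy object (the pull() above does that). If I recall the dtplyr API correctly, show_query() prints the data.table code the pipeline generates and collect() forces the evaluation, roughly like this:

lazy_result <- ds_a %>%
    dplyr::inner_join(ds_b, by = "ColB") %>%
    dplyr::group_by(ColA) %>%
    dplyr::summarise(n_files = dplyr::n())

dplyr::show_query(lazy_result)   # prints the translated data.table expression
dplyr::collect(lazy_result)      # forces evaluation and returns a tibble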
wibeasley