
Background

I have some data manipulation to do on a very large data.table we'll call d1 (~125 million rows x 10 columns), which I read into R from a .csv using fread. The data is about car insurance -- transmission-related and engine-related claims. Here's the read-in:

library(data.table)

d1 <- fread("c:/analysis.csv", 
               header = TRUE, 
               na.strings = c("", "NA"),
               data.table = TRUE,
               stringsAsFactors = TRUE)

The work I need to do on d1 is fairly simple dplyr stuff -- a few joins (on much smaller tables), a couple of mutates, etc. Let's call those much smaller tables d2 and d3 for reference; assume those are also data.tables read into R in the same manner as d1 above. I've planned these operations (the joins and mutates) out on much smaller "test" subsets of d1, so I know they'll work.
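Roughly, that work has the following shape -- this is just an illustration, since the real column logic isn't shown here; the mutate and the d3 join key below are made-up placeholders:

library(dplyr)

# Illustrative shape of the planned pipeline only. The customer_age mutate and
# the "problem_code" key for d3 are hypothetical placeholders, not the real steps.
d1 <- d1 %>%
  left_join(d2, by = "service_code") %>%
  left_join(d3, by = "problem_code") %>%
  mutate(customer_age = as.integer((service_dt - customer_dob) / 365.25))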

The machine I'm running R on has 32 GB of RAM, a very recent Comet Lake Intel Core i5, and a 1 TB NVMe SSD. My versions of Windows and R are both 64-bit.

For reference, here's how much space each dataset is taking:

> format(object.size(d1), "Gb")
[1] "4 Gb"
> format(object.size(d2), "Mb")
[1] "3.2 Mb"

Here's str(d1) so you can see data types:

Classes ‘data.table’ and 'data.frame':  125640181 obs. of  10 variables:
 $ id                   : int  1551444  ...
 $ service_dt           : IDate, format: "2020-11-22"  ...
 $ service_code         : Factor w/ 41817 levels "316",..
 $ problem_code         : Factor w/ 39531 levels "0",
 $ problem_flag         : int  0 0 0 0 0 0 0 1 1 0 ...
 $ problem_type         : Factor w/ 2 levels "transmission","engine": 1 1 ...
 $ customer_dob         : IDate, format: "1976-04-14" "1980-04-25" ...
 $ customer_gender_cd   : Factor w/ 3 levels "F","M","U": 1 2 ...
 $ customer_zip_cd      : Factor w/ 8354 levels "00000","00003"
 $ old_id               : int  13449983 ...

And str(d2) as well:

'data.frame':   37323 obs. of  4 variables:
 $ service_code     : Factor w/ 36281 levels "00002081702",..: 1 2 3 ...
 $ parts_used       : Factor w/ 215 levels "Disc Brake Rotor (Front)",..: 136 ...
 $ category         : Factor w/ 5 levels "Brakes",..: 1 1 ...
 $ subcategory_1    : Factor w/ 24 levels "Rotors",..: 22 20 ...

The Problem

I go to run this join ...

d1 <- left_join(d1, d2, by = c("service_code" = "service_code"))

... and I get this error: Error: cannot allocate vector of size 277.7 Gb. Note that d2 here looks like this:

> dim(d2)
[1] 37323     4

What I've Tried

So I've read several "out of memory in R" posts on StackOverflow trying to figure out what I can do about this. Using data.tables was one suggestion. I've tried to use gc() to run garbage collection before I run the join; that doesn't work either. If I look at memory limits, I get this:

> memory.limit()
[1] 32502

I take this to mean that R can already use essentially all of my system's RAM (~32 GB), so I don't think raising the limit will help.

In other posts, I read about the ff package for large datasets, but I ran into other trouble trying to use it. I successfully converted both my data.tables to ffdf format like so:

d1 <- as.ffdf(d1)
d2 <- as.ffdf(d2)

And then tried to run the join, but get this error: Error in UseMethod("left_join") : no applicable method for 'left_join' applied to an object of class "ffdf".

My concern is that even if I can get this particular ff code to run somehow, the rest of the work I have to do is pretty dplyr-dependent. I'm worried that equivalent functions for what I need to do won't exist in the ff universe, if that makes any sense. Is there a way to keep using ff but keep d1 and d2 (and d3, not shown here) as data.tables?

Edit

After some interaction with @Serkan -- see their very helpful answer below -- I discovered that d2 had duplicate values of my join key, d2$service_code. Around 1,100 of them, in fact. Each duplicated key makes every matching row of d1 appear once per duplicate, so the join result balloons; in hindsight, that was likely the bulk of the reason R tried to allocate a 277.7 Gb vector on a simple left join. So, I learned something today: when left-joining, make sure your RHS table's key variable is unique!
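To see why duplicate keys blow up a join, here's a tiny made-up illustration (the deduplication at the end keeps the first row per key -- which row you should keep depends on your data):

library(dplyr)

lhs <- data.frame(service_code = c("A", "A", "B"), x = 1:3)
rhs <- data.frame(service_code = c("A", "A", "A", "B"), y = 1:4)  # "A" appears 3 times

# Every "A" row in lhs matches all three "A" rows in rhs, so the result has
# 2 * 3 + 1 = 7 rows instead of 3: duplicate keys multiply rows, they don't just add.
nrow(left_join(lhs, rhs, by = "service_code"))
#> [1] 7

# Checking for, and removing, duplicate keys before the real join:
anyDuplicated(d2$service_code)            # > 0 means duplicates exist
d2 <- d2[!duplicated(d2$service_code), ]  # keep the first row per service_code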

logjammin
  • sqldf can do out-of-memory joins if you use `sqldf("select ...", dbname = tempfile())` (a minimal sketch of this appears just below these comments). By default it performs calculations in memory, but if you specify dbname it will use the indicated database, creating it if it does not exist. – G. Grothendieck Aug 09 '21 at 17:48
  • What does "out of memory" mean in this context? (Sorry for perhaps noob question.) – logjammin Aug 09 '21 at 17:50
  • It effectively means that it can deal with larger problems than R can handle. It is only one line of code so is pretty easy to try. – G. Grothendieck Aug 09 '21 at 17:52
  • It would be helpful if you shared the structure of d2. If you are using similar data, I would expect that joining a 4 GB object with 10 columns and adding 3 more columns would usually only add up to 5 or 6 GB, unless you have very large data in those columns (like the text of a book or something). Seeing the str of d2 can help us to identify whether the memory increase is as expected. – Adam Sampson Aug 09 '21 at 17:52
  • My other guess would be that you have a many-to-many join going on here, which seems strange given that you have an ID column. But you might confirm that `distinct(df2, ID)` has the same number of rows as df2. – Adam Sampson Aug 09 '21 at 17:54
  • @AdamSampson: thanks -- added `str(d2)` right after `str(d1)`. No, nothing in `d2` is all that heavy, so to speak: just factors with small amounts of data in each cell. The longest you'll see is a description of a car part, but none are more than 3-4 words. – logjammin Aug 09 '21 at 18:01
  • Could you `dput` your `data`? To me it sounds like your `join` by `id` itself is the culprit, not your memory. – Serkan Aug 09 '21 at 19:24
  • @Serkan, I realized when I read your question that I'd misrepresented my own `join` for the purposes of making a good SO example. In my real datasets, I'm joining `d1$service_code` on `d2$service_code` -- not `ID` on `ID`. I've revised the post; have a look and see if that makes a difference for your suspicion about the join by `id`. – logjammin Aug 09 '21 at 19:43
  • My suspicion stands. I'm currently making an example on my computer with similar specs... I'll be back! – Serkan Aug 09 '21 at 19:44
  • Okay, I'll try and see what `dput` is and how I can fulfill your request. Stand by ... – logjammin Aug 09 '21 at 19:46
  • Actually, even with a correct join my session crashes. So we abort that mission and go for `dtplyr` instead. It's a `dplyr` front end to `data.table`, and has all the awesomeness of both worlds! – Serkan Aug 09 '21 at 20:06
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/235800/discussion-between-serkan-and-logjammin). – Serkan Aug 09 '21 at 20:08
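For reference, here is a minimal sketch of the sqldf approach suggested in the first comment above (untested on these data; it assumes d1 and d2 are available as data frames in the session and that the columns match the question):

library(sqldf)

# dbname = tempfile() makes sqldf build an on-disk SQLite database and run the
# query there, instead of holding everything in RAM.
joined <- sqldf(
  "select a.*, b.parts_used, b.category, b.subcategory_1
     from d1 a
     left join d2 b on a.service_code = b.service_code",
  dbname = tempfile()
)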

1 Answer


I would propose dtplyr for this, in this fashion:

library(data.table)
library(dtplyr)
library(dplyr)

large_data <- tibble(
        id = 1:125640181,
        value1 = 1:125640181,
        value2 = 1:125640181,
        value3 = 1:125640181,
        value4 = 1:125640181,
        value5 = 1:125640181,
        value6 = 1:125640181,
        value7 = 1:125640181,
        value8 = 1:125640181,
        value9 = 1:125640181
) %>% lazy_dt() 


small_data <- tibble(
        id = 1:37323,
        value1 = 1:37323,
        value2 = 1:37323,
        value3 = 1:37323
        
) %>% lazy_dt() 

And then join by,

joined_data <- left_join(
        large_data,
        small_data, by = "id"
) %>% as_tibble()

Which gives,

# A tibble: 6 x 13
     id value1.x value2.x value3.x value4 value5 value6 value7 value8 value9 value1.y value2.y value3.y
  <int>    <int>    <int>    <int>  <int>  <int>  <int>  <int>  <int>  <int>    <int>    <int>    <int>
1     1        1        1        1      1      1      1      1      1      1        1        1        1
2     2        2        2        2      2      2      2      2      2      2        2        2        2
3     3        3        3        3      3      3      3      3      3      3        3        3        3
4     4        4        4        4      4      4      4      4      4      4        4        4        4
5     5        5        5        5      5      5      5      5      5      5        5        5        5
6     6        6        6        6      6      6      6      6      6      6        6        6        6

This runs on my machine, which has 32 GB of RAM as well. And you can keep all your dplyr functions and syntax while taking advantage of the memory efficiency of data.table!


You can read more about it at https://github.com/tidyverse/dtplyr
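Applied to the tables in the question, the same pattern would look roughly like this (a sketch only; it assumes d1 and d2 are loaded as data.tables, that d2$service_code has already been deduplicated, and the mutate is just a placeholder):

library(data.table)
library(dtplyr)
library(dplyr)

# Wrap the existing data.tables; nothing is computed until the final step.
d1_lazy <- lazy_dt(d1)
d2_lazy <- lazy_dt(d2)

result <- d1_lazy %>%
  left_join(d2_lazy, by = "service_code") %>%
  mutate(claim_year = year(service_dt)) %>%  # placeholder -- substitute the real transformations
  as.data.table()                            # only here is the data.table code generated and run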

Serkan
  • This looks promising -- I'm out of the office for a few minutes but will run back ASAP and try this. I'll let you know. Thanks so much @Serkan. – logjammin Aug 09 '21 at 20:12
  • Yeah, it's awesome really! Don't forget to `dput` your `data` as well -- I'd still wager that your `join` is faulty as well! It looks like it's mutating itself. There is no chance two files of that size should take up `270 Gb`. – Serkan Aug 09 '21 at 20:16
  • I tried running `dput(d1)` but it took forever -- I reset R after ~8min. I've read the documentation but I don't understand what `dput` does. – logjammin Aug 09 '21 at 20:21
  • Oh! Hang on! `dput(head(data))` instead! `dput` prints your data such that you can copy-paste from the console to SO. – Serkan Aug 09 '21 at 20:22
  • Boy do I have egg on my face. Noticed something with `d2` while doing the `dput` for you: while I had (earlier) removed duplicate *rows* from `d2`, I had not removed duplicate *values* of `service_code`. I just discovered ~1000 of them. Sure enough, when I ran your answer I got an error: `Join results in more than 2^31 rows (internal vecseq reached physical limit). Very likely misspecified join. Check for duplicate key values` ... Went back to `d2`, removed all duplicate `d2$service_code`, re-ran your answer and ... voila. Worked. Thanks so so much. I'll edit the OP now to reflect this. – logjammin Aug 09 '21 at 20:46
  • Yeah, I thought so! :-) The duplicates are a `join`-killer! – Serkan Aug 09 '21 at 20:56
  • But keep us posted on whether it is possible to do the joins without `dtplyr`! I wasn't able to do it. – Serkan Aug 09 '21 at 21:06
  • Roger that. Since my priority here is keeping `dplyr` syntax for my operations, I'm looking into setting up Spark locally on my machine and then using `sparklyr` to do some of this stuff. I'll update with an answer if that works. – logjammin Aug 09 '21 at 21:11
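A bare-bones sketch of what that sparklyr route could look like (hypothetical; it assumes a local Spark installation via sparklyr::spark_install() and the same file path as in the question):

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# Read the big CSV straight into Spark rather than through R's memory,
# and copy the small lookup table over.
d1_spark <- spark_read_csv(sc, name = "d1", path = "c:/analysis.csv")
d2_spark <- copy_to(sc, d2, name = "d2", overwrite = TRUE)

result <- d1_spark %>%
  left_join(d2_spark, by = "service_code") %>%
  collect()   # pull the result back into R only if it fits in memory

spark_disconnect(sc)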