
After getting help from two kind gentlemen, I managed to switch over from data frames + plyr to data.table.

The Situation and My Questions

As I worked on it, I noticed that peak memory usage nearly doubled from 3.5GB to 6.8GB (according to Windows Task Manager) when I added one new column using := to my data set of ~200K rows by 2.5K columns.

I then tried a 200M-row by 25-column data set; the increase was from 6GB to 7.6GB, dropping to 7.25GB after a gc().

Specifically regarding the addition of new columns, Matt Dowle himself mentioned here that (a short sketch of these operations follows the quote):

With its := operator you can:

Add columns by reference
Modify subsets of existing columns by reference, and by group by reference
Delete columns by reference

None of these operations copy the (potentially large) data.table at all, not even once.
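
For concreteness, here is a minimal sketch of the by-reference operations listed above, using a small made-up table (the example data and column names are mine, not from the original post):

library(data.table)
DT <- data.table(grp = c("a", "a", "b"), x = 1:3)

DT[, y := x * 2]                 # add a column by reference
DT[grp == "a", y := 0]           # modify a subset of an existing column by reference
DT[, z := max(x), by = grp]      # assign within groups, by reference
DT[, y := NULL]                  # delete a column by reference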

Question 1: Why would adding a single column of NAs to a DT with 2.5K columns double the peak memory used if the data.table is not copied at all?

Question 2: Why does the doubling not occur when the DT is 200M x 25? I didn't include the screenshot for this, but feel free to change my code and try it yourself.

Screenshots of Memory Usage Using the Test Code

  1. Clean re-boot, RStudio & MS Word opened - 103MB used

  2. After running the DT creation code but before adding the column - 3.5GB used

  3. After adding 1 column filled with NA, but before gc() - 6.8GB used

  4. After running gc() - 3.5GB used

Test Code

To investigate, I put together the following test code that closely mimics my data set:

library(data.table)
set.seed(1)

# Credit: Dirk Eddelbuettel's answer in 
# https://stackoverflow.com/questions/14720983/efficiently-generate-a-random-sample-of-times-and-dates-between-two-dates
RandDate <- function(N, st = "2000/01/01", et = "2014/12/31") { 
  st <- as.POSIXct(as.Date(st))
  et <- as.POSIXct(as.Date(et))
  dt <- as.numeric(difftime(et, st, unit = "sec"))   # time span in seconds
  ev <- runif(N, 0, dt)                              # random offsets within the span
  as.character(strptime(st + ev, "%Y-%m-%d"))        # return dates as character strings
}

# Create Sample data
TotalNoCol <- 2500
TotalCharCol <- 3
TotalDateCol <- 1
TotalIntCol <- 600
TotalNumCol <- TotalNoCol - TotalCharCol - TotalDateCol - TotalIntCol
nrow <- 200000

ColNames <- paste0("C", 1:TotalNoCol)

dt <- as.data.table( setNames( c(
  replicate( TotalCharCol, sample( state.name, nrow, replace = TRUE ), simplify = FALSE ), 
  replicate( TotalDateCol, RandDate( nrow ), simplify = FALSE ), 
  replicate( TotalNumCol,  round( runif( nrow, 1, 30 ), 2 ), simplify = FALSE ), 
  replicate( TotalIntCol,  sample( 1:10, nrow, replace = TRUE ), simplify = FALSE ) ), 
  ColNames ) )

gc()

# Add New columns, to be run separately
dt[, New_Col := NA ]  # Additional col; uses excessive memory?
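
As an aside (not part of the original test), one way to check that := does not copy dt itself is to compare its memory address before and after the assignment, e.g. with data.table's address() helper; the second column name below is hypothetical and only for illustration:

address(dt)            # memory address of dt before the assignment
dt[, New_Col2 := NA]   # hypothetical extra column, purely for illustration
address(dt)            # same address => dt was modified in place, not copied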

Research Done

I didn't find much discussion on memory usage for DTs with many columns, only this, and even then it's not specifically about memory.

Most discussions on large datasets + memory usage involve DTs with a very large row count but relatively few columns.

My System

Intel i7-4700 with 4 cores/8 threads; 16GB DDR3-12800 RAM; Windows 8.1 64-bit; 500GB 7200rpm HDD; 64-bit R; data.table ver 1.9.4

Disclaimers

Please pardon me for using a 'non-R' method (i.e. Task Manager) to measure memory used. Memory measurement/profiling in R is something I still haven't figured out.
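
For an in-R alternative, the following is a rough sketch (my own suggestion, not from the original post) using only base R and data.table facilities, run against the dt from the test code above; note that gc() reports only memory managed by R, so the figures will not exactly match Task Manager:

gc(reset = TRUE)                        # reset the "max used" statistics
dt[, New_Col := NA]                     # the operation under investigation
gc()                                    # "max used" column shows the peak since the reset
format(object.size(dt), units = "Mb")   # size of the final object
tables()                                # data.table's per-table summary, incl. MB used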


Edit 1: After updating to data.table ver 1.9.5 and re-running, the issue persisted, unfortunately.

[Screenshot: 6.8GB used after adding the single NA column, on data.table 1.9.5]

Comments

  • You can use `?tables` to check the memory usage of your `data.table`'s. I get `3,358MB` before the last column and `3,359MB` after adding that column. – shadow Feb 05 '15 at 15:58
  • @shadow, I think you meant `tables()`, and thanks for highlighting this function to me - I didn't know it before. I went through the function description; it appears to report the final memory used by the DT, not the maximum memory used during processing (i.e. including working memory). Not sure how these figures relate to those reported by Task Manager. Please correct me if I'm wrong. – NoviceProg Feb 05 '15 at 16:53
  • Fyi, `?tables` takes you to the help page, so that was deliberate. Similarly, you can type `?truelength` for information on how data.tables handle memory. – Frank Feb 05 '15 at 17:00
  • This is most likely due to [this bug, #921](https://github.com/Rdatatable/data.table/issues/921), which is already fixed in `1.9.5`. There were some unnecessary copies (regression from `1.9.0`, I think) that were happening. I'd recommend trying again on `1.9.5`. – Arun Feb 05 '15 at 17:35
  • @Arun, I installed DT 1.9.5 and re-ran the test with the 200K x 2.5K DT; unfortunately, the issue persisted. I have edited my post with the new screenshot showing memory of 6.8GB after adding the single NA column. – NoviceProg Feb 06 '15 at 15:47
  • Hm, seems like some copy is happening while simply returning a value from an internal function. Will try to find a more straightforward example to reproduce, and see if it's fixable. Thanks. – Arun Feb 06 '15 at 21:42
  • @Arun, glad to be of help! Kindly update us when the issue is resolved. Of course, do let me know how else I can help. – NoviceProg Feb 07 '15 at 08:10
  • @Arun, to update you: I just tried whether this 'peak memory doubling' issue has been resolved in the latest 1.9.5 - I'm afraid it's still there. I looked through Github, and there's no such issue filed yet. Shall I file one? – NoviceProg Mar 03 '15 at 15:00
  • @NoviceProg, yes please! – Arun Mar 03 '15 at 17:03
  • Hi @Arun, FYI, I have filed an issue at Github. Hope the fix is an easy one... – NoviceProg Mar 04 '15 at 11:02

1 Answer


(I can take no credit as the great DT minds (Arun) have been working on this and found it was related to print.data.table. Just closing the loop here for other SO users.)

It seems this data.table memory issue with := has been solved in R version 3.2, as noted in: https://github.com/Rdatatable/data.table/issues/1062

[Quoting @Arun from Github issue 1062...]

fixed in R v3.2, IIUC, with this item from NEWS:

Auto-printing no longer duplicates objects when printing is dispatched to a method.

So others with this problem should look to upgrading to R 3.2.
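
If it helps, a quick sanity check (my own suggestion, not from the original answer) is to confirm that the running R version is at least 3.2.0, which per the NEWS item above contains the auto-printing fix:

getRversion() >= "3.2.0"       # TRUE => the auto-printing duplication fix is present
packageVersion("data.table")   # and check the data.table version while you're at it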

  • 1
    Thanks @micstr, I clean forgot I post this issue on SO. I tested it over the weekend and confirm the issue has been resolved in R3.2 for Linux. *Disclaimer: I have switched to using R in Linux so didn't test the issue in Windows. – NoviceProg Jun 02 '15 at 00:48