
I read an answer from Matt Dowle in an SO thread about a shallow function for making shallow copies in data.table. However, I can't find the thread again.

data.table does not export any function called shallow. There is an internal one, but it is undocumented. Can I use it safely? What is its behavior?

What I would like to do is make a memory-efficient copy of a big table. Let DT be a big table with n columns and f a function that adds a column in a memory-efficient way. Is something like this possible?

DT2 = f(DT)

with DT2 being a data.table whose n columns point to the original addresses (no deep copies), plus an extra column that exists only for DT2. If yes, what happens to DT if I do DT2[, col3 := NULL]?
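For context, here is a minimal sketch of the difference between plain assignment, copy(), and the internal shallow copy. It assumes data.table is installed; `data.table:::shallow` is internal and unexported, so its behavior is not guaranteed:

```r
library(data.table)

DT <- data.table(a = 1:5, b = letters[1:5])

# Plain assignment copies nothing: DT2 and DT are the same object,
# so := through DT2 is visible through DT as well.
DT2 <- DT
DT2[, c3 := a * 2]
"c3" %in% names(DT)               # TRUE: DT gained the column too

# copy() is memory-safe but duplicates every column.
DT3 <- copy(DT)
address(DT3$a) == address(DT$a)   # FALSE: a deep copy

# The internal, unexported shallow copy (use at your own risk):
DT4 <- data.table:::shallow(DT)
address(DT4$a) == address(DT$a)   # TRUE: columns are shared, not duplicated
```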

Tonio Liebrand
JRR

1 Answer


You can't use data.table:::shallow safely, no. It is deliberately unexported and not meant for user use, both in the sense that it may not work as you expect and in the sense that its name or arguments may change in future.

Having said that, you could decide to use it as long as you can either (i) guarantee that := or set* won't be called on the result, either by you or your users (if you're creating a package), or (ii) accept that if := or set* is called on the result, both objects will be changed by reference. When shallow is used internally by data.table, that's the promise we make ourselves.
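A small sketch of case (ii), assuming data.table is installed (the column name here is illustrative): sub-assigning by reference through the shallow copy is visible through the original, because the underlying column vectors are shared:

```r
library(data.table)

DT  <- data.table(x = 1:3)
DT2 <- data.table:::shallow(DT)  # internal and unexported; may change

# Sub-assign by reference via DT2: set() writes into the shared
# vector in place, so DT sees the change too.
set(DT2, i = 1L, j = "x", value = 100L)
DT$x[1]   # 100: the original changed as well
```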

More background in this answer from a few days ago: https://stackoverflow.com/a/45891502/403310

In that question I asked for the bigger picture: why is this needed? Having that clear would help raise the priority of either investigating ALTREP or perhaps doing our own reference counting.

In your question you alluded to your bigger picture, and that is very useful. So you'd like to create a function that adds working columns to a big data.table inside the function but doesn't change the big data.table. Can you explain more about why you'd like a function like that? Why not load the big data.table, add the ephemeral working columns directly to it, and then proceed? Your R session is already a working copy, in memory, of the data that is persistent somewhere else.
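The pattern suggested above can be sketched as follows (the column names and computation are illustrative):

```r
library(data.table)

DT <- data.table(x = 1:5)

# Add the ephemeral working column by reference (no copy of DT) ...
DT[, tmp := cumsum(x)]

# ... use it ...
result <- DT[tmp > 3, x]

# ... then remove it by reference when done.
DT[, tmp := NULL]
names(DT)   # back to "x" only
```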

Note that I am not saying no. I'm not saying that you don't have a valid reason. I'm asking to discover more about that valid reason so the priority can be raised.

If that isn't the answer you had seen, there are currently 39 questions or answers returned by the search string "[data.table] shallow". Worst case, you could trawl through those to find it again.

Matt Dowle
  • Thanks. It was more or less a question for myself, for a better understanding of shallow copies in `data.table`, but also for my users. My users are not familiar with update by reference. Some of my functions are designed like `set*` functions and add columns by reference, but I often get emails asking why the functions return NULL (`invisible()`) even though the documentation clearly states the internal behavior. I was wondering if a shallow copy could make the workflow more straightforward while staying just as memory-efficient. – JRR Aug 28 '17 at 19:37
  • To be clear: I never thought my idea was a good one, and I didn't plan to write such a function. I like the `set*` semantics; it is surely the best and safest way. I was just asking about the technical possibilities because I like technical questions. – JRR Aug 28 '17 at 19:50
  • (Another user here:) Regarding the why, I have a function that is called many times (by `optim`) that creates a dynamically determined number of columns that I don't need outside the function. With `shallow` instead of the raw input, I figure I can avoid the hassle of clearing those columns out. I also use it with one-time functions, just to avoid being concerned about my input data being messed with; I could probably switch to using `copy` for those, I suppose. – Frank Aug 28 '17 at 20:15
  • 1
    Thanks all. Agreed it is needed. What am I asking for I suppose? I guess a runable example showing something non-trivial that shows the need and that I can test with. I'm thinking that both `copy()` and `DT[,..someCols]` could be changed to return shallow copy and that internal `shallow` could then be removed. As long as `:=` could reliably read the column's ref count (that was the problem before). – Matt Dowle Aug 28 '17 at 20:51
  • 1
    A runable example on public data to form the basis of an article, would certainly help to increase the priority. Especially if someone else writes it. – Matt Dowle Aug 28 '17 at 20:56
  • *DT[,..someCols] could be changed to return shallow* I do agree. A lot! – JRR Aug 28 '17 at 20:57