
I have a function that deletes the specified columns from a given data.table, and it works fine with piping:

require(data.table)
require(dplyr)

data(iris)
dt.iris <- as.data.table(iris)


my_del_cols <- function(inpDT, cols2del=c('Species')){
  inpDT[,c(cols2del):=NULL]
  inpDT
}

dt.ok <- 
  dt.iris[Species=='setosa'] %>% 
  my_del_cols()

str(dt.iris) # Source table not changed:
# Classes ‘data.table’ and 'data.frame':    150 obs. of  5 variables:
# $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
# $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
# $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
# $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
# $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
# - attr(*, ".internal.selfref")=<externalptr> 
  
print(dt.iris)
# Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
#   1:          5.1         3.5          1.4         0.2    setosa
#   2:          4.9         3.0          1.4         0.2    setosa
#   3:          4.7         3.2          1.3         0.2    setosa
#   4:          4.6         3.1          1.5         0.2    setosa
#   5:          5.0         3.6          1.4         0.2    setosa
# ---                                                            
# 146:          6.7         3.0          5.2         2.3 virginica
# 147:          6.3         2.5          5.0         1.9 virginica
# 148:          6.5         3.0          5.2         2.0 virginica
# 149:          6.2         3.4          5.4         2.3 virginica
# 150:          5.9         3.0          5.1         1.8 virginica

View(dt.iris) # works correctly in RStudio

But if I filter with %>% dplyr::filter(condition) instead of data.table's [condition], I get a weird bug: the name of the deleted column in the source table is changed to NA:

dt.bad <- 
  dt.iris %>% 
  dplyr::filter(Species=='setosa') %>% 
  my_del_cols()

str(dt.iris) # Note the change in the source table, last column:
# Classes ‘data.table’ and 'data.frame':    150 obs. of  5 variables:
# $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
# $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
# $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
# $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
# $ NA: Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
# - attr(*, ".internal.selfref")=<externalptr> 
  
print(dt.iris)
# Sepal.Length Sepal.Width Petal.Length Petal.Width        NA
#   1:          5.1         3.5          1.4         0.2    setosa
#   2:          4.9         3.0          1.4         0.2    setosa
#   3:          4.7         3.2          1.3         0.2    setosa
#   4:          4.6         3.1          1.5         0.2    setosa
#   5:          5.0         3.6          1.4         0.2    setosa
# ---                                                            
# 146:          6.7         3.0          5.2         2.3 virginica
# 147:          6.3         2.5          5.0         1.9 virginica
# 148:          6.5         3.0          5.2         2.0 virginica
# 149:          6.2         3.4          5.4         2.3 virginica
# 150:          5.9         3.0          5.1         1.8 virginica

View(dt.iris)
# Error in View : Internal error: length of names (4) is not length of dt (5)

As I understand it, this happens because data.table's [condition] creates a copy of the table, while filter() does not. My first workaround was to copy the input object inside my function:

my_del_cols_copy <- function(inpDT, cols2del=c('Species')){
  inpDT <- copy(inpDT) # added
  inpDT[,c(cols2del):=NULL]
  inpDT
}

but I was not very happy with it, because sometimes I do want to modify the input table by reference without creating a copy. Another solution I found was to add copy() before or after filter():

dt.ok2 <- 
  dt.iris %>% 
  dplyr::filter(Species=='setosa') %>% 
  copy() %>%             # added
  my_del_cols()

This works (though I am not very happy with it either, since it requires that extra line), but I am still puzzled why dplyr::filter() messes with the column names.
UPD: I came up with another hack that forces a copy of the object without calling copy():

dt.ok <- 
  dt.iris[T] %>%                 # dt.iris[T] forces copying
  dplyr::filter(Species=='setosa') %>% 
  my_del_cols()
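
A possible middle ground between "always copy" and "never copy", given here only as a minimal sketch: make the copy optional. The function name `my_del_cols_opt` and the `by_ref` argument are hypothetical and not part of the code above. The sketch assumes that copy() inside the function protects the source table in the same way as my_del_cols_copy() does.

```r
library(data.table)

# Hypothetical variant of my_del_cols(): `by_ref` is an illustrative argument.
# by_ref = FALSE (default): work on a deep copy, the caller's table is untouched.
# by_ref = TRUE: delete the columns in place, as the original function does.
my_del_cols_opt <- function(inpDT, cols2del = c("Species"), by_ref = FALSE) {
  if (!by_ref) inpDT <- copy(inpDT)
  inpDT[, (cols2del) := NULL]
  inpDT
}

dt.iris <- as.data.table(iris)

# Safe default: the source table keeps all five columns.
dt.safe <- my_del_cols_opt(dt.iris[Species == "setosa"])

# Explicit opt-in: the table passed in loses the column by reference.
dt.tmp <- as.data.table(iris)
my_del_cols_opt(dt.tmp, by_ref = TRUE)
```

With this shape, a pipeline that goes through dplyr::filter() would use the default, and by-reference deletion stays available where it is wanted. It still does not explain why filter() affects the column names of the source table.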
Vasily A
  • Use dtplyr instead of dplyr. It's purpose-built to work on dt with dplyr syntax. – Dean MacGregor Jul 20 '23 at 07:44
  • Thanks Dean! I didn't know about `dtplyr`. Can you explain how to use it for this example? I tried just converting my table with `lazy_dt()`, like this `dt.test <- dt.iris %>% dtplyr::lazy_dt() %>% dplyr::filter(Species=='setosa') %>% my_del_cols()` - but then my function raises an error: `Error in my_del_cols(): ! := can only be used within dynamic dots.` – Vasily A Jul 20 '23 at 07:53
  • This could be helpful – TarJae Jul 20 '23 at 08:02
  • @VasilyA sorry I never got into dplyr so while I know that dtplyr exists I don't really know how to use it. – Dean MacGregor Jul 20 '23 at 08:36
