I have a function that deletes the specified columns from a given data.table, and it works fine with piping:
library(data.table)
library(dplyr)
data(iris)
dt.iris <- as.data.table(iris)
# Delete the given columns by reference and return the table
my_del_cols <- function(inpDT, cols2del=c('Species')){
  inpDT[,c(cols2del):=NULL]
  inpDT
}
dt.ok <-
  dt.iris[Species=='setosa'] %>%
  my_del_cols()
str(dt.iris) # Source table not changed:
# Classes ‘data.table’ and 'data.frame': 150 obs. of 5 variables:
# $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
# $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
# $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
# $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
# $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
# - attr(*, ".internal.selfref")=<externalptr>
print(dt.iris)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1: 5.1 3.5 1.4 0.2 setosa
# 2: 4.9 3.0 1.4 0.2 setosa
# 3: 4.7 3.2 1.3 0.2 setosa
# 4: 4.6 3.1 1.5 0.2 setosa
# 5: 5.0 3.6 1.4 0.2 setosa
# ---
# 146: 6.7 3.0 5.2 2.3 virginica
# 147: 6.3 2.5 5.0 1.9 virginica
# 148: 6.5 3.0 5.2 2.0 virginica
# 149: 6.2 3.4 5.4 2.3 virginica
# 150: 5.9 3.0 5.1 1.8 virginica
View(dt.iris) # works correctly in RStudio
But if I use %>% dplyr::filter(condition) instead of data.table's dt[condition], I hit a weird bug: the deleted column is renamed to NA in the source table:
dt.bad <-
  dt.iris %>%
  dplyr::filter(Species=='setosa') %>%
  my_del_cols()
str(dt.iris) # Note the change in the source table, last column:
# Classes ‘data.table’ and 'data.frame': 150 obs. of 5 variables:
# $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
# $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
# $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
# $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
# $ NA: Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
# - attr(*, ".internal.selfref")=<externalptr>
print(dt.iris)
# Sepal.Length Sepal.Width Petal.Length Petal.Width NA
# 1: 5.1 3.5 1.4 0.2 setosa
# 2: 4.9 3.0 1.4 0.2 setosa
# 3: 4.7 3.2 1.3 0.2 setosa
# 4: 4.6 3.1 1.5 0.2 setosa
# 5: 5.0 3.6 1.4 0.2 setosa
# ---
# 146: 6.7 3.0 5.2 2.3 virginica
# 147: 6.3 2.5 5.0 1.9 virginica
# 148: 6.5 3.0 5.2 2.0 virginica
# 149: 6.2 3.4 5.4 2.3 virginica
# 150: 5.9 3.0 5.1 1.8 virginica
View(dt.iris)
# Error in View : Internal error: length of names (4) is not length of dt (5)
As I understand it, this happens because subsetting with dt[condition] creates a copy of the table, while filter() does not. My first workaround was to copy the input object inside my function:
my_del_cols_copy <- function(inpDT, cols2del=c('Species')){
  inpDT <- copy(inpDT) # added
  inpDT[,c(cols2del):=NULL]
  inpDT
}
However, I was not very happy with it, because sometimes I do want to modify the input table by reference, without creating a copy. Another solution I found was to add copy() before or after filter():
dt.ok2 <-
  dt.iris %>%
  dplyr::filter(Species=='setosa') %>%
  copy() %>% # added
  my_del_cols()
This works (though I am not very happy with it either, since it requires adding that extra line), but I am still puzzled as to why dplyr::filter() messes with the column names.
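One way to probe this, as a rough diagnostic sketch rather than an explanation: the View() error reports 4 names for 5 columns, which hints that the names vector of the source table itself was changed in place. Assuming the filtered result might share its names attribute with the source, the memory addresses can be compared with data.table::address(). The helper shares_names() is a hypothetical name made up for this illustration, and the result may well differ between dplyr and data.table versions, so no expected output is shown for the filter() case.

```r
library(data.table)
library(dplyr)

dt.src <- as.data.table(iris)

# Hypothetical helper (not part of data.table or dplyr):
# TRUE if both tables point at the very same names vector in memory.
shares_names <- function(a, b) {
  identical(address(names(a)), address(names(b)))
}

dt.sub  <- dt.src[Species == 'setosa']                    # data.table subsetting
dt.filt <- dt.src %>% dplyr::filter(Species == 'setosa')  # dplyr filtering

shares_names(dt.src, dt.sub)        # does dt[condition] share names with the source?
shares_names(dt.src, dt.filt)       # does filter() share names with the source?
shares_names(dt.src, copy(dt.src))  # an explicit copy() should not share them
```

If the filter() case returns TRUE, a later := on the result would be editing a names vector that the source table also uses, which would be consistent with the symptoms above. This only tests one hypothesis and does not rule out other causes.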
UPD. I came up with another hack that forces copying the object without adding copy():
dt.ok <-
  dt.iris[T] %>% # dt.iris[T] forces copying
  dplyr::filter(Species=='setosa') %>%
  my_del_cols()
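For the case mentioned earlier, where modifying the input by reference is sometimes wanted and sometimes not, a possible middle ground is to make the copy the default and let the caller opt out. This is only a minimal sketch: my_del_cols_opt and its in_place argument are hypothetical names invented here, not an existing data.table convention.

```r
library(data.table)

# Hypothetical variant of my_del_cols(); `in_place` is an illustrative
# argument name, not something data.table defines.
my_del_cols_opt <- function(inpDT, cols2del = c('Species'), in_place = FALSE) {
  # Copy by default so the caller's table (and anything sharing its
  # internals) is left untouched; skip the copy only on request.
  if (!in_place) inpDT <- copy(inpDT)
  inpDT[, (cols2del) := NULL]
  inpDT
}

dt.src <- as.data.table(iris)

dt.new <- my_del_cols_opt(dt.src)         # works on a copy
names(dt.src)                             # Species is still present here

my_del_cols_opt(dt.src, in_place = TRUE)  # modifies dt.src by reference
names(dt.src)                             # Species has been removed
```

With the default in_place = FALSE the function can be placed after dplyr::filter() in a pipe without an extra copy() step, at the cost of one copy per call; passing in_place = TRUE keeps the by-reference behaviour of the original function.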