I need to read a data file in feather format with a predefined set of columns. Reading fails with an error if a requested column does not exist in the file. How can I check whether a column exists "before" reading the data set?

library(feather)

# 1. Data set
df_mtcars <- mtcars

# 2. Drop column
df_mtcars$mpg <- NULL

# 3. Save data
write_feather(df_mtcars, "df_mtcars")

# 4. How to check column existence in the file 'before' reading
if(!is.null(...)) {
  read_feather("df_mtcars", columns = c("mpg"))
}

Thanks!

Andrii
  • Not sure about `feather` (never used it) but maybe we can use `hasName(df_mtcars, "mpg")` – dario Mar 09 '20 at 14:13
  • The issue is checking column existence at the file level, "before" reading the data into memory – Andrii Mar 09 '20 at 14:22
  • @Andrii `feather` is a binary file format. If you look at the source code for `read_feather`, it reads the _whole_ file into memory by calling `feather(path)` then selects the columns you want. So your best bet is to do `read_feather_column <- function(path, column) {df <- feather(path); if(hasName(df, column)) return(df[column])}` – Allan Cameron Mar 09 '20 at 14:31

2 Answers

feather is a binary file format. If you look at the source code for read_feather, it reads the whole file into memory by calling feather(path) then selects the columns you want. Look:

read_feather
#> function (path, columns = NULL) 
#> {
#>     data <- feather(path)
#>     on.exit(close(data), add = TRUE)
#>     if (is.null(columns)) 
#>         as_tibble(data)
#>     else as_tibble(data[columns])
#> }
#> <bytecode: 0x376de188>
#> <environment: namespace:feather>

The (uncompressed) column names are in the file, but they are not at reliable locations, because they appear after variable-length data fields, so there is no way to read just a small portion of the binary file and get the names.

So your best bet is to do something similar that first checks for existence of the specified column:

read_feather_column <- function(path, column) 
{
  df <- feather(path)
  if(hasName(df, column)) 
    return(as_tibble(df[column]))
}
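
The same check can be extended to several columns at once. The following is only a sketch: the helper name `read_feather_columns` is hypothetical, and it assumes that `names()` on the object returned by `feather()` gives the column names (which `hasName()` above also relies on) and that `tibble::as_tibble()` accepts the subset, as it does inside `read_feather`.

```r
library(feather)

# Hypothetical helper: return only the requested columns that exist in the file
read_feather_columns <- function(path, columns) {
  df <- feather(path)
  on.exit(close(df), add = TRUE)  # close the file, as read_feather does

  present <- intersect(columns, names(df))
  missing <- setdiff(columns, names(df))

  # Report requested columns that are not in the file
  if (length(missing) > 0) {
    warning("Columns not found: ", paste(missing, collapse = ", "))
  }

  # Nothing to read: return NULL instead of an empty subset
  if (length(present) == 0) {
    return(NULL)
  }

  tibble::as_tibble(df[present])
}
```

With the file from the question, `read_feather_columns("df_mtcars", c("mpg", "cyl"))` would warn about `mpg` and return only `cyl`.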
Allan Cameron
  • Hi, Allan! Thanks for the solution. Please have a look at my solution: I just read the metadata before reading the file – Andrii Mar 09 '20 at 14:38
  • Nice solution @Andrii. However, the metadata function still reads the whole file into memory before returning TRUE or FALSE, so checking it using your method would involve reading the file twice into memory to get the data you wanted if it existed. It would be quicker to read it once and return the data if it exists. – Allan Cameron Mar 09 '20 at 14:52

Here is the function I designed to solve this issue:

#' Check if a column exists in a feather file
#' @param file_name path to the feather file
#' @param column_name name of the column to check
#' @return logical value 'TRUE' if 'column_name' exists in the file
is_column_feather_file <- function(file_name, column_name) {

  # 1. Init result
  result <- FALSE

  # 2. Read metadata and search for 'column_name'
  #    ('&&' short-circuits, so a NULL 'column_name' is rejected before the comparison)
  if(!is.null(column_name) && (column_name != "") && file.exists(file_name)) {

    # 2.1. Metadata
    df_meta_data <- feather_metadata(file_name)

    # 2.2. Check if column exists
    result <- column_name %in% names(df_meta_data$types)

  }

  # 3. Return result
  result

}


# Test (uses the file written in the question)
is_column_feather_file("df_mtcars", "mpg")
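
The same metadata lookup can guard the actual read. This is a minimal sketch, not a definitive implementation: the temporary file and the variable names `available`, `wanted` and `to_read` are made up for illustration, and it assumes, as in the function above, that `feather_metadata()` exposes the column names as the names of its `types` element.

```r
library(feather)

# Assumed setup: a temporary file holding mtcars without the 'mpg' column
path <- tempfile(fileext = ".feather")
df_mtcars <- mtcars
df_mtcars$mpg <- NULL
write_feather(df_mtcars, path)

# Column names as recorded in the file's metadata
available <- names(feather_metadata(path)$types)

# Keep only the requested columns that are actually in the file
wanted  <- c("mpg", "cyl")
to_read <- intersect(wanted, available)

# Read only if at least one requested column is present
if (length(to_read) > 0) {
  df <- read_feather(path, columns = to_read)
}
```

Filtering the requested names first means `read_feather` is never asked for a column that is missing from the file.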
Andrii