
I have a database with 250 columns and want to read only 50 of them, instead of loading all of them and then dropping columns with dplyr::select. I suppose I can do that using a column specification, but I don't want to type the specification manually for all those columns.

The 50 columns I want to keep share a common prefix, say 'blop', so I manually modified the column specification object returned by readr::spec_csv and then used it to read my data file:

library(readr)

short_colspec <- spec_csv('myfile.csv')
# keep columns whose name starts with 'blop'; skip all others
# (sapply with simplify = FALSE returns a *named* list, so the
#  spec still matches columns by name; lapply would drop the names)
short_colspec$cols <- sapply(
    names(short_colspec$cols),
    function(name) {
        if (substr(name, 1, 4) == 'blop') {
            col_character()
        } else {
            col_skip()
        }
    },
    simplify = FALSE
)
short_data <- read_csv('myfile.csv', col_types = short_colspec)

Is there a more robust way to build such a column specification with readr (or any other package) functions than what I did?

Romain
  • Using `readr` it looks like how you've done it is the approach to take. It could potentially be improved with `grepl` as opposed to `substr` so as to be more flexible. Also, your code assumes the columns you are reading in are all `col_character`. – Chris Sep 04 '18 at 15:15
  • Indeed, `grepl` provides more flexibility! I assumed only string variables for the sake of simplicity; it is true that the `if else` syntax can be expanded to take more column types into account. – Romain Sep 05 '18 at 07:30
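Acting on both comment suggestions, the spec can be built with `grepl` and can reuse the types that `spec_csv` already guessed instead of forcing `col_character` on everything (a sketch under the question's assumptions; `^blop` is the prefix pattern from the question):

```r
library(readr)

short_colspec <- spec_csv("myfile.csv")
# keep the already-guessed type for columns matching the prefix, skip the rest
short_colspec$cols <- sapply(
    names(short_colspec$cols),
    function(name) {
        if (grepl("^blop", name)) short_colspec$cols[[name]] else col_skip()
    },
    simplify = FALSE  # keep a named list so the spec matches columns by name
)
short_data <- read_csv("myfile.csv", col_types = short_colspec)
```

Because the matching columns keep their guessed types, this also handles files where the `blop` columns are not all character.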

1 Answer


Using data.table's `fread`, you can either drop the columns you want to skip (`drop =`) or keep only the ones you want (`select =`):

```r
library(data.table)

# read only the header line to decide which columns to keep;
# adjust the split character ';' according to your csv delimiter
header   <- readLines("myfile.csv", n = 1L)
keep_col <- grep("blop", unlist(strsplit(header, ";")))

# read the file, only the desired columns
fread("myfile.csv", select = keep_col)
```
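Equivalently, under the same assumptions about the file, the `drop` argument can be passed the inverted match:

```r
library(data.table)

# drop the columns that do NOT match the prefix
header     <- readLines("myfile.csv", n = 1L)
drop_col   <- grep("blop", unlist(strsplit(header, ";")), invert = TRUE)
short_data <- fread("myfile.csv", drop = drop_col)
```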
Wimpel