
For a dataframe, I'd like to save the class of each column (e.g. character, double, factor) to a csv, and then be able to read both the data and the classes back into R.

For example, my data might look like this:

df
#> # A tibble: 3 × 3
#>    item  cost blue 
#>   <int> <int> <fct>
#> 1     1     4 1    
#> 2     2    10 1    
#> 3     3     3 0

(code for data input here:)

library(tidyverse)
df <- tibble::tribble(
  ~item, ~cost, ~blue,
     1L,    4L,    1L,
     2L,   10L,    1L,
     3L,    3L,    0L
  )

df <- df %>% 
  mutate(blue = as.factor(blue))
df

I'm able to save the classes of the data, and the data, this way:

library(tidyverse)
classes <- map_df(df, class)

write_csv(classes, "classes.csv")
write_csv(df, "data.csv")

and I can read it back this way:

classes <- read.csv("classes.csv") %>% 
  slice(1) %>% 
  unlist()
classes
df2 <- read_csv("data.csv", col_types = classes)
df2

Is there a quicker way to do all of this?

Particularly with the way I'm saving classes and then reading it back in, then slicing and unlisting?

Jeremy K.

3 Answers


You could use writeLines and its counterpart readLines for the classes. Like this:

classes <- sapply(df, class)
writeLines(classes, "classes.txt")
# to read them back
readLines("classes.txt")
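To apply the stored classes when re-reading the data (a sketch; it assumes the classes.txt and data.csv files from the question exist, and relies on read.csv's colClasses accepting full class names such as "factor"):

```r
# Re-read the saved classes: one class name per line, in column order.
classes <- readLines("classes.txt")

# colClasses applies them positionally, so "blue" comes back as a factor.
df2 <- read.csv("data.csv", colClasses = classes)
str(df2)
```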

However, also consider other formats such as Parquet (the R implementation is provided by the arrow package), which preserve the data types and are supported by many languages.
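A minimal sketch of the Parquet route (it rebuilds the question's df so the snippet runs on its own; assumes the arrow package is installed):

```r
library(arrow)

df <- data.frame(item = 1:3, cost = c(4L, 10L, 3L),
                 blue = factor(c(1, 1, 0)))

# Column types, including the factor, are stored in the file's schema,
# so no separate classes file is needed.
write_parquet(df, "df.parquet")

df2 <- read_parquet("df.parquet")
str(df2)  # blue is read back as a factor
```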

nicola

Try the csvy package. Also see the http://csvy.org/ site. This generates a single file rather than two, which simplifies working with it (optionally it can write the metadata to a separate file); csvy readers are available in some other languages as well (see the link just cited); and the format is standardized and backwards compatible with csv, which is probably better than rolling your own format.

library(csvy)
write_csvy(df, "df.csvy")

This produces this file:

#---
#profile: tabular-data-package
#name: df
#fields:
#- name: item
#  type: integer
#- name: cost
#  type: integer
#- name: blue
#  type: integer
#--- 
item,cost,blue
1,4,1
2,10,1
3,3,0

which can be read back in using:

read_csvy("df.csvy")

or with read.csv("df.csvy", comment.char = "#"), or with any number of R packages that have functions to read csv files.

We can extract the metadata as a list using:

library(yaml)
md <- get_yaml_header("df.csvy")
md_list <- yaml.load(paste(md, collapse = "\n"))

str(md_list)
## List of 3
##  $ profile: chr "tabular-data-package"
##  $ name   : chr "df"
##  $ fields :List of 3
##   ..$ :List of 2
##   .. ..$ name: chr "item"
##   .. ..$ type: chr "integer"
##   ..$ :List of 2
##   .. ..$ name: chr "cost"
##   .. ..$ type: chr "integer"
##   ..$ :List of 2
##   .. ..$ name: chr "blue"
##   .. ..$ type: chr "integer"

Added

fwrite in the data.table package, with the argument yaml = TRUE, can be used to write csvy files with slightly different content in the yaml header. The export function of the rio package can also generate csvy files using fwrite.
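A minimal sketch of that fwrite/fread route (it rebuilds the question's df; argument names are as documented in data.table):

```r
library(data.table)

df <- data.frame(item = 1:3, cost = c(4L, 10L, 3L),
                 blue = factor(c(1, 1, 0)))

# yaml = TRUE prepends a yaml header recording column names and classes.
fwrite(df, "df_yaml.csv", yaml = TRUE)

# fread with yaml = TRUE reads the header and applies the recorded classes.
df2 <- fread("df_yaml.csv", yaml = TRUE)
str(df2)  # check which classes survive the round trip
```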

G. Grothendieck

Another alternative is to use the readr::spec_csv() function.

  1. Pass the csv file (or the final dataframe) that you want to spec_csv(); this produces a cols object that records the column names and their col_type arguments.

  2. You can save that specification as csv and then supply it directly to the col_types argument for future use.
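A minimal sketch of those two steps (it writes a small csv first so the snippet runs on its own; note that spec_csv guesses types from the data, so a column the question stored as a factor will be guessed as numeric here):

```r
library(readr)

write_csv(data.frame(item = 1:3, cost = c(4L, 10L, 3L), blue = c(1L, 1L, 0L)),
          "data.csv")

# Step 1: generate the column specification from the csv.
spec <- spec_csv("data.csv")
spec  # prints a cols(...) specification listing each column's type

# Step 2: reuse the specification for future reads.
df2 <- read_csv("data.csv", col_types = spec)
```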

alejandro_hagan