
I have an R data frame that I would like to convert into a Spark data frame on a remote cluster. I have decided to write my data frame to an intermediate CSV file that is then read using sparklyr::spark_read_csv(). I am doing this because the data frame is too big to send directly using sparklyr::sdf_copy_to() (which I think is due to a limitation in Livy).

I would like to programmatically transfer the R column types used in the data frame to the new Spark data frame by writing a function that returns a named vector that I can use with the columns argument of spark_read_csv().
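
For illustration, this is the kind of call I want to end up with, except with the named vector of column types produced programmatically instead of written by hand (sc is an existing spark_connection set up through Livy; the column names here are made up):

library(sparklyr)

sdf <- spark_read_csv(
    sc,
    name         = "my_table",
    path         = "path/to/intermediate.csv",   # must be a path the cluster can read
    header       = TRUE,
    columns      = c(id = "integer", amount = "double", label = "string"),
    infer_schema = FALSE                          # use the supplied types rather than inferring them
)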

Alex

2 Answers

  1. Look into the Apache Arrow project; it supports converting native R types to Spark types.
  2. Create a vector of your current data types and map them to Spark types using a cast.

These are the only two ways I can think of right now; a rough sketch of each follows.
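
Something along these lines (sc is assumed to be an existing spark_connection, df a local R data frame, and the column names are purely illustrative):

library(arrow)     # when the arrow package is attached, sparklyr can use Arrow
library(sparklyr)  # for serialisation, which carries over native column types
library(dplyr)

# Option 1: copy with Arrow enabled
df_tbl <- sdf_copy_to(sc, df, name = "df_spark", overwrite = TRUE)

# Option 2: cast columns after loading; dplyr verbs translate to Spark SQL CAST
df_tbl <- df_tbl %>%
    mutate(
        id     = as.integer(id),
        amount = as.numeric(amount),
        label  = as.character(label)
    )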

nareshbabral

I only have rudimentary knowledge of mapping R data types (specifically, those returned by the class() function) to Spark data types. However, the following function seems to work as I expect. Hopefully others will find it useful or improve on it:

get_spark_data_types_from_data_frame_types <- function(df) {

    # Lookup table: R class (as returned by class()) -> Spark column type
    r_types <-
        c("logical", "numeric", "integer", "character", "list", "factor")

    spark_types <-
        c("boolean", "double", "integer", "string", "array", "string")

    # Take only the first class of each column so that columns with more
    # than one class (e.g. POSIXct) still yield a single string
    types_in <- vapply(df, function(x) class(x)[1], character(1))

    types_out <- spark_types[match(types_in, r_types)]

    # Fall back to "string" for any class not covered by the lookup table
    types_out[is.na(types_out)] <- "string"

    names(types_out) <- names(df)

    return(types_out)
}
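
For example, the result can be passed straight to the columns argument of spark_read_csv() (a sketch; sc is an existing spark_connection and the intermediate CSV must be on a path the cluster can read):

library(sparklyr)

write.csv(df, "df.csv", row.names = FALSE)

col_types <- get_spark_data_types_from_data_frame_types(df)

sdf <- spark_read_csv(
    sc,
    name         = "df_spark",
    path         = "df.csv",
    header       = TRUE,
    columns      = col_types,
    infer_schema = FALSE   # rely on the supplied types instead of schema inference
)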
Alex