
I have an R data frame that I would like to convert into a Spark data frame on a remote cluster. I have decided to write my data frame to an intermediate CSV file that is then read using sparklyr::spark_read_csv(). I am doing this because the data frame is too big to send directly using sparklyr::sdf_copy_to() (which I think is due to a limitation in Livy).

I would like to programmatically transfer the R column types used in the data frame to the new Spark data frame by writing a function that returns a named vector that I can use with the columns argument of spark_read_csv().
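
For illustration, this is the kind of call I want to end up with, except with the named vector of column types produced programmatically instead of written by hand (sc is an existing spark_connection set up through Livy; the column names here are made up):

library(sparklyr)

sdf <- spark_read_csv(
    sc,
    name         = "my_table",
    path         = "path/to/intermediate.csv",   # must be a path the cluster can read
    header       = TRUE,
    columns      = c(id = "integer", amount = "double", label = "string"),
    infer_schema = FALSE                          # use the supplied types rather than inferring them
)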

Alex

2 Answers

  1. Look into the Apache Arrow project; it supports converting native R types to Spark types.
  2. Create a vector of your current data types and map them to Spark types using a cast.

These are the only two ways I can think of right now; a rough sketch of each follows.
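
Something along these lines (sc is assumed to be an existing spark_connection, df a local R data frame, and the column names are purely illustrative):

library(arrow)     # when the arrow package is attached, sparklyr can use Arrow
library(sparklyr)  # for serialisation, which carries over native column types
library(dplyr)

# Option 1: copy with Arrow enabled
df_tbl <- sdf_copy_to(sc, df, name = "df_spark", overwrite = TRUE)

# Option 2: cast columns after loading; dplyr verbs translate to Spark SQL CAST
df_tbl <- df_tbl %>%
    mutate(
        id     = as.integer(id),
        amount = as.numeric(amount),
        label  = as.character(label)
    )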

nareshbabral

I only have rudimentary knowledge of mapping R data types (specifically, those returned by the class() function) to Spark data types. However, the following function seems to work as I expect. Hopefully others will find it useful or improve on it:

get_spark_data_types_from_data_frame_types <- function(df) {

    # Lookup table: R class (as returned by class()) -> Spark column type
    r_types <-
        c("logical", "numeric", "integer", "character", "list", "factor")

    spark_types <-
        c("boolean", "double", "integer", "string", "array", "string")

    # Take only the first class of each column so that columns with more
    # than one class (e.g. POSIXct) still yield a single string
    types_in <- vapply(df, function(x) class(x)[1], character(1))

    types_out <- spark_types[match(types_in, r_types)]

    # Fall back to "string" for any class not covered by the lookup table
    types_out[is.na(types_out)] <- "string"

    names(types_out) <- names(df)

    return(types_out)
}
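
For example, the result can be passed straight to the columns argument of spark_read_csv() (a sketch; sc is an existing spark_connection and the intermediate CSV must be on a path the cluster can read):

library(sparklyr)

write.csv(df, "df.csv", row.names = FALSE)

col_types <- get_spark_data_types_from_data_frame_types(df)

sdf <- spark_read_csv(
    sc,
    name         = "df_spark",
    path         = "df.csv",
    header       = TRUE,
    columns      = col_types,
    infer_schema = FALSE   # rely on the supplied types instead of schema inference
)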
Alex