
Using the sparklyr library, I try to create a date variable in a Spark dataframe the following way (which works in plain R):

# Researching SPARK --------------------------------------------------------------------------

#library(data.table)
library(sparklyr)
library(dplyr)

setwd('C:/Users/aburnakov/Desktop')

#spark_install(version = "2.1.0")

r_spark_connection <- spark_connect(master = "local")

sample_dat <- data.frame(When = as.character(
     c(
     "2018-01-15 03:05:02.177"
     , "2018-01-15 00:54:31.133"
     , "2018-01-15 21:24:06.013"
     , "2018-01-15 15:44:26.047" 
     , "2018-01-15 05:17:06.040"
     , "2018-01-15 06:41:08.183"
     , "2018-01-15 15:09:40.137"
     , "2018-01-15 03:15:43.820"
     , "2018-01-15 11:02:27.180"
     , "2018-01-15 18:23:42.047"
          )
     )
)

# write.csv stores row names as an unnamed first column,
# which Spark later reads back as _c0
write.csv(x = sample_dat, file = 'sample_dat.csv')

## write raw data from csv to spark env ------------------------------------------

sample_spark_df <- spark_read_csv(
     sc = r_spark_connection
     , name = 'sample_dat'
     , path = 'sample_dat.csv'
     , header = TRUE
     , infer_schema = FALSE
     , delimiter = ","
     , quote = '"'
     , escape = '`'
     , charset = "UTF-8"
     , null_value = NULL
     , repartition = 10
     , memory = FALSE
     , overwrite = TRUE
)


## try either of the two

sample_spark_df <- sample_spark_df %>%
     mutate(
          Date = as.Date(When, format = "%Y-%m-%d", tz = "")
     )

sample_spark_df <- sample_spark_df %>%
     mutate(
          datetime_when = as.POSIXct(strptime(x = When, format = "%Y-%m-%d %H:%M:%OS", tz = ""))
     )


## now observe the error (it surfaces only here: Spark evaluates lazily,
## so the failing mutate() does not run until collect())

x <- collect(sample_spark_df)

Why does this happen? Can I still create the dates while specifying a format and a timezone?

A similar problem: Converting string/chr to date using sparklyr

Alexey Burnakov
  • I can't reproduce your error. Could you include a sample of sample_spark_df? – Rilcon42 Feb 01 '18 at 20:07
  • @Rilcon42, I forgot to mention that, strangely, this error is not reported until I try to collect() the object from Spark into R to check the result; collect() fails. I will update my question. – Alexey Burnakov Feb 02 '18 at 09:08
  • @Rilcon42, I actually have a problem with **datetime_when = as.POSIXct(strptime(x = When, format = "%Y-%m-%d %H:%M:%OS", tz = ""))** as well, getting another error. – Alexey Burnakov Feb 02 '18 at 09:24

1 Answer


I have not been able to read your dates directly with sparklyr, but if you want to read them as POSIXct, you can use the DBI package for that:

res <- DBI::dbGetQuery(r_spark_connection, "SELECT _c0, When,
unix_timestamp(When, \"yyyy-MM-dd HH:mm:ss.SSS\") AS dateTS FROM sample_dat")

res %>% mutate(dateRestored = as.POSIXct(dateTS, origin = "1970-01-01"))
   _c0                    When     dateTS        dateRestored
1   10 2018-01-15 18:23:42.047 1516037022 2018-01-15 18:23:42
2    1 2018-01-15 03:05:02.177 1515981902 2018-01-15 03:05:02
3    2 2018-01-15 00:54:31.133 1515974071 2018-01-15 00:54:31
4    3 2018-01-15 21:24:06.013 1516047846 2018-01-15 21:24:06
5    4 2018-01-15 15:44:26.047 1516027466 2018-01-15 15:44:26
6    5 2018-01-15 05:17:06.040 1515989826 2018-01-15 05:17:06
7    6 2018-01-15 06:41:08.183 1515994868 2018-01-15 06:41:08
8    7 2018-01-15 15:09:40.137 1516025380 2018-01-15 15:09:40
9    8 2018-01-15 03:15:43.820 1515982543 2018-01-15 03:15:43
10   9 2018-01-15 11:02:27.180 1516010547 2018-01-15 11:02:27

The idea is to use this package to run the query through Spark SQL, letting the unix_timestamp function turn the character dates into epoch seconds; as.POSIXct then converts those seconds into timestamps back in R. Note that unix_timestamp returns whole seconds, so the milliseconds in the original strings are dropped (as dateRestored above shows).
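
A side note, not from the original answer: dbplyr leaves function calls it does not recognize untranslated and passes them straight through to Spark SQL, so the same unix_timestamp conversion can also be expressed inside a sparklyr mutate(). A minimal sketch under that assumption, reusing sample_spark_df from the question:

# unix_timestamp is not an R function here: dbplyr passes it through
# verbatim, so it runs as Hive/Spark SQL inside Spark
sample_spark_df <- sample_spark_df %>%
     mutate(dateTS = unix_timestamp(When, "yyyy-MM-dd HH:mm:ss.SSS"))

# collect the epoch seconds, then restore POSIXct locally as in the answer
x <- collect(sample_spark_df) %>%
     mutate(dateRestored = as.POSIXct(dateTS, origin = "1970-01-01"))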

I hope this helps.

Jaime Caffarel
  • Thank you! I will consider this option. I have already found a solution using SQL calls inside the Spark df formatting and managed to convert the strings, painfully, to dates. I just realized that lots of things that work in plain R are not supported the same way in sparklyr. – Alexey Burnakov Feb 05 '18 at 09:34
  • You are welcome :-) Working with dates is usually painful... I use the DBI package to perform these kinds of operations, which might be somewhat easier (or have support) using Spark SQL. You can also use the `dbSendQuery()` function to "program" the execution of a query without pulling the result into R. I use this function to create intermediate tables, for example (see the sketch after these comments). – Jaime Caffarel Feb 05 '18 at 09:56
  • hey @JaimeCaffarel can I use `dbSendQuery` inside a `mutate` for a SPARK dataframe? I do not want to collect my result yet, but I need to create these time variables. – ℕʘʘḆḽḘ Apr 19 '18 at 14:30
  • @ℕʘʘḆḽḘ I'm not really sure about that. `dbSendQuery` returns a `DBIResult`, and `mutate` receives a `tbl`, so some kind of "integration" between them would be necessary. On the other hand, I don't think you can directly execute the `as.POSIXct` function inside Spark with sparklyr, but I might be wrong... – Jaime Caffarel Apr 19 '18 at 14:46
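
To illustrate the dbSendQuery() pattern from the comments above: a minimal, hypothetical sketch that materializes the parsed timestamps as a new Spark table without collecting anything into R. It reuses r_spark_connection and the sample_dat table from the question; the table name sample_dat_ts is made up:

# create an intermediate Spark table holding the parsed timestamps;
# nothing is pulled into the R session at this point
DBI::dbSendQuery(r_spark_connection, "
     CREATE TABLE sample_dat_ts AS
     SELECT _c0, When,
            unix_timestamp(When, 'yyyy-MM-dd HH:mm:ss.SSS') AS dateTS
     FROM sample_dat")

# the new table can then be referenced lazily from sparklyr
sample_ts_df <- dplyr::tbl(r_spark_connection, 'sample_dat_ts')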