5

I have some unix times that I convert to timestamps in sparklyr and for some reasons I also need to convert them into strings.

Unfortunately, it seems that during the conversion to string hive converts to EST (my locale).

df_new <- spark_read_parquet(sc, "/mypath/parquet_*",
                             overwrite = TRUE,
                             name = "df_new",
                             memory = FALSE,
                             options = list(mergeSchema = "true"))
> df_new %>%  
           mutate(unix_t = from_utc_timestamp(timestamp(t) ,'UTC'),
           date_str = date_format(unix_t, 'yyyy-MM-dd HH:mm:ss z'),
           date_alt = to_date(from_utc_timestamp(timestamp(t) ,'UTC'))) %>% 
    select(t, unix_t, date_str, date_alt) %>% head(5)
# Source:   lazy query [?? x 4]
# Database: spark_connection
            t unix_t              date_str                date_alt  
        <dbl> <dttm>              <chr>                   <date>    
1 1419547405. 2014-12-25 22:43:25 2014-12-25 17:43:25 EST 2014-12-25
2 1418469714. 2014-12-13 11:21:54 2014-12-13 06:21:54 EST 2014-12-13
3 1419126103. 2014-12-21 01:41:43 2014-12-20 20:41:43 EST 2014-12-20
4 1419389856. 2014-12-24 02:57:36 2014-12-23 21:57:36 EST 2014-12-23
5 1418271811. 2014-12-11 04:23:31 2014-12-10 23:23:31 EST 2014-12-10

As you can see both date_str and date_alt use the EST timezone. I need UTC here. How can I do that?

Thanks!

ℕʘʘḆḽḘ
  • 18,566
  • 34
  • 128
  • 235

3 Answers3

7

From the Hive function reference, date_format uses Java's SimpleDateFormat, which I believe always defaults to the JVM time zone, this explains why this gets you a character string converted to your time zone.

One option is to detect the time zone and manually add the hours to get UTC.

Another option would be to use lubridate with spark_apply():

sdf_len(sc, 1) %>%
  mutate(unix_t = from_utc_timestamp(timestamp(1522371003) , 'UDT')) %>%
  spark_apply(
    function(e) {
      dplyr::mutate(
        e,
        time_str = as.character(
          lubridate::with_tz(
            as.POSIXct(unix_t, origin="1970-01-01"),
            "GMT"
          )
        )
      )
    },
    columns = c("id", "unix_t", "time_str"))
Javier Luraschi
  • 912
  • 5
  • 4
0

It's possible that sparklyr is doing some weird translation of timezones into the hive functions. I'd try registering the dataframe as a table and doing the manipulation with pure HQL:

createOrReplaceTempView(df_new, "df_new")
result <- sql("select from_utc_timestamp(timestamp(t) ,'UTC'),
   cast(from_utc_timestamp(timestamp(t) ,'UTC') as STRING),
   cast(from_utc_timestamp(timestamp(t) ,'UTC') as DATE)
   from df_new")
head(result)

edit

If you're unfamiliar with SQL-languages, you can add any of the variables from df_new as a comma separated list like so (and rename your selections with as)

select var1, var2, t,
   from_utc_timestamp(timestamp(t) ,'UTC') as unix_t,
   cast(from_utc_timestamp(timestamp(t) ,'UTC') as STRING) as date_str,
   cast(from_utc_timestamp(timestamp(t) ,'UTC') as DATE) as date_alt
   from df_new

You can also use * to represent all variables from the data frame:

select *,
   from_utc_timestamp(timestamp(t) ,'UTC') as unix_t,
   cast(from_utc_timestamp(timestamp(t) ,'UTC') as STRING) as date_str,
   cast(from_utc_timestamp(timestamp(t) ,'UTC') as DATE) as date_alt
   from df_new
Pdubbs
  • 1,967
  • 2
  • 11
  • 20
  • do you know how can I use these SQL `cast` directly in the mutate above? – ℕʘʘḆḽḘ Mar 29 '18 at 19:13
  • I believe all of the casting in mutate has to be done through R functions, and that those are the source of the issue, which is why I suggest using HQL to do this particular manipulation. Is there a reason it needs to be in the mutate? – Pdubbs Mar 29 '18 at 20:35
  • thanks, i dont need the mutate, but i do need all the other variable as well in the same table. – ℕʘʘḆḽḘ Mar 29 '18 at 20:39
  • you can add them to the select. If you're unfamiliar with SQL, I'll edit to show how. – Pdubbs Mar 30 '18 at 00:00
-1

Try using as.POSIXct() ?

format(as.POSIXct(unix_t, origin = unix_t, tz = "UTC", usetz=TRUE),"%Y-%m-%d %H:mm:ss") 

This will first convert unix timestamp to UTC and then formatted to desired string.

Soni007
  • 103
  • 2
  • 13