
Consider the following example:

dataframe_test <- data_frame(mydate = c('2011-03-01T00:00:04.226Z', '2011-03-01T00:00:04.226Z'))

# A tibble: 2 x 1
                    mydate
                     <chr>
1 2011-03-01T00:00:04.226Z
2 2011-03-01T00:00:04.226Z

sdf <- copy_to(sc, dataframe_test, overwrite = TRUE)

> sdf
# Source:   table<dataframe_test> [?? x 1]
# Database: spark_connection
                    mydate
                     <chr>
1 2011-03-01T00:00:04.226Z
2 2011-03-01T00:00:04.226Z

I would like to reformat this character timestamp into a more conventional format. I tried to do so with `regexp_replace`, but it fails:

> sdf <- sdf %>% mutate(regex = regexp_replace(mydate, '(\\d{4})-(\\d{2})-(\\d{2})T(\\d{2}):(\\d{2}):(\\d{2}).(\\d{3})Z', '$1-$2-$3 $4:$5:$6.$7'))
> sdf
# Source:   lazy query [?? x 2]
# Database: spark_connection
                    mydate                    regex
                     <chr>                    <chr>
1 2011-03-01T00:00:04.226Z 2011-03-01T00:00:04.226Z
2 2011-03-01T00:00:04.226Z 2011-03-01T00:00:04.226Z

Any ideas? What is the correct syntax?

ℕʘʘḆḽḘ
  • The pattern is correct (you could use a literal `.` in place of the wildcard); you're just using the wrong function. – zero323 Jun 20 '17 at 17:56
  • wait a sec, please. which function should I use? your link actually specifies the same function I use – ℕʘʘḆḽḘ Jun 20 '17 at 18:00
  • Take a closer look - it is `regexp_replace`, not `regexp_extract` :) – zero323 Jun 20 '17 at 18:01
  • woooooooooooooooooooooot !!! – ℕʘʘḆḽḘ Jun 20 '17 at 18:01
  • actually it still does not work... let me update the question. but thanks for the catch!!!! – ℕʘʘḆḽḘ Jun 20 '17 at 18:02
  • done. can you please have a look again @zero323 ? thanks!! – ℕʘʘḆḽḘ Jun 20 '17 at 18:03
  • @zero323 can you please re-open? – ℕʘʘḆḽḘ Jun 20 '17 at 18:09
  • I believe this is still a duplicate - I was just wrong about the pattern. Please note that it has to match a whole string and you didn't escape everything: `sdf %>% mutate(regex = regexp_replace(mydate, '^(\\\\d{4})-(\\\\d{2})-(\\\\d{2})T(\\\\d{2}):(\\\\d{2}):(\\\\d{2}).(\\\\d{3})Z$', '$1-$2-$3 $4:$5:$6.$7'))`. You could use `regexp_extract`, but it would require enumerating all fields: `sdf %>% mutate(regex = regexp_extract(mydate, '^(\\\\d{4})-(\\\\d{2})-(\\\\d{2})T(\\\\d{2}):(\\\\d{2}):(\\\\d{2}).(\\\\d{3})Z$', 1))` – zero323 Jun 20 '17 at 18:12
  • argh that works now. well in that case I think this is an interesting answer. why do I need to escape twice? the official documentation says `\d` needs to be written `\\d`. Instead you suggest I should escape even more. why is that? – ℕʘʘḆḽḘ Jun 20 '17 at 18:15
  • You have to escape once for R and once for Java, I am afraid. If you think this should be a separate answer, I can reopen it. – zero323 Jun 20 '17 at 18:18
  • yes, I think this is a very useful answer. Nobody knows that you need to combine the escapes, one for each language! :D – ℕʘʘḆḽḘ Jun 20 '17 at 18:19

2 Answers


Spark SQL and Hive provide two different functions:

  • regexp_extract - takes a string, a pattern, and the index of the group to be extracted.
  • regexp_replace - takes a string, a pattern, and a replacement string.

The former can be used to extract a single group, with the same group-index semantics as java.util.regex.Matcher.
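
For example, a quick sketch against the sample sdf above (group 0 is the whole match, exactly as in Matcher):

sdf %>% mutate(
  whole = regexp_extract(mydate, '([0-9]{4})-([0-9]{2})', 0),  # "2011-03"
  year  = regexp_extract(mydate, '([0-9]{4})-([0-9]{2})', 1),  # "2011"
  month = regexp_extract(mydate, '([0-9]{4})-([0-9]{2})', 2))  # "03"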

For regexp_replace the pattern has to match the whole string; if there is no match, the input string is returned:

sdf %>% mutate(
  regex = regexp_replace(mydate, '^([0-9]{4}).*', "$1"),
  regexp_bad = regexp_replace(mydate, '([0-9]{4})', "$1"))

## Source:   query [2 x 3]
## Database: spark connection master=local[8] app=sparklyr local=TRUE
## 
## # A tibble: 2 x 3
##                     mydate regex               regexp_bad
##                      <chr> <chr>                    <chr>
## 1 2011-03-01T00:00:04.226Z  2011 2011-03-01T00:00:04.226Z
## 2 2011-03-01T00:00:04.226Z  2011 2011-03-01T00:00:04.226Z

while with regexp_extract it is not required:

sdf %>% mutate(regex = regexp_extract(mydate, '([0-9]{4})', 1))

## Source:   query [2 x 2]
## Database: spark connection master=local[8] app=sparklyr local=TRUE
## 
## # A tibble: 2 x 2
##                     mydate regex
##                      <chr> <chr>
## 1 2011-03-01T00:00:04.226Z  2011
## 2 2011-03-01T00:00:04.226Z  2011

Also, due to indirect execution (R -> Java), you have to escape twice:

sdf %>% mutate(
  regex = regexp_replace(
    mydate, 
    '^(\\\\d{4})-(\\\\d{2})-(\\\\d{2})T(\\\\d{2}):(\\\\d{2}):(\\\\d{2}).(\\\\d{3})Z$',
    '$1-$2-$3 $4:$5:$6.$7'))
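
To see what each layer consumes, you can inspect the strings in plain R (no Spark needed); a minimal sketch of the escaping chain:

# Each level of string parsing strips one layer of backslashes:
# R source '\\\\d' -> R string value \\d -> Spark SQL literal -> Java regex \d
cat('\\\\d{4}\n')  # prints \\d{4} -- what ends up in the generated SQL
cat('\\d{4}\n')    # prints \d{4}  -- enough for an R-side regex, but not for Spark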

Normally one would use Spark datetime functions:

spark_session(sc) %>%  
  invoke("sql",
    "SELECT *, DATE_FORMAT(CAST(mydate AS timestamp), 'yyyy-MM-dd HH:mm:ss.SSS') parsed from dataframe_test") %>% 
  sdf_register


## Source:   query [2 x 2]
## Database: spark connection master=local[8] app=sparklyr local=TRUE
## 
## # A tibble: 2 x 2
##                     mydate                  parsed
##                      <chr>                   <chr>
## 1 2011-03-01T00:00:04.226Z 2011-03-01 01:00:04.226
## 2 2011-03-01T00:00:04.226Z 2011-03-01 01:00:04.226

but sadly sparklyr seems to be extremely limited in this area, and treats timestamps as strings.
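
That said, functions that dplyr does not recognise in mutate are passed through to Spark SQL verbatim, so a dplyr-side sketch along these lines should also work (assuming Spark >= 2.2, where TO_TIMESTAMP is available):

sdf %>% mutate(
  # to_timestamp and date_format are not dplyr verbs; sparklyr forwards
  # them as-is to Spark SQL (TO_TIMESTAMP / DATE_FORMAT)
  parsed = date_format(to_timestamp(mydate), "yyyy-MM-dd HH:mm:ss.SSS"))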

See also change string in DF using hive command and mutate with sparklyr.

zero323

I had some difficulty replacing "." with "", but it finally works with:

mutate(myvar2 = regexp_replace(myvar, "[.]", ""))
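
The character class sidesteps the double-escaping issue entirely: `.` is literal inside `[...]`, so no backslashes are needed at either the R or the Java level. Applied to the sample data from the question (a sketch; `mydate` as above):

# "[.]" matches a literal dot without any escaping, so the pattern
# survives both R and Spark SQL string parsing unchanged
sdf %>% mutate(nodots = regexp_replace(mydate, "[.]", ""))
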
Pang