4

I would like to extract the last part of the string (after the last forward slash). When I use the following code it fails with the error:

library(sparklyr)
library(tidyverse)

sc <- spark_connect(method = "databricks")

tibble(my_string = c("aaa/bbb/ccc", "ddd/eee/fff", "ggg/hhh/iii")) %>%
  copy_to(sc, ., "test_regexp_extract", overwrite = TRUE) %>%
  mutate(my_result = regexp_extract(my_string, "(?<=/)[^/]*$")) %>%
  count(my_result)

java.lang.IllegalArgumentException: Regex group count is 0, but the specified group index is 1

show_query() shows the following query:

<SQL>
SELECT `my_result`, COUNT(*) AS `n`
FROM (SELECT `my_string`, regexp_extract(`my_string`, "(?<=/)[^/]*$") AS `my_result`
FROM `test_regexp_extract`) `q01`
GROUP BY `my_result`

I suspect it might be the problem of the forward slash character in the regex. I tried to escape it with a backslash.
Another idea is that for some reason it fails when no match is found. According to the Spark documentation it should work though:

If the regex did not match, or the specified group did not match, an empty string is returned.

Any idea why it fails? In R, the same code works fine.

ZygD
  • 22,092
  • 39
  • 79
  • 102
Piotr K
  • 943
  • 9
  • 20
  • So have you tried escaping the `/` : `(?<=\/)[^\/]*$` ? – gatear Dec 01 '21 at 12:24
  • Yes, this is what I tried. The error goes: Error : '\/' is an unrecognized escape in character string starting ""(?<=\/" Some( Error: '\/' is an unrecognized escape in character string starting ""(?<=\/" ) – Piotr K Dec 01 '21 at 13:49
  • `regexp_extract(my_string, "(?<=\\/)[^\\/]*$"))` translates into `(?<=\/)[^\/]*$` but the error is again `Regex group count is 0, but the specified group index is 1`. – Piotr K Dec 01 '21 at 13:52
  • 1
    I managed to make it work by using this regex `([^/]*)$` instead. Still, I would love to learn what went wrong in the original regex. – Piotr K Dec 01 '21 at 17:08
  • 1
    You need to specify at least one capture group in the regex pattern (just like you did in your working version `([^/]*)$`) – 00schneider Mar 15 '23 at 10:14

1 Answers1

1

regexp_extract by default returns the first group. Your regex pattern (?<=/)[^/]*$ has no groups, so this is where the problem lies. You could:

  • create a group using (), e.g. (?<=/)([^/]*)$
  • add 0 as the 3rd parameter to the function: regexp_extract(my_string, "(?<=/)[^/]*$", 0). This way, the full regex match will be returned as opposed to only the first group.
ZygD
  • 22,092
  • 39
  • 79
  • 102