I'm reading in data from another platform where a combination of the strings listed below is used for expressing timestamps:

\* = current time 
t = current day (00:00)
mo = month 
d = days 
h = hours
m = minutes 

For example, *-3d is the current time minus 3 days, and t-3h is three hours before midnight today (i.e. 21:00 yesterday).

I'd like to ingest these expressions into R and get the corresponding POSIXct value. I'm trying to do this with regex in the function below, but I lose the numeric multiplier for each unit:

strTimeConverter <- function(z){
  ret <- stringi::stri_replace_all_regex(
    str = z, 
    pattern = c('^\\*', 
                '^t', 
                '([[:digit:]]{1,})mo', 
                '([[:digit:]]{1,})d', 
                '([[:digit:]]{1,})h',
                '([[:digit:]]{1,})m'),
    replacement = c('Sys.time()', 
                    'Sys.Date()', 
                    '*lubridate::months(1)', 
                    '*lubridate::days(1)', 
                    '*lubridate::hours(1)', 
                    '*lubridate::minutes(1)'),
    vectorize_all = F
  )
  return(ret)
  # return(eval(expr = parse(text = ret)))
}

> strTimeConverter('*-5mo+3d+4h+2m')
[1] "Sys.time()-*lubridate::months(1)+*lubridate::days(1)+*lubridate::hours(1)+*lubridate::minutes(1)"

> strTimeConverter('t-5mo+3d+4h+2m')
[1] "Sys.Date()-*lubridate::months(1)+*lubridate::days(1)+*lubridate::hours(1)+*lubridate::minutes(1)"

Expected output:

# *-5mo+3d+4h+2m
"Sys.time()-5*lubridate::months(1)+3*lubridate::days(1)+4*lubridate::hours(1)+2*lubridate::minutes(1)"

# t-5mo+3d+4h+2m
"Sys.Date()-5*lubridate::months(1)+3*lubridate::days(1)+4*lubridate::hours(1)+2*lubridate::minutes(1)"

I assumed that wrapping the [[:digit:]]{1,} in parentheses () would preserve the digits, but clearly that's not working. I defined the patterns this way because otherwise the code replaces text it has just inserted, e.g. * gets converted to Sys.time() but then the m in Sys.time() gets replaced with *lubridate::minutes(1).

I plan on converting the (expected) output to R date-time using eval(parse(text = ...)) - currently commented out in the function.

I'm open to using other packages or approaches.

Update

After tinkering around for a bit, I found that the version below works. The replacements are applied in an order such that newly inserted characters are not replaced again:

strTimeConverter <- function(z){
  ret <- stringi::stri_replace_all_regex(
    str = z, 
    pattern = c('y', 'd', 'h', 'mo', 'm', '^t', '^\\*'),
    replacement = c('*years(1)',
                    '*days(1)', 
                    '*hours(1)', 
                    '*days(30)',
                    '*minutes(1)',
                    'Sys.Date()', 
                    'Sys.time()'),
    vectorize_all = F
  )
  ret <- gsub(pattern = '\\*', replacement = '*lubridate::', x = ret)
  rdate <- (eval(expr = parse(text = ret)))
  attr(rdate, 'tzone') <- 'UTC'
  return(rdate)
}
sample_string <- '*-5mo+3d+4h+2m'
strTimeConverter(sample_string)

This works but is not very elegant, and it will likely fail once I have to incorporate other expressions (e.g. yd for day of the year, such as 124).

Wiktor Stribiżew
Gautam

2 Answers

You can use backreferences in the replacements like this:

library(stringr)
x <- c("*-5mo+3d+4h+2m", "t-5mo+3d+4h+2m")
repl <- c('^\\*' = 'Sys.time()', '^t' = 'Sys.Date()', '(\\d+)mo' = '\\1*lubridate::months(1)', '(\\d+)d' = '\\1*lubridate::days(1)',  '(\\d+)h' =  '\\1*lubridate::hours(1)', '(\\d+)m' = '\\1*lubridate::minutes(1)')
stringr::str_replace_all(x, repl)
## => [1] "Sys.time()-5*lubridate::months(1)+3*lubridate::days(1)+4*lubridate::hours(1)+2*lubridate::minutes(1)"
##    [2] "Sys.Date()-5*lubridate::months(1)+3*lubridate::days(1)+4*lubridate::hours(1)+2*lubridate::minutes(1)"

See the R demo online.

See, for example, '(\\d+)mo' = '\\1*lubridate::months(1)'. Here, (\d+)mo matches and captures into Group 1 one or more digits, and mo is just matched. Then, when the match is found, \1 in \1*lubridate::months(1) inserts the contents of Group 1 into the resulting string.
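The capture-and-reinsert behaviour can also be checked in isolation with base R's `gsub()`, which uses the same `\\1` syntax. This is only a small sketch; the input string and the variable names `with_ref` and `without_ref` are made up for illustration:

```r
# Group 1 captures the digits; "\\1" re-inserts them in the replacement
with_ref <- gsub("(\\d+)mo", "\\1*lubridate::months(1)", "*-5mo")
with_ref
# [1] "*-5*lubridate::months(1)"

# Without the backreference the captured digits are simply dropped,
# which is the behaviour seen in the question
without_ref <- gsub("(\\d+)mo", "*lubridate::months(1)", "*-5mo")
without_ref
# [1] "*-*lubridate::months(1)"
```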

Note that it might make the replacements safer if you cap the time period match with a word boundary (\b) on the right:

repl <- c('^\\*' = 'Sys.time()', '^t' = 'Sys.Date()', '(\\d+)mo\\b' = '\\1*lubridate::months(1)', '(\\d+)d\\b' = '\\1*lubridate::days(1)',  '(\\d+)h\\b' =  '\\1*lubridate::hours(1)', '(\\d+)m\\b' = '\\1*lubridate::minutes(1)')

It won't work if the time spans are glued one to another without any non-word delimiters, but you have + in your example strings, so it is safe here.
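To see what the boundary changes, here is a small base R sketch that applies the minutes and days patterns on their own (so the `mo` rule has not run first). The sample strings and variable names are illustrative assumptions, and `perl = TRUE` is used only to make the `\\b` semantics explicit:

```r
x <- "t-5mo+2m"

# Without \\b, the minutes pattern also fires on the "m" of "mo"
no_boundary <- gsub("(\\d+)m", "\\1*minutes(1)", x, perl = TRUE)
no_boundary
# [1] "t-5*minutes(1)o+2*minutes(1)"

# With \\b, "5mo" is skipped because "m" is followed by a word character
with_boundary <- gsub("(\\d+)m\\b", "\\1*minutes(1)", x, perl = TRUE)
with_boundary
# [1] "t-5mo+2*minutes(1)"

# Glued spans: "d" is followed by "4", so there is no boundary and no match
glued <- gsub("(\\d+)d\\b", "\\1*days(1)", "t-3d4h", perl = TRUE)
glued
# [1] "t-3d4h"
```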

Actually, you can make it work with the function you used, too. Just make sure the backreferences have the $n syntax:

x <- c("*-5mo+3d+4h+2m", "t-5mo+3d+4h+2m")
pattern = c('^\\*', '^t', '(\\d+)mo', '(\\d+)d', '(\\d+)h', '(\\d+)m')
replacement = c('Sys.time()', 'Sys.Date()', '$1*lubridate::months(1)', '$1*lubridate::days(1)', '$1*lubridate::hours(1)', '$1*lubridate::minutes(1)')
stringi::stri_replace_all_regex(x, pattern, replacement, vectorize_all=FALSE)

Output:

[1] "Sys.time()-5*lubridate::months(1)+3*lubridate::days(1)+4*lubridate::hours(1)+2*lubridate::minutes(1)"
[2] "Sys.Date()-5*lubridate::months(1)+3*lubridate::days(1)+4*lubridate::hours(1)+2*lubridate::minutes(1)"
Wiktor Stribiżew
  • Thanks! I'll give it a shot - was wondering if `stringr` has a better method than `stringi` for my use case - looks like it does! – Gautam Dec 18 '20 at 21:25
  • @Gautam Yes, I actually cracked it: you can use `stringi::stri_replace_all_regex`, but the backreference syntax is `$n`, not `\n`. – Wiktor Stribiżew Dec 18 '20 at 21:32

Another option, which produces the time directly, would be the following:

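library(lubridate)

strTimeConvert <- function(base=Sys.time(), delta="-5mo+3d+4h+2m"){
  # Extract the signed multiplier in front of each unit suffix.
  # Assumes delta contains all four units; a missing unit yields NA.
  mo <- gsub(".*([+-]\\d+)mo.*", "\\1", delta)
  ds <- gsub(".*([+-]\\d+)d.*", "\\1", delta)
  hs <- gsub(".*([+-]\\d+)h.*", "\\1", delta)
  ms <- gsub(".*([+-]\\d+)m.*", "\\1", delta)
  # Add each offset to the base time as a lubridate period
  out <- base + months(as.numeric(mo)) + days(as.numeric(ds)) + 
          hours(as.numeric(hs)) + minutes(as.numeric(ms))
  out
}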
strTimeConvert()
# [1] "2020-07-21 20:32:19 EDT"
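The `gsub()` extraction above expects every unit to appear in the string. A variation that walks over whichever tokens are present might look like the sketch below. It is not part of the original answer: `applyOffsets` is a hypothetical helper name, and it swaps lubridate periods for base R's `seq()` on date-times, which rolls month-end overflow into the next month rather than returning `NA`.

```r
# Hypothetical helper: apply an offset string such as "-5mo+3d+4h+2m"
# to a POSIXct value, one token at a time, using base R only.
applyOffsets <- function(base, delta) {
  # Each token is a sign, an integer, and a unit ("mo" listed before "m")
  tokens <- regmatches(delta, gregexpr("[+-]\\d+(mo|d|h|m)", delta))[[1]]
  units <- c(mo = "month", d = "day", h = "hour", m = "min")
  for (tok in tokens) {
    n <- as.integer(sub("(mo|d|h|m)$", "", tok))   # signed multiplier
    unit <- sub("^[+-]\\d+", "", tok)              # unit suffix
    if (n != 0) {
      # seq() on POSIXct accepts steps such as "-5 month" or "3 day"
      base <- seq(base, by = paste(n, units[[unit]]), length.out = 2)[2]
    }
  }
  base
}

start <- as.POSIXct("2020-12-18 12:00:00", tz = "UTC")
applyOffsets(start, "-5mo+3d+4h+2m")
# [1] "2020-07-21 16:02:00 UTC"
applyOffsets(start, "-3d")
# [1] "2020-12-15 12:00:00 UTC"
```

Because the units live in a lookup vector, supporting another suffix would mean adding one entry and extending the token pattern, without reordering any replacements or calling `eval(parse(...))`.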
DaveArmstrong