I want to parse a character string with a year of five or more digits, e.g. 10000-01-01 01:00:00, to class POSIXct.

A 4-digit year is of course no problem. For example,

as.POSIXct("1000-01-01  01:00:00")

But with a 5-digit year, e.g. 10000, as.POSIXct errors:

as.POSIXct("10000-01-01  01:00:00") 
# Error in as.POSIXlt.character(x, tz, ...) : 
#      character string is not in a standard unambiguous format

Is there a way to handle years with more than 4 digits in R?

Henrik
WCMC
  • `as.POSIXct()` appears to be limited to 4-digit years. (But I'd be happy to be proved wrong.) However, `x <-Sys.time(); lubridate::year(x) <- 10000; x` gives me (right now!) `[1] "10000-08-02 22:27:40 BST"`, so maybe you need to take an indirect approach to creating your datetime objects. – Limey Aug 02 '22 at 21:29
  • `lubridate::make_datetime()` seems to be a route to creating your date-time objects, once you've broken the strings down to their component parts and converted them to numeric. As does `base::ISOdatetime()`. – Limey Aug 02 '22 at 21:33
  • `ISOdatetime` doesn't seem to like years beyond 10000 either - `ISOdatetime(year=10000, month=1, day=1, hour=1, min=0, sec=0)` gives `NA` while using `year=9999` works fine. Which is not surprising as it uses `as.POSIXct` inside the function. – thelatemail Aug 02 '22 at 21:39
  • Just a ref to the docs, from `?strptime`: "`%Y`: ... For input, only years `0:9999` are accepted" – Henrik Aug 02 '22 at 21:44
  • @thelatemail. OK. I didn't check `ISOdatetime()`. But `lubridate::make_datetime()` appears to behave correctly. – Limey Aug 02 '22 at 21:47
  • I'm curious about the use case for dates beyond year 10000 - is this for astronomy? I'd think for most use cases the year (or year+decimals) would suffice and be easier to work with. – Jon Spring Aug 02 '22 at 23:41
  • Hi Jon, I am finding adjacent time points (YYYY-MM-DD HH-MM-SS) whose interval is greater than 12 hours, for each of ~20k patients. If I find them for each patient (even with `by()` or parallel processing), the calculation time is really long. The trick is to assign each patient a specific year, then finding adjacent time points with >12 hour interval is fast because I only work on one vector. – WCMC Aug 04 '22 at 16:13
  • @WCMC I get a nagging feeling of an XY problem. Your "trick" seems a bit convoluted. I bet you could speed up your computations by group considerably by switching from `base::by` to `data.table`. Just my 2c. Anyway: good luck! – Henrik Aug 05 '22 at 14:55
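As a rough illustration of the grouped computation discussed in these comments, the gaps can be found in one sorted vector without giving each patient a separate year. This is only a sketch in base R: the data frame `d`, its columns `patient` and `time`, and the sample values are all made up for the example.

```r
# Hypothetical example data: two patients, time stamps in UTC.
d <- data.frame(
  patient = c("a", "a", "a", "b", "b"),
  time = as.POSIXct(c("2022-01-01 00:00:00", "2022-01-01 06:00:00",
                      "2022-01-02 00:00:00", "2022-01-01 00:00:00",
                      "2022-01-01 13:00:00"), tz = "UTC")
)

# Sort once by patient, then by time within each patient.
d <- d[order(d$patient, d$time), ]
n <- nrow(d)

# Gap in hours between each row and the previous row.
gap_h <- c(NA, diff(as.numeric(d$time)) / 3600)

# TRUE where the previous row belongs to the same patient.
same_patient <- c(FALSE, d$patient[-1] == d$patient[-n])

# Flag rows that follow a gap of more than 12 hours within one patient.
d$gap_over_12h <- same_patient & !is.na(gap_h) & gap_h > 12
```

Comparing each row's patient with the previous row's keeps gaps from being counted across patient boundaries, so the whole table is handled with a few vectorised operations.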

1 Answer

POSIXct itself has no problem with 5-digit years. The problem is the conversion from character (as.POSIXlt.character in the error message above), which uses strptime, and strptime only accepts years 0-9999 on input.

The following code produces a POSIXct object of the correct date/time:

structure(253402304400, class = c("POSIXct", "POSIXt"))
#> [1] "10000-01-01 01:00:00 GMT"
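The number above is the count of seconds since 1970-01-01 00:00:00 UTC. One way to arrive at it, sketched here on the assumption that UTC is the intended time zone, is to start from the last year that can be parsed and add one year's worth of seconds. The variable names are only illustrative.

```r
# Seconds since the epoch for the same date and time in year 9999,
# which strptime can still parse.
secs_9999 <- as.numeric(as.POSIXct("9999-01-01 01:00:00", tz = "UTC"))

# 9999 is not a leap year, so one year later is exactly 365 days on.
secs_10000 <- secs_9999 + 365 * 86400

# Attach the POSIXct class; tz = "UTC" is set explicitly so that printing
# does not depend on the local time zone.
x <- .POSIXct(secs_10000, tz = "UTC")
format(x, "%Y-%m-%d %H:%M:%S")
```

Here `secs_10000` equals 253402304400, the value passed to `structure()` above.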

POSIXlt stores the year as a plain number (years since 1900), so you can parse the string with a placeholder year, assign the real year numerically, and then convert to POSIXct. This allows the following function to be defined. It does the same as as.POSIXct but can handle large years:

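as.bigPOSIX <- function(x) {
  # Parse with a placeholder year (2000) that strptime accepts
  y <- as.POSIXlt(sub("^\\d+", "2000", x))
  # Overwrite the year with the real one; POSIXlt counts years from 1900
  y$year <- sapply(strsplit(x, "-"), function(a) as.numeric(a[1])) - 1900
  as.POSIXct(y)
}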

For example:

as.bigPOSIX(c("10000-01-01 01:00:00", "23456-03-09 12:04:01", 
               "2022-07-05 23:59:59"))
#> [1] "10000-01-01 01:00:00 GMT" "23456-03-09 12:04:01 GMT" 
#> [3] "2022-07-05 23:59:59 GMT" 
Allan Cameron
  • Great! In the second line of your function, perhaps a `sub` there as well: `as.numeric(sub("-.*", "", x)) - 1900`? – Henrik Aug 03 '22 at 01:31
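A variant along the lines of this comment might look as follows. It is a sketch rather than a drop-in replacement: the name `as_big_posix` and the `tz` argument (defaulting to "UTC") are assumptions added for the example, and the input is assumed to start with the year followed by a hyphen.

```r
as_big_posix <- function(x, tz = "UTC") {
  # Take everything before the first hyphen as the real year.
  year <- as.numeric(sub("-.*", "", x))
  # Parse with a placeholder year (2000) that strptime accepts.
  lt <- as.POSIXlt(sub("^\\d+", "2000", x), tz = tz)
  # POSIXlt stores the year as an offset from 1900.
  lt$year <- year - 1900
  as.POSIXct(lt)
}

big <- as_big_posix(c("10000-01-01 01:00:00", "2022-07-05 23:59:59"))
as.numeric(big)
```

Because the placeholder year 2000 is a leap year, this sketch does not check a February 29 against the real year, so such inputs would need separate validation.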