0

Well, first things first, I'm still a noob and am learning R. I've a a dataset with 0.9 million rows and 36 columns. Of these columns, a column, let's say DATE has dates in string format and an other column, let's say TZ has timezones as strings too.

What I'm trying to do is contract these two columns into one with type POSIXlt date, which has date, time, timezone. Here's my code for trying to get a vector of all the converted dates:

# Let's suppose my data exist in a variable "data" with dates in "DATE" column and timezones in "TZ"

indices <- NULL
dates <- NULL

zones <- unique (data$TZ)

for(i in seq_along(zones)){
indices <<- which(data$TZ==zones[i])
dates <<- c(dates, as.POSIXlt(data$DATE[indices], format = "%m/%d/%Y %H:%M:%S", tz = zones[i]))
}

Now, although there are ~1 million observations, it seems to do the job in 3-4 seconds. Only, that it "seems" to. The result I get is a list with NAs.

It does work when I try to convert a group individually, i.e., store result for every iteration in a different variable, or not run a for loop and do each iteration manually, storing each result in a different variable and, in the end, concatenate it all using c() function.

What am I doing wrong?

0xts
  • 2,411
  • 8
  • 16

1 Answers1

0

For anyone who might stumble here, I figured it.

  1. You can't use c() on a POSIXlt object as it'll convert it into local timezone. (Not the reason for NAs but it's helpful.)
  2. POSIXlt is stored as a list of different variables like mday, zone etc, due to which it's value cannot be used in a data frame element. Instead of POSIXlt, we can use POSIXct as that's internally represented as seconds from 1970-01-01.
  3. Since we'll be replacing a data frame column with dates it's easier to do so with converting it into a tibble using dplyr::as_tibble() and then use dplyr::rbind() to combine the different results.
  4. The reason of NAs being introduced is the lexical scoping in R. I used dates <<- c(dates, as.POSIXlt(data$DATE[indices], format = "%m/%d/%Y %H:%M:%S", tz = zones[i])) due to which, the value of i in zones[i] was NA or unknown.

So, the correct working code is -

dates <- NULL

for (i in seq_along(zones)) {
    indices <- which(data$TZ==zones[i])
    dts <- as.POSIXct(data$BGN_DATE[indices], format = "%m/%d/%Y %H%M", tz = zones[i])
    dates <<- rbind(dates,as_tibble(dts))
}

#Further, to combine the dates into data frame
data <- arrange(data, TZ) %>% mutate(DATEandTime = dates$value) %>% select(-c("DATE","TZ"))

Dharman
  • 30,962
  • 25
  • 85
  • 135
0xts
  • 2,411
  • 8
  • 16