
I have time-series data in 1-minute increments. I have written code for this, but with the amount of data involved (over 1M rows), looping through each line takes far too long. The data looks something like this:

t0 = as.POSIXlt("2018-12-23 00:01:00")
t0 = t0+seq(60,60*10,60)
p1 = seq(5,5*10,5)
p2 = seq(7,7*10,7)
m0 = cbind(p1,p2)
rownames(m0) = as.character(t0)

Where it looks something like:

> head(m0)
                    p1 p2
2018-12-23 00:02:00  5  7
2018-12-23 00:03:00 10 14
2018-12-23 00:04:00 15 21
2018-12-23 00:05:00 20 28
2018-12-23 00:06:00 25 35
2018-12-23 00:07:00 30 42

I want to turn this data into 5-second increments by adding 11 rows (covering the preceding 55 seconds) before each minute, with each new row carrying the value of the minute it precedes. So it would look something like:

> new0
                    p1 p2
2018-12-23 00:01:05  5  7
2018-12-23 00:01:10  5  7
2018-12-23 00:01:15  5  7
2018-12-23 00:01:20  5  7
2018-12-23 00:01:25  5  7
2018-12-23 00:01:30  5  7
2018-12-23 00:01:35  5  7
2018-12-23 00:01:40  5  7
2018-12-23 00:01:45  5  7
2018-12-23 00:01:50  5  7
2018-12-23 00:01:55  5  7
2018-12-23 00:02:00  5  7
2018-12-23 00:02:05 10 14
2018-12-23 00:02:10 10 14
2018-12-23 00:02:15 10 14
2018-12-23 00:02:20 10 14
2018-12-23 00:02:25 10 14
2018-12-23 00:02:30 10 14
2018-12-23 00:02:35 10 14
2018-12-23 00:02:40 10 14
2018-12-23 00:02:45 10 14
2018-12-23 00:02:50 10 14
2018-12-23 00:02:55 10 14
2018-12-23 00:03:00 10 14

I am hoping to find some way to do it without using a loop, utilizing the efficient code in packages such as data.table, which I am not too familiar with.

I tried using the ave function from base R, but it is not fast enough.
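For context, a row-by-row loop along the lines the question describes might look like the sketch below (a hypothetical reconstruction, not the asker's actual code). Growing the result with `rbind` inside the loop reallocates it on every iteration, which is what makes this approach scale so badly at 1M rows:

```r
# setup from the question
t0 <- as.POSIXlt("2018-12-23 00:01:00") + seq(60, 600, 60)
m0 <- cbind(p1 = seq(5, 50, 5), p2 = seq(7, 70, 7))
rownames(m0) <- as.character(t0)

# hypothetical sketch of the slow looping approach
new0 <- NULL
for (i in seq_len(nrow(m0))) {
  tm <- as.POSIXct(rownames(m0)[i])
  block <- m0[rep(i, 12), , drop = FALSE]               # 12 copies of row i
  rownames(block) <- as.character(tm - seq(55, 0, -5))  # 55s before, up to the minute
  new0 <- rbind(new0, block)                            # reallocates on every pass
}
```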

Jaap
jay2020
  • Apart from the question of how to achieve this: how about efficiency? You say looping through each line takes too long, as you already have more than 1M rows. Adding 11 rows per row results in some 12M rows. Wouldn't it be wiser to have the code "using" this data "fake" the additional 11 rows when actually using it, or to alter that code altogether? I mean, by adding the 11 lines you don't *create* any data. You just bloat it. – Xenonite Jan 22 '19 at 10:47
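Xenonite's suggestion can be sketched as a lookup that resolves arbitrary 5-second timestamps against the original 1-minute rows on demand, so the expanded 12M rows never have to exist (the helper name `lookup5s` is made up for illustration):

```r
# setup from the question
t0 <- as.POSIXlt("2018-12-23 00:01:00") + seq(60, 600, 60)
m0 <- cbind(p1 = seq(5, 50, 5), p2 = seq(7, 70, 7))
rownames(m0) <- as.character(t0)

# hypothetical helper: for each query time, return the row of the
# next-or-equal minute (next observation carried backward), without
# ever materializing the 5-second grid
lookup5s <- function(query, m0) {
  key <- as.POSIXct(rownames(m0))
  i <- findInterval(query, key)            # index of last key <= query
  i <- ifelse(query %in% key, i, i + 1L)   # otherwise roll to the next key
  m0[pmin(i, nrow(m0)), , drop = FALSE]    # queries past the last key reuse it
}
```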

5 Answers


Since you tagged this with data.table:

library(data.table)
dt = as.data.table(m0, keep.rownames = TRUE)[, rn := as.POSIXct(rn)]

# rolling join: each 5-second target time rolls backward to the next
# observed minute (roll = -Inf), then sort by time
dt[.(rep(rn, each = 12) - seq(0, 55, 5)), on = 'rn', roll = -Inf][order(rn)]
#                      rn p1 p2
#  1: 2018-12-23 00:01:05  5  7
#  2: 2018-12-23 00:01:10  5  7
#  3: 2018-12-23 00:01:15  5  7
#  4: 2018-12-23 00:01:20  5  7
#  5: 2018-12-23 00:01:25  5  7
# ---                          
#116: 2018-12-23 00:10:40 50 70
#117: 2018-12-23 00:10:45 50 70
#118: 2018-12-23 00:10:50 50 70
#119: 2018-12-23 00:10:55 50 70
#120: 2018-12-23 00:11:00 50 70
eddi

Here's one way to do it in base R. First, convert your data to a data frame with an explicit column for the time stamps:

m0 <- as.data.frame(m0)
m0$t <- t0

   p1 p2                   t
1   5  7 2018-12-23 00:02:00
2  10 14 2018-12-23 00:03:00
3  15 21 2018-12-23 00:04:00
4  20 28 2018-12-23 00:05:00
5  25 35 2018-12-23 00:06:00
6  30 42 2018-12-23 00:07:00
7  35 49 2018-12-23 00:08:00
8  40 56 2018-12-23 00:09:00
9  45 63 2018-12-23 00:10:00
10 50 70 2018-12-23 00:11:00

Then merge this data frame with a one-column data frame of time offsets (0 to 55 seconds); since the two frames share no columns, `merge` produces their Cartesian product, i.e. 12 rows per original row:

m1 <- merge(m0, data.frame(diff = seq(0, 55, 5)))

And finally, subtract the offset column `diff` from the timestamp column to create the new 5-second timestamps:

m1$t2 <- with(m1, t - diff)

> m1[c(1, 20, 40), ]

   p1 p2                   t diff                  t2
1   5  7 2018-12-23 00:02:00    0 2018-12-23 00:02:00
20 50 70 2018-12-23 00:11:00    5 2018-12-23 00:10:55
40 50 70 2018-12-23 00:11:00   15 2018-12-23 00:10:45
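As IceCreamToucan notes in the comments, this result differs from the desired output only in ordering and column layout; a final sort on the new timestamps (a small addition, not part of the original answer) reproduces it exactly:

```r
# answer code from above
t0 <- as.POSIXlt("2018-12-23 00:01:00") + seq(60, 600, 60)
m0 <- data.frame(p1 = seq(5, 50, 5), p2 = seq(7, 70, 7), t = t0)
m1 <- merge(m0, data.frame(diff = seq(0, 55, 5)))
m1$t2 <- with(m1, t - diff)

# order by the new 5-second timestamps, keep only the value columns
ord  <- order(m1$t2)
new0 <- m1[ord, c("p1", "p2")]
rownames(new0) <- as.character(m1$t2[ord])
```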
jdobres
  • Sorry, but it does not produce the desired results. Thank you for your input, though. – jay2020 Dec 24 '18 at 17:44
  • @jay2020 This is the same result as the other one except for column names, ordering, and assigning a local time zone. It also fixes the issue you mention in your comments on the other one. – IceCreamToucan Dec 24 '18 at 18:03

A combination of lubridate, padr and tidyr will get you there. I use lubridate to format the dates so they play nicely with padr. padr adds the missing date-time values to the data frame, and tidyr's fill function then fills the empty values. Note that by default padr stops above 1 million rows for memory protection, but you can raise this limit with pad's break_above argument.

library(lubridate)
library(padr)
library(tidyr)

df1 <- data.frame(t0 = ymd_hms(t0), p1, p2)  # name the time column explicitly
df1 <- pad(df1, interval = "5 secs", start_val = lubridate::ymd_hms("2018-12-23 00:01:05"))
df1 <- fill(df1, p1, p2, .direction = "up")

head(df1, 15)
                    t0 p1 p2
1  2018-12-23 00:01:05  5  7
2  2018-12-23 00:01:10  5  7
3  2018-12-23 00:01:15  5  7
4  2018-12-23 00:01:20  5  7
5  2018-12-23 00:01:25  5  7
6  2018-12-23 00:01:30  5  7
7  2018-12-23 00:01:35  5  7
8  2018-12-23 00:01:40  5  7
9  2018-12-23 00:01:45  5  7
10 2018-12-23 00:01:50  5  7
11 2018-12-23 00:01:55  5  7
12 2018-12-23 00:02:00  5  7
13 2018-12-23 00:02:05 10 14
14 2018-12-23 00:02:10 10 14
15 2018-12-23 00:02:15 10 14
tjebo
phiver
  • Thank you so much. This is what I was looking for. However, there are missing time stamps in the original data that I don't wish to fill in. For example, think of it as 2018-12-23 00:02:00, 2018-12-23 00:03:00, 2018-12-23 00:05:00, where the 00:04 minute is missing. It is missing for a reason. Is there a way around this problem? – jay2020 Dec 24 '18 at 17:51
  • So instead of generating all 5-second intervals from the first time point with the pad function, make eleven 5-second intervals above each data point I have? – jay2020 Dec 24 '18 at 17:52
  • @jay2020, in that case the merges from jdobres or the solution from eddi should work, as they don't join on data that is not there. padr fills in all missing data. You could delete the rows that should stay missing afterwards, but that might defeat the purpose of trying to speed things up. Considering speed, I would say eddi's answer is the best. – phiver Dec 24 '18 at 18:31

A base R way:

m0 <- as.data.frame(m0)
# for each original minute, build 12 times: the minute itself plus the
# eleven 5-second steps before it (descending)
time <- lapply(as.POSIXct(rownames(m0)), seq, by = "-5 sec", length.out = 12)
m1 <- cbind(TIME = Reduce(c, time), m0[rep(seq_len(nrow(m0)), each = 12), ])
row.names(m1) <- NULL
head(m1)

#                  TIME p1 p2
# 1 2018-12-23 00:02:00  5  7
# 2 2018-12-23 00:01:55  5  7
# 3 2018-12-23 00:01:50  5  7
# 4 2018-12-23 00:01:45  5  7
# 5 2018-12-23 00:01:40  5  7
# 6 2018-12-23 00:01:35  5  7

Note: TIME in this output runs in descending order within each 12-row block.
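If ascending order is needed, one extra sort on TIME (not part of the original answer) restores it:

```r
# setup and answer code from above
t0 <- as.POSIXlt("2018-12-23 00:01:00") + seq(60, 600, 60)
m0 <- data.frame(p1 = seq(5, 50, 5), p2 = seq(7, 70, 7),
                 row.names = as.character(t0))
time <- lapply(as.POSIXct(rownames(m0)), seq, by = "-5 sec", length.out = 12)
m1 <- cbind(TIME = Reduce(c, time), m0[rep(seq_len(nrow(m0)), each = 12), ])

# sort by TIME to get ascending 5-second increments
m1 <- m1[order(m1$TIME), ]
row.names(m1) <- NULL
```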

Darren Tsai

Here's a general xts solution that should work for parameters other than the ones specified in the question.

library(xts)

# convert m0 to xts
x0 <- as.xts(m0)

# create empty xts object with observations at all time points you want
nobs <- 11
nsec <- 5
y0 <- xts(, index(x0) - rep(seq_len(nobs) * nsec, each = nrow(x0)))

# merge data with desired index observations
new0 <- merge(x0, y0)
# carry the current value backward
new0 <- na.locf(new0, fromLast = TRUE)

head(new0, 20)
#                     p1 p2
# 2018-12-23 00:01:05  5  7
# 2018-12-23 00:01:10  5  7
# 2018-12-23 00:01:15  5  7
# 2018-12-23 00:01:20  5  7
# 2018-12-23 00:01:25  5  7
# 2018-12-23 00:01:30  5  7
# 2018-12-23 00:01:35  5  7
# 2018-12-23 00:01:40  5  7
# 2018-12-23 00:01:45  5  7
# 2018-12-23 00:01:50  5  7
# 2018-12-23 00:01:55  5  7
# 2018-12-23 00:02:00  5  7
# 2018-12-23 00:02:05 10 14
# 2018-12-23 00:02:10 10 14
# 2018-12-23 00:02:15 10 14
# 2018-12-23 00:02:20 10 14
# 2018-12-23 00:02:25 10 14
# 2018-12-23 00:02:30 10 14
# 2018-12-23 00:02:35 10 14
# 2018-12-23 00:02:40 10 14
Joshua Ulrich