1

I am fairly new to R and was trying to determine if I could use R to help fill in missing values in a number of large data sets I am working with. I'll try to explain it to the best of my abilities.

The data set I am working with has time data in the format HH:MM:SS. It is irregular in that no two data sets have the same time stamps, and the time stamp entries are recording an event over a 2 hour period. It looks something like this.

1. Date,         Time_hms, Event
2. 9/22/2015,    00:00:00, 5
3. 9/22/2015,    00:00:24, 1
4. 9/22/2015,    00:00:24, 4
5. 9/22/2015,    00:01:42, 7
6. 9/22/2015,    00:02:04, 3
8. 9/22/2015,    00:02:35, 2
9. 9/22/2015,    00:03:02, 4

What I would like to do is add in missing rows at intervals of one minute, so that it looks like this.

1. Date,         Time_hms, Event
2. 9/22/2015,    00:00:00, 5
3. 9/22/2015,    00:00:24, 1
4. 9/22/2015,    00:00:24, 4
5. 9/22/2015,    00:01:00, 4     # Summary row to be inserted
6. 9/22/2015,    00:01:42, 7
7. 9/22/2015,    00:02:00, 7     # Summary row to be inserted
8. 9/22/2015,    00:02:04, 3
9. 9/22/2015,    00:02:35, 2
10. 9/22/2015,   00:03:00, 2     # Summary row to be inserted
11. 9/22/2015,   00:03:02, 4

If possible, I would like the the rows to be filled in with the event that occurred during that range.

In trying to solve this, I found and tried this approach Insert rows for missing dates/times. I tried using POSIXct but was unsuccessful because of the date format. I have also considered padr and fill_by_function, but am uncertain if that is the correct approach. Is there a method to work strictly with HH:MM:SS format?

Again, I am only just learning R and am unsure of how to approach this. Any help or suggestions would be greatly appreciated!

Edit: Hopefully I did this correctly. Thank you again!

dput(elements)
structure(list(var1 = c("Date", "9/22/2015", "9/22/2015", "9/22/2015", 
"9/22/2015", "9/22/2015", "9/22/2015", "9/22/2015"), var2 = c("Time_hms", 
"00:00:00", "00:00:24", "00:00:24", "00:01:42", "00:02:04", "00:02:35", 
"00:03:02"), var3 = c("Event", "5", "1", "4", "7", "3", "2", 
"4")), .Names = c("var1", "var2", "var3"), row.names = c(NA, 
8L), class = "data.frame")
s_meli_m
  • 13
  • 4
  • 1
    Can you share your sample data via `dput()`? That makes it copy/pasteable and preserves the data frame structure and column classes. – Gregor Thomas Jan 10 '18 at 23:52
  • as for the strategy (unless the dataset almost exceeds your RAM limits): first, create a new dataframe with only one column populated by your "target" time-series and second, join the two dataframes on the "time" column – Pasqui Jan 11 '18 at 00:21
  • Would a potential solution be to combine date and time to one column, and then use POSIXct? However, I would have to separate date and time afterwards. – s_meli_m Jan 11 '18 at 02:39

1 Answers1

0

Okay, your dput data had the headers in the first column. So we'll address that issue first:

names(elements) = elements[1, ]
elements = elements[-1, ]
elements$Event = as.numeric(elements$Event)

Now we'll convert the dates and times to a POSIX datetime (in a separate vector), then we'll take the full range of the data and round it to the nearest minute. We can then create a sequence of every minute from the first to the last (and omit the date so it's the same format):

time_range = round(range(strptime(paste(elements$Date, elements$Time_hms), format = "%m/%d/%Y %H:%M:%S")), units = "mins")
each_minute = seq(from = time_range[1], to = time_range[2], by = "min")
each_minute = format(each_minute, "%H:%M:%S")

Finally, we merge these results back into the original data, order the rows, and use zoo::na.locf to fill in the missing values with the previous observation.

result = merge(elements, data.frame(Time_hms = each_minute), all = T)
result = result[order(result$Time_hms), ]
result$Date = zoo::na.locf(result$Date)
result$Event = zoo::na.locf(result$Event)
result
#    Time_hms      Date Event
# 1  00:00:00 9/22/2015     5
# 2  00:00:24 9/22/2015     1
# 3  00:00:24 9/22/2015     4
# 4  00:01:00 9/22/2015     4
# 5  00:01:42 9/22/2015     7
# 6  00:02:00 9/22/2015     7
# 7  00:02:04 9/22/2015     3
# 8  00:02:35 9/22/2015     2
# 9  00:03:00 9/22/2015     2
# 10 00:03:02 9/22/2015     4

In general, and especially if your data might include different dates, you might find it easier to work with if you just add a new column to your data with the POSIX datetime object. There's not a good class in R for dealing with times without dates (at least not base R) - but you have dates! And there are lots of functions that work well for dealing with dates, like the seq and round I used in this answer.

Gregor Thomas
  • 136,190
  • 20
  • 167
  • 294
  • Hi Gregor, thank you for taking the time to help me with this! If I may ask another question. When testing script > each_minute = seq(from = time_range[1], to = time_range[2], by = "min") I get the error message "Error in seq.int(0, to0 - from, by) : 'to' must be a finite number". I assume I made a mistake/misinterpreted the script, and tried different formats but have not been able to figure it out. Any suggestions? – s_meli_m Jan 15 '18 at 23:40
  • @s_meli_m Probably you have missing values. Try adding `na.rm = TRUE` to the `range()` call. – Gregor Thomas Jan 16 '18 at 01:47