0

I am trying to complete a dataframe which has gaps. The data is structured hourly and there are some jumps between hours. This is a sample of the dataframe:

sms <- read.csv("hoursms2.csv", sep = ";", header = FALSE)
sms$date <- strptime(sms$date,"%d/%m/%Y %H:%M")

date                      count

2015-12-17 22:00:00        5
2015-12-18 09:00:00        7
2015-12-18 10:00:00        3
2015-12-18 12:00:00        6

What I've tried is to create another dataframe which ranges from the beginning to the end of the first dataframe, but with all the hours and with a unique identifier to keep track of it:

hh <- seq(min(sms$date), max(sms$date), by="hour")
hh <- as.data.frame(hh)
hh <- cbind(hh,seq(1:3292))   #length of the hh data frame

Which gives the following:

date                     id
2015-12-04 12:00:00      1
2015-12-04 13:00:00      2
2015-12-04 14:00:00      3 
2015-12-04 15:00:00      4
2015-12-04 16:00:00      5

So far I have tried these three things I found on StackOverflow, here ,but they do not work in this case:

res <- merge(sms,hh,by = "date", all = TRUE)
res2 <- res[duplicated(res$date), ]
res3 <- res[!duplicated(res), ]

The first commmand, res, what does is to duplicate all the cells, it doesn't merge it by date, so it gives the following:

2015-12-23 19:00:00      5
2015-12-23 19:00:00      NA

The other screw up everything. I think it has something to do with the date format which is not correctly associated, but not sure. Any suggestion of how to complete by dates my data frame please?

Community
  • 1
  • 1
adrian1121
  • 904
  • 2
  • 9
  • 21
  • One of the ways is to use `left_join()` from `dplyr` package. You do `left_join(hh, sms)` and that should give you what you want, I think. – Gopala May 13 '16 at 12:52
  • @Gopala Just tried it and gives the following error: Joining by: "date" Error: cannot join on columns 'date' x 'date': Can't join on 'date' x 'date' because of incompatible types (POSIXct, POSIXt / POSIXlt, POSIXt) – adrian1121 May 13 '16 at 12:55

1 Answers1

1

Using library dplyr, do the following making sure you have proper types in both data frames.

Here is the way I set up the two data frames:

str(sms)
'data.frame':   4 obs. of  2 variables:
 $ date : POSIXct, format: "2015-12-17 22:00:00" "2015-12-18 09:00:00" "2015-12-18 10:00:00" ...
 $ count: int  5 7 3 6
str(hh)
'data.frame':   15 obs. of  2 variables:
 $ date: POSIXct, format: "2015-12-17 22:00:00" "2015-12-17 23:00:00" "2015-12-18 00:00:00" ...
 $ id  : int  1 2 3 4 5 6 7 8 9 10 ...

Here is the left join code:

library(dplyr)
left_join(hh, sms)
Joining by: "date"
                  date id count
1  2015-12-17 22:00:00  1     5
2  2015-12-17 23:00:00  2    NA
3  2015-12-18 00:00:00  3    NA
4  2015-12-18 01:00:00  4    NA
5  2015-12-18 02:00:00  5    NA
6  2015-12-18 03:00:00  6    NA
7  2015-12-18 04:00:00  7    NA
8  2015-12-18 05:00:00  8    NA
9  2015-12-18 06:00:00  9    NA
10 2015-12-18 07:00:00 10    NA
11 2015-12-18 08:00:00 11    NA
12 2015-12-18 09:00:00 12     7
13 2015-12-18 10:00:00 13     3
14 2015-12-18 11:00:00 14    NA
15 2015-12-18 12:00:00 15     6

Note: I had to convert column name in hh to date as the code provided in question does not do that. Alternatively, you can specify by parameter in the left_join call.

Gopala
  • 10,363
  • 7
  • 45
  • 77