in R sum the number of rows per factor within another factor, and return 0 when 'specific' data is missing

Question

I'm stuck with a potential issue and I hope you could help me out:)

For example i have the following data table that displays multiple stores and every time a visitor entered the store, the time and date is recorded. This implies that every row/line is 1 visitor that entered the one of the stores.

data <- structure(list(store.ID = c("1", "1", "1", "1", "1", 
"2", "2", "2", "2", "2", "3", "3", "3", 
"3", "3", "4", "4", "4", "4", "4"), Time = structure(c(6L, 
7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 1L, 2L, 3L, 4L, 5L, 
16L, 17L, 18L, 19L, 20L), .Label = c("  12:09:19", "  12:09:25", 
"  13:09:30", "  13:09:35", "  14:09:40", " 12:00:03", " 12:00:09", 
" 12:00:14", " 14:00:25", " 16:00:32", " 12:27:19", " 13:27:25", 
" 14:27:41", " 14:27:46", " 17:27:59", " 12:46:10", " 12:46:19", " 13:46:29", 
" 14:46:39", " 15:46:50"), class = "factor"), Date = structure(c(1351728000, 
1351728000, 1351728000, 1351728000, 1351728000, 1351814400, 1351814400, 
1351814400, 1351814400, 1351814400, 1351814400, 1351814400, 1351814400, 
1351814400, 1351814400, 1351814400, 1351814400, 1351814400, 1351814400, 
1351814400), class = c("POSIXct", "POSIXt"), tzone = "UTC")), .Names = c("storeID", "Time", "Date"), class = "data.frame", row.names = c(NA, 
-20L))

[EDIT] The stores are open 24/7. Now I would like is to have a solution / way that assigns each visit / row to one of the 24 hour periods in a day (i.e., 09.00-10.00 being 1, 10.00-11.00 being 2, etc). Then I would like to have the number of visitors per hour period over two consecutive days. I would like to be able to separate this for certain fixed factors, e.g., storeID and City (not shown in this example). Also, if no visitors enter the store, I would like the data file to show that within this time interval there was no visitor, which should in this case return 0). [EDIT]

Note that my data file is huge, having over 700k rows.

I hope I made my issue clear.

MvZB

Hi Arun, I changed it in my original post. Of course it does not necessarily have to be a loop, although this is the only way that I'm familiar with in R to solve these kind of things, with a logical function. Other solutions are more than welcome:) — FlyingDutch, Jan 26 '13 at 01:16
I edited my original post so that it more resembles my data set and what I would like R to accomplish. Help is much appreciated, as I'm lost in the forest. Thanks for the three constructive answers, but independent of what method I use to try to achieve my goal, I get stuck with the fact that I cannot discern between different fixed factors (See edited text in original post). And does this zoo package still work when I have two recordings of visitors entering two different stores at exactly the same time? — FlyingDutch, Jan 27 '13 at 20:29

score 2 · Accepted Answer · edited May 23 '17 at 11:58

First method: Using the zoo package as illustrated here very nicely by Dirk. I've explained the code inline. Something like this should do it:

df <- data # I just prefer `df` to `data`
df$storeID <- as.numeric(as.character(df$storeID)) # make sure its numeric
# instantiate the zoo object by providing values corresponding to time
require(zoo)
z <- zoo(as.numeric(as.character(df$storeID)), 
              as.POSIXct(paste(df$Date, df$Time)))
# create output data.frame with all possible timings
open_time <- paste(9:18, "00", "00", sep=":")
open_date <- as.character(unique(df$Date))
out.df <- data.frame(Date = rep(open_date, each=length(open_time)-1), 
                     Start = rep(head(open_time, -1), length(open_date)), 
                     End = rep(tail(open_time, -1), length(open_date)))
# Pointer for matching later
out.df$Pointer <- as.POSIXct(paste(out.df$Date, out.df$Start))
# initialise count to 0
out.df$count <- 0

# aggregate using zoo's magic function!
# the first part contains the storeID and is aggregated by
# the second column which creates hourly interval from the times in z (your data)
# and the third column sums up all values that fall in each hourly interval
agg.out <- aggregate(z, time(z) - as.numeric(time(z)) %% 3600, length)

# once that is done, just match the corresponding times and place them rightly
m.idx <- match( out.df$Pointer, index(agg.out))
out.df$count[!is.na(m.idx)] <- agg.out[m.idx[!is.na(m.idx)]]
out.df <- subset(out.df, select=-c(Pointer))

# and you're done
> out.df
#          Date    Start      End count
# 1  2012-11-01  9:00:00 10:00:00     0
# 2  2012-11-01 10:00:00 11:00:00     0
# 3  2012-11-01 11:00:00 12:00:00     0
# 4  2012-11-01 12:00:00 13:00:00     3
# 5  2012-11-01 13:00:00 14:00:00     0
# 6  2012-11-01 14:00:00 15:00:00     1
# 7  2012-11-01 15:00:00 16:00:00     0
# 8  2012-11-01 16:00:00 17:00:00     1
# 9  2012-11-01 17:00:00 18:00:00     0
# 10 2012-11-02  9:00:00 10:00:00     0
# 11 2012-11-02 10:00:00 11:00:00     0
# 12 2012-11-02 11:00:00 12:00:00     0
# 13 2012-11-02 12:00:00 13:00:00     5
# 14 2012-11-02 13:00:00 14:00:00     4
# 15 2012-11-02 14:00:00 15:00:00     4
# 16 2012-11-02 15:00:00 16:00:00     1
# 17 2012-11-02 16:00:00 17:00:00     0
# 18 2012-11-02 17:00:00 18:00:00     1

Second Method: Without using zoo package drawing idea from Dirk again here. But I use data.table for fast access. Again look to the inline comments for explanation.

require(data.table)
df <- data # I prefer df than data
# create an id column containing only the hours
df$id <- as.numeric(as.POSIXlt(paste(df$Date, df$Time))$hour)
# convert Date to character
df$Date <- as.character(df$Date)

# load package, create input data.table with Date and id as keys
require(data.table)
dt.in <- data.table(df)
setkey(dt.in, "Date", "id")
# get the count of customers / hour / date 
dt.tmp <- dt.in[, .N, by=c("Date", "id")]

# create the output template data.table with Date, Start and End
open_time <- paste(9:18, "00", "00", sep=":")
open_date <- as.character(unique(df$Date))
dt.out <- data.table(Date = rep(open_date, each=length(open_time)-1), 
                     Start = rep(head(open_time, -1), length(open_date)), 
                     End = rep(tail(open_time, -1), length(open_date)))
# create the id again by extracting hour                     
dt.out[, id := as.numeric(as.POSIXlt(paste(Date, Start))$hour)]
setkey(dt.out, "Date", "id")

# merge the two data.tables to get your output
dt.out <- dt.tmp[dt.out, list(Start, End, N)]
dt.out[, id := NULL]

> dt.out
#           Date    Start      End  N
#  1: 2012-11-01  9:00:00 10:00:00 NA
#  2: 2012-11-01 10:00:00 11:00:00 NA
#  3: 2012-11-01 11:00:00 12:00:00 NA
#  4: 2012-11-01 12:00:00 13:00:00  3
#  5: 2012-11-01 13:00:00 14:00:00 NA
#  6: 2012-11-01 14:00:00 15:00:00  1
#  7: 2012-11-01 15:00:00 16:00:00 NA
#  8: 2012-11-01 16:00:00 17:00:00  1
#  9: 2012-11-01 17:00:00 18:00:00 NA
# 10: 2012-11-02  9:00:00 10:00:00 NA
# 11: 2012-11-02 10:00:00 11:00:00 NA
# 12: 2012-11-02 11:00:00 12:00:00 NA
# 13: 2012-11-02 12:00:00 13:00:00  5
# 14: 2012-11-02 13:00:00 14:00:00  4
# 15: 2012-11-02 14:00:00 15:00:00  4
# 16: 2012-11-02 15:00:00 16:00:00  1
# 17: 2012-11-02 16:00:00 17:00:00 NA
# 18: 2012-11-02 17:00:00 18:00:00  1

I do not fully understand what you are doing or how the zoo package works, but it seems that this is what I want. I'll have a closer look at the zoo package to see how it works so I could project your proposed method to my data. Thanks a lot, looks promising! — FlyingDutch, Jan 26 '13 at 02:32
But how can i view this per fixed factor (i.e., store ID or others in my data file) using the zoo package for example? In the end I would like to be able to discern the sum of visitors in each hour bin for multiple factors simultaneously. Help is much appreciated, as I spent hours reading through the zoo package manual but cannot find a way to do it — FlyingDutch, Jan 27 '13 at 20:12

score 2 · Answer 2 · answered Jan 26 '13 at 14:26

Here's a simple solution using lubridate and factors:

library(lubridate)

# Create a single date time variable
dt <- ymd_hms(paste(data$Date, data$Time))
# Extract the day
data$day <- floor_date(dt, "day")
# Extract the hour, converting it into a factor, so we
# get all hours shown
data$hour <- factor(hour(dt), 9:18)

# Count up with table
as.data.frame(table(data[c("day", "hour")]))

in R sum the number of rows per factor within another factor, and return 0 when 'specific' data is missing

2 Answers2