0

I have a data frame (date, stock, price, size) for which I want to aggregate the number of instances, sum the size and get min and max of price. That part works, however, the final output contains multiple rows of the same ticker. For example, DGAZ appears twice in the cand data frame, and it appears twice in the results_all section. In results_all, I would like one row per ticker.

DGAZ in cand:

18 2017-01-18  DGAZ   3.74  836000
19 2017-01-18  DGAZ   3.76  500000

DGAZ in result_all:

        date symbol print_count tot_shares min_price max_price
1 2017-01-18   DGAZ           2    1336000      3.74      3.76
2 2017-01-18   DGAZ           2    1336000      3.74      3.76

Do you know why that is?

Please note, I will ultimately do this for multiple dates, that is why there are 2 do.call statements.

Reproducible code:

cand <- structure(list(date = structure(c(17184, 17184, 17184, 17184, 
17184, 17184, 17184, 17184, 17184, 17184, 17184, 17184, 17184, 
17184, 17184, 17184, 17184, 17184, 17184), class = "Date"), stock = c("AAPL", 
"ABB", "ABEV", "AMTD", "AXTA", "BNDX", "BWX", "BNDX", "BNPQY", 
"BNPQY", "BPESF", "BTG", "BWX", "CLCD", "CMCSA", "CX", "DANDY", 
"DGAZ", "DGAZ"), price = c(120, 22.41, 5.31, 46, 28.1, 54.06, 
26.23, 54.08, 31.79, 31.79, 1.04, 2.86, 26.28, 46.3, 72.7, 8.28, 
12.72, 3.74, 3.76), size = c(2350000L, 500000L, 500000L, 631400L, 
553525L, 748655L, 1347888L, 711454L, 881744L, 881744L, 745808L, 
700000L, 1296627L, 612347L, 840000L, 500000L, 650000L, 836000L, 
500000L)), .Names = c("date", "stock", "price", "size"), row.names = c(NA, 
19L), class = "data.frame")

result_all <- do.call(rbind, lapply(unique(cand$date), function(ind_date){
  temp <- cand[cand$date == ind_date,]
  result_day <- do.call(rbind, lapply(unique(temp$stock), function(ind_stock){
    temp2 <- temp[temp$stock == ind_stock,]
    print(data.frame(date=temp2$date, symbol=temp2$stock, print_count=nrow(temp2), 
                     tot_shares = sum(temp2$size), min_price=min(temp2$price), 
                     max_price=max(temp2$price)))
  }))
}))

print(result_all)

Thank you.

mks212
  • 901
  • 1
  • 18
  • 40
  • 2
    Without digging too deep, why dont you just replace the last line with `unique(result_all)`? – Tonio Liebrand Feb 11 '17 at 23:01
  • 2
    This seems like a lot of work for something that isn't that complicated. dplyr or data.table will make your life a lot simpler, e.g. `library(dplyr); cand %>% group_by(stock, date) %>% summarise(count = n(), tot_shares = sum(size), min_price = min(price), max_price = max(price))` Or use `aggregate`, but summarizing multiple variables still gets complicated. – alistaire Feb 11 '17 at 23:04

0 Answers0