
I have data in a CSV file that looks like this:

CUSIP   BuyDate SellDate    BuyAmount   SellAmount  Profit  DaysHolding Over365Days
037833100   12/1/2015   3/1/2017    45  27  -18 456 1
17275R102   1/28/2016   2/21/2017   28  25  -3  390 1
38259P508   10/29/2015  2/18/2017   39  36  -3  478 1
594918104   3/1/2016    3/2/2017    35  40  5   366 1
68389X105   4/14/2016   2/21/2017   47  37  -10 313 0
037833100   12/11/2015  2/19/2017   46  40  -6  436 1
17275R102   1/12/2016   2/24/2017   29  34  5   409 1
38259P508   12/22/2015  2/20/2017   46  39  -7  426 1
594918104   12/19/2015  2/22/2017   26  36  10  431 1
68389X105   2/13/2016   3/2/2017    33  34  1   383 1
037833100   12/9/2015   2/18/2017   32  37  5   437 1
17275R102   2/13/2016   2/27/2017   48  25  -23 380 1
38259P508   11/30/2015  2/23/2017   45  34  -11 451 1
594918104   11/14/2015  2/27/2017   47  28  -19 471 1
68389X105   2/10/2016   2/17/2017   39  38  -1  373 1
037833100   4/7/2016    3/5/2017    44  29  -15 332 0
17275R102   3/3/2016    2/19/2017   26  36  10  353 0
037833100   11/25/2015  2/17/2017   28  40  12  450 1
037833100   1/10/2016   3/6/2017    35  36  1   421 1
037833100   3/4/2016    2/22/2017   45  25  -20 355 0
38259P508   2/10/2016   3/7/2017    42  40  -2  391 1
38259P509   12/5/2015   2/25/2017   31  39  8   448 1
38259P510   4/7/2016    2/27/2017   27  34  7   326 0
38259P511   3/26/2016   2/17/2017   27  39  12  328 0
17275R102   2/11/2016   2/27/2017   29  39  10  382 1
17275R102   11/24/2015  2/18/2017   45  35  -10 452 1
38259P509   3/29/2016   3/7/2017    46  27  -19 343 0
38259P509   4/5/2016    2/23/2017   38  38  0   324 0
17275R102   2/13/2016   2/26/2017   35  31  -4  379 1
594918104   3/10/2016   3/4/2017    29  28  -1  359 0
17275R102   10/30/2015  2/23/2017   40  30  -10 482 1
17275R102   12/15/2015  3/2/2017    25  38  13  443 1
594918104   2/2/2016    2/22/2017   26  32  6   386 1
594918105   3/8/2016    2/20/2017   26  29  3   349 0
594918106   11/21/2015  3/6/2017    44  38  -6  471 1
594918107   3/21/2016   2/20/2017   48  39  -9  336 0
594918108   12/21/2015  3/5/2017    37  28  -9  440 1
594918109   1/16/2016   3/5/2017    35  33  -2  414 1
594918110   2/8/2016    3/2/2017    41  39  -2  388 1

There are millions of rows in this file. I want to sort all trades by CUSIP and then subtotal the results by Profit and Over365Days. Here is an image of what the final results should look like. I just added some colors for effect.

[Image: desired output showing per-CUSIP Profit subtotals, split by Over365Days]

I'm guessing it should be something like this:

# read the CSV file
mydata <- read.csv("AllTrades.csv")

# sort by CUSIP, then Over365Days
# (order() needs the columns qualified with the data frame)
sortdata <- mydata[order(mydata$CUSIP, mydata$Over365Days), ]

# aggregate Profit & Over365Days by CUSIP
finalresults <- aggregate(cbind(Profit, Over365Days) ~ CUSIP, data = sortdata, FUN = sum)

I can easily manage small data sets in Excel, but again, I have millions of rows to deal with. Can someone give me some sample code that can do what I described? Thanks to all.
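For context on the "millions of rows" concern: below is a minimal sketch of a read-and-aggregate flow using data.table, which handles very large CSVs well. The small in-memory table is a stand-in for illustration; on the real file you would use `fread("AllTrades.csv")` instead.

```r
library(data.table)

# On the real file: mydata <- fread("AllTrades.csv")
# (fread is far faster than read.csv on large files)
# A tiny in-memory stand-in with the relevant columns:
mydata <- data.table(
  CUSIP       = c("037833100", "037833100", "17275R102"),
  Profit      = c(-18, -15, -3),
  Over365Days = c(1, 0, 1)
)

# Sum Profit within each CUSIP, split by the Over365Days flag;
# no pre-sorting is required for this
finalresults <- mydata[, .(Profit = sum(Profit)), by = .(CUSIP, Over365Days)]
print(finalresults)
```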

smci
ASH

  • I'm having a little trouble understanding what you're trying to do. So for each `CUSIP` you want to calculate the sum of `Profit` and the sum of `Over365Days`? – bouncyball Mar 09 '17 at 14:53
  • Well, almost. For each CUSIP, calculate sum of Profit for two groups: >=365days and <365days. I think the image illustrates the point. Does that make sense? – ASH Mar 09 '17 at 14:59
  • I am not sure why you need to sort your table. Doesn't something like `aggregate(mydata$Profit, by = list(CUSIP = mydata$CUSIP, Over365 = myData$Over365), sum)` work? – xraynaud Mar 09 '17 at 15:05
  • @xraynaud's comment but in formula form: `aggregate(Profit ~ CUSIP + Over365, data = mydata, sum)` – bouncyball Mar 09 '17 at 15:19
  • There is a typo in the function I posted, but I cannot edit my command. The second myData should be mydata. – xraynaud Mar 09 '17 at 15:21
  • The term for this sort of thing is [tag:aggregate]. And you don't need to presort your dataframe. – smci Mar 09 '17 at 15:36

2 Answers


I had to do some cleaning to reproduce your data. Here is the structure after the cleaning step, stored in a variable named `clean_data`:

> str(clean_data)
'data.frame':   39 obs. of  8 variables:
$ CUSIP      : chr  "037833100" "17275R102" "38259P508" "594918104" ...
$ BuyDate    : chr  "12/1/2015 " "1/28/2016 " "10/29/2015" "3/1/2016  " ...
$ SellDate   : chr  "3/1/2017 " "2/21/2017" "2/18/2017" "3/2/2017 " ...
$ BuyAmount  : num  45 28 39 35 47 46 29 46 26 33 ...
$ SellAmount : num  27 25 36 40 37 40 34 39 36 34 ...
$ Profit     : num  -18 -3 -3 5 -10 -6 5 -7 10 1 ...
$ DaysHolding: num  456 390 478 366 313 436 409 426 431 383 ...
$ Over365Days: num  1 1 1 1 0 1 1 1 1 1 ...

I then replicated the set 10x to give a slightly larger data size, so the output would make more sense, and named it `new_data`:

new_data <- plyr::ldply(1:10, function(i){
  clean_data
})

Since Excel-style formatting doesn't match how we usually access data after manipulation and aggregation, I stored the results in a slightly different fashion than you would in Excel (trust me, I was an analyst for years; it took getting used to, but now I will never go back to pivot tables).

The steps: for each individual CUSIP we 'chunk' the data set into only those records that belong to that identifier. That is, for N unique CUSIP ids, we create N fragmented subsets using the `split` function. We wrap this in `lapply` so the aggregation only has to be written once but is applied to each subset. For each subset we return a list with 1) the original chunked data, 2) the profit sum, and 3) the Over365Days total.

out_split <- lapply(split(new_data, new_data$CUSIP), function(i){
  list(
    data_subset = i, # This is the data unique to the CUSIP id
    profit_calc = sum(i[['Profit']]), # This is the sum of profits
    total = sum(i[['Over365Days']]) # This is the sum of 365 roll
  )
})

Now we can call our data by accessing the CUSIP id in the list that was returned and find whatever we want. For example:

> out_split$`594918106`
$data_subset
        CUSIP    BuyDate  SellDate BuyAmount SellAmount Profit DaysHolding Over365Days
35  594918106 11/21/2015 3/6/2017         44         38     -6         471           1
74  594918106 11/21/2015 3/6/2017         44         38     -6         471           1
113 594918106 11/21/2015 3/6/2017         44         38     -6         471           1
152 594918106 11/21/2015 3/6/2017         44         38     -6         471           1
191 594918106 11/21/2015 3/6/2017         44         38     -6         471           1
230 594918106 11/21/2015 3/6/2017         44         38     -6         471           1
269 594918106 11/21/2015 3/6/2017         44         38     -6         471           1
308 594918106 11/21/2015 3/6/2017         44         38     -6         471           1
347 594918106 11/21/2015 3/6/2017         44         38     -6         471           1
386 594918106 11/21/2015 3/6/2017         44         38     -6         471           1

$profit_calc
[1] -60

$total
[1] 10

Additionally, we can pull just the calculated totals across all ids (since the totals were stored as the 3rd item in each iteration):

> sapply(out_split, `[[`, 3)
    037833100 17275R102 38259P508 38259P509 38259P510 38259P511 594918104 594918105 594918106 594918107 594918108 594918109 594918110 68389X105 
           50        80        40        10         0         0        40         0        10         0        10        10        10        20 

And if we wanted to see the profits:

> sapply(out_split, `[[`, 2)
037833100 17275R102 38259P508 38259P509 38259P510 38259P511 594918104 594918105 594918106 594918107 594918108 594918109 594918110 68389X105 
     -410      -120      -230      -110        70       120        10        30       -60       -90       -90       -20       -20      -100 
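If you'd rather have both aggregates side by side in one data frame, you can fold the list back together. A sketch (the two-element `out_split` below is a stand-in for the full list built above; its values are the totals shown in the outputs):

```r
# Stand-in for the list produced by the split/lapply step above
out_split <- list(
  `037833100` = list(data_subset = NULL, profit_calc = -410, total = 50),
  `17275R102` = list(data_subset = NULL, profit_calc = -120, total = 80)
)

# One row per CUSIP, with both stored aggregates as columns
summary_df <- data.frame(
  CUSIP            = names(out_split),
  Profit           = sapply(out_split, `[[`, "profit_calc"),
  Over365Days      = sapply(out_split, `[[`, "total"),
  row.names        = NULL,
  stringsAsFactors = FALSE
)
print(summary_df)
```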

EDIT

I realized the example output above was for an id that only had one distinct record, so here is another id, and additionally how its pieces can be accessed easily:

> out_split$`037833100`$data_subset$Profit
 [1] -18  -6   5 -15  12   1 -20 -18  -6   5 -15  12   1 -20 -18  -6   5 -15  12   1 -20 -18  -6   5 -15  12   1 -20 -18  -6   5 -15  12   1 -20 -18  -6   5 -15  12
[41]   1 -20 -18  -6   5 -15  12   1 -20 -18  -6   5 -15  12   1 -20 -18  -6   5 -15  12   1 -20 -18  -6   5 -15  12   1 -20

Data set recreation steps for anyone wanting to reproduce:

library(magrittr) # needed for the %>% pipe used below

clean_data <- stringi::stri_split_lines('
CUSIP   BuyDate SellDate    BuyAmount   SellAmount  Profit  DaysHolding Over365Days
037833100   12/1/2015   3/1/2017    45  27  -18 456 1
17275R102   1/28/2016   2/21/2017   28  25  -3  390 1
38259P508   10/29/2015  2/18/2017   39  36  -3  478 1
594918104   3/1/2016    3/2/2017    35  40  5   366 1
68389X105   4/14/2016   2/21/2017   47  37  -10 313 0
037833100   12/11/2015  2/19/2017   46  40  -6  436 1
17275R102   1/12/2016   2/24/2017   29  34  5   409 1
38259P508   12/22/2015  2/20/2017   46  39  -7  426 1
594918104   12/19/2015  2/22/2017   26  36  10  431 1
68389X105   2/13/2016   3/2/2017    33  34  1   383 1
037833100   12/9/2015   2/18/2017   32  37  5   437 1
17275R102   2/13/2016   2/27/2017   48  25  -23 380 1
38259P508   11/30/2015  2/23/2017   45  34  -11 451 1
594918104   11/14/2015  2/27/2017   47  28  -19 471 1
68389X105   2/10/2016   2/17/2017   39  38  -1  373 1
037833100   4/7/2016    3/5/2017    44  29  -15 332 0
17275R102   3/3/2016    2/19/2017   26  36  10  353 0
037833100   11/25/2015  2/17/2017   28  40  12  450 1
037833100   1/10/2016   3/6/2017    35  36  1   421 1
037833100   3/4/2016    2/22/2017   45  25  -20 355 0
38259P508   2/10/2016   3/7/2017    42  40  -2  391 1
38259P509   12/5/2015   2/25/2017   31  39  8   448 1
38259P510   4/7/2016    2/27/2017   27  34  7   326 0
38259P511   3/26/2016   2/17/2017   27  39  12  328 0
17275R102   2/11/2016   2/27/2017   29  39  10  382 1
17275R102   11/24/2015  2/18/2017   45  35  -10 452 1
38259P509   3/29/2016   3/7/2017    46  27  -19 343 0
38259P509   4/5/2016    2/23/2017   38  38  0   324 0
17275R102   2/13/2016   2/26/2017   35  31  -4  379 1
594918104   3/10/2016   3/4/2017    29  28  -1  359 0
17275R102   10/30/2015  2/23/2017   40  30  -10 482 1
17275R102   12/15/2015  3/2/2017    25  38  13  443 1
594918104   2/2/2016    2/22/2017   26  32  6   386 1
594918105   3/8/2016    2/20/2017   26  29  3   349 0
594918106   11/21/2015  3/6/2017    44  38  -6  471 1
594918107   3/21/2016   2/20/2017   48  39  -9  336 0
594918108   12/21/2015  3/5/2017    37  28  -9  440 1
594918109   1/16/2016   3/5/2017    35  33  -2  414 1
594918110   2/8/2016    3/2/2017    41  39  -2  388 1
', omit_empty = TRUE)[[1]] %>% 
  stringi::stri_split_regex("\\s+", simplify = TRUE) %>% (function(x){
    col_names <- x[1,]
    a_data <- data.frame(x[2:nrow(x),], stringsAsFactors = FALSE)
    colnames(a_data) <- col_names
    as.data.frame(Map(function(i){
      .call_col <- sprintf("as.%s",readr::guess_parser(i))
      do.call(.call_col, list(i))
    }, a_data))
  })
Carl Boneri
  • Thanks so much everyone! This is very helpful!! – ASH Mar 09 '17 at 15:29
  • No problem... I spent years married to excel so I know your pains, but once you jump into R it's amazing how much easier life gets... especially when it comes to putting together those pitch-books.... best of luck. – Carl Boneri Mar 09 '17 at 15:31
  • `plyr` is dead since 2014, use `dplyr`! (or `data.table`) – smci Mar 09 '17 at 15:32
  • @smci you might be interested in a similar question and the benchmarking results I had to reproduce yesterday with a similar method vs. dplyr pipes. http://stackoverflow.com/questions/42672455/spread-multiple-columns-with-values-by-one-columne-containing-key/42676236#42676236 . I should note that `dplyr` and `data.table` are at the epicenter of my everyday life... as is `parallel`, but in these type of cases I always come back to the `base` funs... – Carl Boneri Mar 09 '17 at 15:35
  • @CarlBoneri: interesting post. But let's not confuse the new users with several different paradigms. `plyr` doesn't always work, it often fails with out-of-memory on high-cardinality splits. `dplyr` always works (IME), and is guaranteed to be *reasonably* performant. – smci Mar 09 '17 at 15:38
  • I didn't consider the out-of-memory costs... That's a great point @smci . I haven't faced that in a while... but to be fair I think the only call to `plyr` here was in recreating the data set no? `split` is from base package on my system... and i have dplyr and data.table loaded into the workspace. Am I forgetting something? True curiosity. Also I must admit I am not as well versed in `data.table` as I would like to be... but `dplyr` is like a friendlier wife to me at this point. – Carl Boneri Mar 09 '17 at 15:43
  • @CarlBoneri: for a small/toy example, any code will work (here, cardinality is obviously not an issue). But better to teach new users a flow that is known to be performant and scalable (and actively supported), than one that fails with out-of-memory on a split by user-id, zipcode, etc. or split on multiple variables, and whose package was mothballed four years ago. I agree with you that as/when we discover testcases where dplyr underperforms plyr or base, we should highlight them and file as performance-bugs on dplyr. But keep using dplyr. Hadley is super-responsive to bugs (so is Matt Dowle). – smci Mar 09 '17 at 15:47
  • @smci duly noted and appreciated. I'll keep that in mind moving forward. Thanks for the feedback, sir. – Carl Boneri Mar 09 '17 at 15:48

This aggregation is a simple one-liner with dplyr (or data.table), and you totally don't need to presort your dataframe:

require(dplyr)

summaryresults <- mydata %>%
    group_by(CUSIP) %>%
    summarize(Profit = sum(Profit), Over365Days = sum(Over365Days)) %>%
    ungroup()
    # %>% arrange(CUSIP, Over365Days) # ...if you want the summary result ordered by those vars

dplyr is the successor to plyr, and superbly easy to use and intuitive, with verbs like mutate, summarize, filter, select, arrange and others. See the introduction or tutorial: https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html
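As the comments note, the image actually splits each CUSIP's profit by holding period, so the grouping you may want is two-level: group by both CUSIP and Over365Days. A sketch (the small data frame is a stand-in for mydata, using the question's column names):

```r
library(dplyr)

# Stand-in for mydata with the question's column names
mydata <- data.frame(
  CUSIP       = c("037833100", "037833100", "17275R102"),
  Profit      = c(-18, -15, -3),
  Over365Days = c(1, 0, 1),
  stringsAsFactors = FALSE
)

# One row per CUSIP x holding-period bucket
profit_by_holding <- mydata %>%
    group_by(CUSIP, Over365Days) %>%
    summarize(Profit = sum(Profit)) %>%
    ungroup()
print(profit_by_holding)
```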

smci
  • Anyone else getting ? `Error in summarise_impl(.data, dots) : cannot modify grouping variable` – Carl Boneri Mar 09 '17 at 15:47
  • @CarlBoneri: doh! sorry. We only needed to `group_by(CUSIP)`, not `(CUSIP, Over365Days)`. Obviously we couldn't mutate `Over365Days ` inside a `group_by(Over365Days)`. Please confirm it's fixed. – smci Mar 09 '17 at 15:48
  • Non-smirky wink in your direction , buddy. – Carl Boneri Mar 09 '17 at 15:49
  • @CarlBoneri: if the OP had provided reproducible data, I would have been able to test before posting... :-) It's all good. – smci Mar 09 '17 at 15:50
  • @CarlBoneri: sure. Useful if you can add at bottom of your answer the code showing the OP how to produce reproducible data.. – smci Mar 09 '17 at 16:13
  • I added above for ref. – Carl Boneri Mar 09 '17 at 16:48