6

I have a data with following columns:

    Date         CID      FID        rank  
    31/01/17     abc0001  rx180x01    0
    31/01/17     abc0001  rx180x02    0
    31/01/17     abc0001  rx180x03    2
    28/02/17     abc0001  rx180x32    1
    28/02/17     abc0001  rx180x31    0

Each CID has some unique FIDs mapped to it and has some rank for these FIDs. I need to create 2 new columns finalrank and finalFID.

Final rank is the maximum rank of each CID for each date. ie For CID abc0001 ,in the date 31/01/17, the finalrank would be 2. This logic would be applied across all unique CID-date combination.

FinalFID is that FID which has the particular finalrank value for a CID in a date.For eg, in the date 31/01/17, for CID abc0001 , the FID with maximum rank is rx180x03

So my result should look like this:

    Date         CID      FID        rank  finalrank finalFID
    31/01/17     abc0001  rx180x01    0    2         rx180x03
    31/01/17     abc0001  rx180x02    0    2         rx180x03
    31/01/17     abc0001  rx180x03    2    2         rx180x03
    28/02/17     abc0001  rx180x32    1    1         rx180x32
    28/02/17     abc0001  rx180x31    0    1         rx180x32

I have written a code which seems to be elegant and fine but it is not working for very large data. The one I'm dealing with has like 5,000,000. When I run it in R it is showing running an nothing else for such huge dataframes.

       data          = dplyr::group_by(data,CID,date)
       data          = arrange(data,CID,date)
       data          = dplyr::mutate(data, finalrank =max(rank))

       # Id FID of maximum rank
       data   = dplyr::mutate(data, match = FID[match(finalrank ,rank)])
Dom Jo
  • 320
  • 1
  • 3
  • 13

2 Answers2

8
dat%>%
   group_by(Date,CID)%>%
   mutate(finalrank=max(rank),finalFID=FID[which.max(rank)])
# A tibble: 5 x 6
# Groups:   Date, CID [2]
  Date     CID     FID       rank finalrank finalFID
  <fct>    <fct>   <fct>    <int>     <dbl> <fct>   
1 31/01/17 abc0001 rx180x01     0         2 rx180x03
2 31/01/17 abc0001 rx180x02     0         2 rx180x03
3 31/01/17 abc0001 rx180x03     2         2 rx180x03
4 28/02/17 abc0001 rx180x32     1         1 rx180x32
5 28/02/17 abc0001 rx180x31     0         1 rx180x32

using data.table

library(data.table)
setDT(dat)[,c("finalrank","finalFID"):=.(max(rank),FID[which.max(rank)]),by=.(Date,CID)]
dat 
       Date     CID      FID rank finalrank finalFID
1: 31/01/17 abc0001 rx180x01    0         2 rx180x03
2: 31/01/17 abc0001 rx180x02    0         2 rx180x03
3: 31/01/17 abc0001 rx180x03    2         2 rx180x03
4: 28/02/17 abc0001 rx180x32    1         1 rx180x32
5: 28/02/17 abc0001 rx180x31    0         1 rx180x32
Onyambu
  • 67,392
  • 3
  • 24
  • 53
  • Thanks. But its still running bro – Dom Jo Jul 04 '18 at 20:37
  • Then you would need to use `data.table` instead of `tidyverse`. It seems you have a big dataset – Onyambu Jul 04 '18 at 20:39
  • @DomJo Did you say this example is still running or your whole dataset is still running? – Onyambu Jul 04 '18 at 20:45
  • @DomJo for the datatable solution.. something like: `library(data.table);setDT(dat)[,c("finalrank","finalFID"):=.(max(rank),FID[which.max(rank)]),by=.(Date,CID)];dat` – Onyambu Jul 04 '18 at 20:45
  • Hi. This seems to work very fast. Why didn't you write this as the answer in the first place? – Dom Jo Jul 04 '18 at 21:01
  • @DomJo I did not know you would want a data.table solution since you used `dplyr` which is `tidyverse` – Onyambu Jul 04 '18 at 21:07
  • I just needed a faster solution. That meets the requirement. What is tidyverse. ? – Dom Jo Jul 04 '18 at 21:11
  • Also I'm not familiar with that syntax you used in setDT function. What exactly does setDT do? – Dom Jo Jul 04 '18 at 21:11
  • @DomJo `setDT` sets/changes a dataframe into a datatable. It basically tells r to call by reference rather than make a copy of the dataframe. You can learn more online – Onyambu Jul 04 '18 at 21:17
  • Just modify the answer to show the setDT solution no? It still shows the older one. Can you share some link to setDT? What is this ":=." exactly? – Dom Jo Jul 04 '18 at 21:20
  • @DomJo on your Rconsole type `?data.table` or you can click [here](https://www.rdocumentation.org/packages/data.table/versions/1.11.4) and click on the function you want to know about including `:=` – Onyambu Jul 04 '18 at 21:23
3

An approach using dplyr which is expected to be faster is by arranging data on Date, CID, rank and then taking last. The solution will be as:

library(dplyr)

df %>% mutate(Date = as.POSIXct(Date,format = "%d/%m/%y")) %>%
  group_by(Date, CID) %>%
  arrange(Date, CID, rank) %>%
  mutate(finalrank = last(rank), finalFID=last(FID)) %>%
  as.data.frame()

#         Date     CID      FID rank finalrank finalFID
# 1 2017-01-31 abc0001 rx180x01    0         2 rx180x03
# 2 2017-01-31 abc0001 rx180x02    0         2 rx180x03
# 3 2017-01-31 abc0001 rx180x03    2         2 rx180x03
# 4 2017-02-28 abc0001 rx180x31    0         1 rx180x32
# 5 2017-02-28 abc0001 rx180x32    1         1 rx180x32

Edited: To clarify doubt by @Onyambu that if last row doesn't have max value for rank then also solution works.

df$rank[1] <- 3

df %>% mutate(Date = as.POSIXct(Date,format = "%d/%m/%y")) %>%
  group_by(Date, CID) %>%
  arrange(Date, CID, rank) %>%
  mutate(finalrank = last(rank), finalFID=last(FID)) %>%
  as.data.frame()


#        Date     CID      FID rank finalrank finalFID
# 1 2017-01-31 abc0001 rx180x02    0         3 rx180x01
# 2 2017-01-31 abc0001 rx180x03    2         3 rx180x01
# 3 2017-01-31 abc0001 rx180x01    3         3 rx180x01
# 4 2017-02-28 abc0001 rx180x31    0         1 rx180x32
# 5 2017-02-28 abc0001 rx180x32    1         1 rx180x32

Data:

df <- read.table(text=
"Date         CID      FID        rank  
31/01/17     abc0001  rx180x01    0
31/01/17     abc0001  rx180x02    0
31/01/17     abc0001  rx180x03    2
28/02/17     abc0001  rx180x32    1
28/02/17     abc0001  rx180x31    0",
header = TRUE,  stringsAsFactors = FALSE)
MKR
  • 19,739
  • 4
  • 23
  • 33