2

I have a data.table with the following three variables

  1. receptionist = id of the reception person
  2. week.no = this week's number (1 means first week)
  3. absent.thisweek = number of days absent this week

Note that we only have data for the weeks receptionist has come for work, if a week is missing means that person was absent that week

For instance, we have two receptionsit 1 and 2 with this work pattern:

dt <- data.table(receptionist =  c(1,1,1,2,2,2,2), week.no = c(1,3,4,5,8,10,11), absent.thisweek = c(1,2,3,4,5,6,6))
>dt receptionist week.no absent.thisweek
1:            1       1               1
2:            1       3               2
3:            1       4               3
4:            2       5               4
5:            2       8               5
6:            2      10               6
7:            2      11               6

Step 1: I need to find out the number of days a receptionist is absent next week, for this I shift the absent.thisweek by one (week) if that information is available in the data, note for the weeks recepetionist has not come, this value will be NA, my dataset has 1 million rows and this is the most inefficient part.

dt[order(receptionist, week.no), absent.nextweek := dt$absent.thisweek[dt$receptionist==receptionist & dt$week.no==week.no+1], by = .(receptionist, week.no)]
>dt receptionist week.no absent.thisweek absent.nextweek
1:            1       1               1              NA
2:            1       3               2               3
3:            1       4               3              NA
4:            2       5               4              NA
5:            2       8               5              NA
6:            2      10               6               6
7:            2      11               6              NA

Step 2: If data is missing it has to be 7 days of absent. For the last week, we will have the absent.nextweek = NA

dt[is.na(absent.nextweek) & week.no != max(week.no, na.rm=T), absent.nextweek:=7]
   
>dt receptionist week.no absent.thisweek absent.nextweek
1:            1       1               1               7
2:            1       3               2               3
3:            1       4               3               7
4:            2       5               4               7
5:            2       8               5               7
6:            2      10               6               6
7:            2      11               6              NA

I am facing two problems: -In Step 1, it is very inefficient and taking long time (about an hr) to run -In Step 2, row 3 should have NA in the last column, which is not the case

Any suggestion to improve the efficiency (preferably data.table) will be very helpful as well as pointing out the mistake in step 2.

The correct answer would be:

>dt receptionist week.no absent.thisweek absent.nextweek
1:            1       1               1               7
2:            1       3               2               3
3:            1       4               3               NA
4:            2       5               4               7
5:            2       8               5               7
6:            2      10               6               6
7:            2      11               6              NA
vivek
  • 301
  • 3
  • 13

1 Answers1

2

Thinking about the problem from scratch, maybe you could do this directly with fifelse() and shift()?

# We assume data is ordered by week. Otherwise you can run 
# setorder(dt, receptionist, week.no)

dt[, 
   absent.nextweek := 
     fifelse(week.no+1L == shift(week.no, -1L),  shift(absent.thisweek, -1L), 7),
   by = receptionist]


#    receptionist week.no absent.thisweek absent.nextweek
# 1:            1       1               1               7
# 2:            1       3               2               3
# 3:            1       4               3              NA
# 4:            2       5               4               7
# 5:            2       8               5               7
# 6:            2      10               6               6
# 7:            2      11               6              NA
s_baldur
  • 29,441
  • 4
  • 36
  • 69
  • I checked the solution it works perfectly as well as efficiently. One more question that I wrote -- why is the max() not working and I am getting 7 in the column 3 using my code? @sindri_baldur – vivek Dec 03 '20 at 03:24
  • I seem it is because you are referring to the max overall not by group. – s_baldur Dec 03 '20 at 08:01
  • Thank you. I have accepted the answer. Btw, writing the code by = .(receptionist) does not change the outcome with max. @sindri_baldur – vivek Dec 04 '20 at 02:51