-2

I want to see which category occurs most often each day per participant. There are multiple categories which occur each day and I want a new column which states the category which occured mostly that specific day for a specific participant.

I have a column 'user_id', 'date' and a column 'category' (characters). Which code should I use to add a new column which only states the category which has most occurences for a specific user at a specific day?

dput:

structure(list(user_id = c("10257", "10580", "10280", "10202", "10275","10281"),
date = structure(c(1552521600, 1552003200, 1551139200,1551484800, 1552867200, 1552521600), class = c("POSIXct", "POSIXt"), tzone = "UTC"), 
better_category = c("Email", "Internet_Browser", "Instant_Messaging","News","Background_Process","Instant_Messaging")),
row.nams = c(176300L, 184332L, 469288L, 119462L, 112507L, 399236L), 
class = "data.frame")
Marly
  • 3
  • 2
  • Can you give some sample data using `dput` so that we can try to test potential solutions? – iod Nov 13 '19 at 14:41
  • @iod I have included a picture of how the data is formatted. The code below does give me the category which occurs most often (4th column 'max'), but it gives the most common category of the whole dataset, not only the most common value for each specific user each day. Do you have any idea how I can fix this? – Marly Nov 13 '19 at 17:37
  • Don't share data as images. Share a bit of your data by calling `dput(head(data))` and pasting it into your question. – iod Nov 13 '19 at 17:49
  • See revised answer below with fake data. I don't see the behaviour you're describing. – iod Nov 13 '19 at 18:08
  • @iod I used the following code as your answer suggests: categorical_data %>% group_by(user_id, date) %>% mutate(max=names(sort(table(better_category), decreasing = TRUE))[1]) But this only adds a column with the most common value, which is most common over the whole dataset instead of the most common value for each day per specific user. How is this possible? – Marly Nov 13 '19 at 18:12
  • Using the dput data you provided, I get just what you required (given the fact that every line has a unique user-date pairing). Each line has a different max (equal to better_category). – iod Nov 13 '19 at 18:13
  • try running `conflicts()` - one of the functions might be masked by another package. – iod Nov 13 '19 at 18:22
  • @iod The answer you are giving is exactly what my output should be like, thank you for that. But it is frustrating that when I am running the exact same code it is not working and it is only taking one category for all days and all users. Not sure how to find out what is causing this. – Marly Nov 13 '19 at 18:25
  • Start from a fresh session of R, and try it. If it still doesn't work, tell me what packages you're using, and what you get when you run `conflicts()`. – iod Nov 13 '19 at 18:31
  • Did you load `plyr` after loading `dplyr`? If so, you're masking `mutate`, so you're not getting the desired result. See the revised code below for a solution. – iod Nov 13 '19 at 18:40
  • 1
    @iod I think that was indeed the problem. The revised code is working! Thank you so much !! – Marly Nov 13 '19 at 18:55

1 Answers1

0

Let's create some data:

require(dplyr)
set.seed(100)
data<-data.frame(user_id=rep(c(1,2,3),10),date=rep(c("tuesday","wednesday","thursday"),each=10),category=(sample(c(1:3),30,replace=TRUE)))

If we arrange it for convenient viewing, we can get this:

    data<-data %>% arrange(user_id,date)
    data
       user_id      date category
    1        1  thursday        3
    2        1  thursday        2
    3        1  thursday        3
    4        1   tuesday        1
    5        1   tuesday        1
    6        1   tuesday        3
    7        1   tuesday        1
    8        1 wednesday        1
    9        1 wednesday        3
    10       1 wednesday        2
    11       2  thursday        2
    12       2  thursday        1
    13       2  thursday        2
    14       2   tuesday        1
    15       2   tuesday        2
    16       2   tuesday        2
    17       2 wednesday        2
    18       2 wednesday        2
    19       2 wednesday        1
    20       2 wednesday        3
    21       3  thursday        2
    22       3  thursday        3
    23       3  thursday        3
    24       3  thursday        1
    25       3   tuesday        2
    26       3   tuesday        2
    27       3   tuesday        2
    28       3 wednesday        3
    29       3 wednesday        3
    30       3 wednesday        2

Now we'll group it by user_id and date, and create a new column called max that takes the most frequent category from each group. We do this using table over `category, which creates a crosstabs of the column for each grouping:

    data %>% group_by(user_id,date) %>% 
      dplyr::mutate(max=names(sort(table(category),decreasing=TRUE))[1])

# A tibble: 30 x 4
# Groups:   user_id, date [9]
   user_id date      category max  
     <dbl> <fct>        <int> <chr>
 1       1 thursday         3 3    
 2       1 thursday         2 3    
 3       1 thursday         3 3    
 4       1 tuesday          1 1    
 5       1 tuesday          1 1    
 6       1 tuesday          3 1    
 7       1 tuesday          1 1    
 8       1 wednesday        1 1    
 9       1 wednesday        3 1    
10       1 wednesday        2 1    
# ... with 20 more rows

As you can see, each user-day grouping gets its own max. In the last example shown her (1-wednesday), there is one of each of the three categories, so the first is selected, i.e. 1.

Here is the result using your dput data (in which every line has a unique user/date pairing):

# A tibble: 6 x 4
# Groups:   user_id, date [6]
  user_id date                better_category    max               
  <fct>   <dttm>              <fct>              <chr>             
1 10257   2019-03-14 00:00:00 Email              Email             
2 10580   2019-03-08 00:00:00 Internet_Browser   Internet_Browser  
3 10280   2019-02-26 00:00:00 Instant_Messaging  Instant_Messaging 
4 10202   2019-03-02 00:00:00 News               News              
5 10275   2019-03-18 00:00:00 Background_Process Background_Process
6 10281   2019-03-14 00:00:00 Instant_Messaging  Instant_Messaging 

So I created an identical table but duplicated the last row twice and then changed one of the categories there to "News", and ran the same code:

# A tibble: 8 x 4
# Groups:   user_id, date [6]
  user_id date                better_category    max               
  <chr>   <dttm>              <chr>              <chr>             
1 10257   2019-03-14 00:00:00 Email              Email             
2 10580   2019-03-08 00:00:00 Internet_Browser   Internet_Browser  
3 10280   2019-02-26 00:00:00 Instant_Messaging  Instant_Messaging 
4 10202   2019-03-02 00:00:00 News               News              
5 10275   2019-03-18 00:00:00 Background_Process Background_Process
6 10281   2019-03-14 00:00:00 News               Instant_Messaging 
7 10281   2019-03-14 00:00:00 Instant_Messaging  Instant_Messaging 
8 10281   2019-03-14 00:00:00 Instant_Messaging  Instant_Messaging 

Note the last three rows.

iod
  • 7,412
  • 2
  • 17
  • 36
  • Code-only answers are considered low quality: make sure to provide an explanation what your code does and how it solves the problem. – help-info.de Nov 13 '19 at 15:52