I am using dplyr in R and I am trying to filter a tibble that contains transactional data.

The columns of interest are "Country" and "Sales".

I have a lot of countries, and for exploration purposes I want to analyze only the top 5 countries by total sales.

The trouble is that I cannot simply group and summarise, because I need to keep all the individual rows (transactional data) for further analysis.

I tried something like:

trans_merch_df %>% group_by(COUNTRY) %>% top_n(n = 5, wt = NET_SLS_AMT)

But it's completely off.

Let's say I have this:

trans_merch_df <- tibble::tribble(~COUNTRY, ~SALE,
                                  'POR',     14,
                                  'POR',     1,
                                  'DEU',     4,
                                  'DEU',     6,
                                  'POL',     8,
                                  'ITA',     1,
                                  'ITA',     1,
                                  'ITA',     1,
                                  'SPA',     1,
                                  'NOR',     50,
                                  'NOR',     10,
                                  'SWE',     42,
                                  'SWE',     1)

The result I am expecting is:

COUNTRY   SALE
POR       14
POR       1
DEU       4
DEU       6
POL       8
NOR       50
NOR       10
SWE       42
SWE       1

ITA and SPA are dropped because they are not among the top 5 countries by total sales.

Thanks a lot in advance.

Cheers!

mgiormenti
spcvalente
  • To understand what is going wrong with your approach: whenever you `group_by`, think of it as having a separate little data frame for each group; every operation is applied to each group and the results are then recombined. When you `group_by() %>% top_n()`, you're pulling the top rows **within** every group, not the top 5 groups. – Gregor Thomas Apr 01 '19 at 18:34
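
The behaviour described in this comment can be checked with a minimal sketch on the sample data from the question. The name `per_group` is only illustrative, and `SALE` is used as the weight because that is the column in the sample tibble. Since no country has more than 5 rows, the grouped `top_n()` keeps every row, including ITA and SPA.

```r
library(dplyr)

# Sample data from the question
trans_merch_df <- tibble::tribble(
  ~COUNTRY, ~SALE,
  "POR", 14,
  "POR", 1,
  "DEU", 4,
  "DEU", 6,
  "POL", 8,
  "ITA", 1,
  "ITA", 1,
  "ITA", 1,
  "SPA", 1,
  "NOR", 50,
  "NOR", 10,
  "SWE", 42,
  "SWE", 1
)

# top_n() runs separately inside each COUNTRY group,
# so it returns the top 5 rows of every country
per_group <- trans_merch_df %>%
  group_by(COUNTRY) %>%
  top_n(n = 5, wt = SALE) %>%
  ungroup()

# No country has more than 5 rows, so nothing was filtered out
nrow(per_group) == nrow(trans_merch_df)
```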

2 Answers

A different dplyr possibility could be:

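trans_merch_df %>%
 add_count(COUNTRY, wt = SALE) %>%      # n = total SALE per country, on every row
 mutate(n = dense_rank(desc(n))) %>%    # overwrite n with the country's rank (1 = highest)
 filter(n %in% 1:5) %>%
 select(-n)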


  COUNTRY  SALE
  <chr>   <int>
1 POR        14
2 POR         1
3 DEU         4
4 DEU         6
5 POL         8
6 NOR        50
7 NOR        10
8 SWE        42
9 SWE         1

Or even more concise:

trans_merch_df %>%
 add_count(COUNTRY, wt = SALE) %>%
 filter(dense_rank(desc(n)) %in% 1:5) %>%
 select(-n)
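
To see why this works, it may help to inspect the intermediate columns before filtering. The sketch below uses the sample data from the question; the names `ranked` and `rank` are illustrative and not part of the answer. `add_count()` with `wt` repeats each country's total on every one of its rows, and `dense_rank(desc(n))` then turns those totals into ranks.

```r
library(dplyr)

# Sample data from the question
trans_merch_df <- tibble::tribble(
  ~COUNTRY, ~SALE,
  "POR", 14,
  "POR", 1,
  "DEU", 4,
  "DEU", 6,
  "POL", 8,
  "ITA", 1,
  "ITA", 1,
  "ITA", 1,
  "SPA", 1,
  "NOR", 50,
  "NOR", 10,
  "SWE", 42,
  "SWE", 1
)

ranked <- trans_merch_df %>%
  add_count(COUNTRY, wt = SALE) %>%      # n = total SALE per country, repeated on each row
  mutate(rank = dense_rank(desc(n)))     # 1 = largest total; tied totals share a rank

# One row per country, ordered by rank, to inspect totals and ranks
ranked %>%
  distinct(COUNTRY, n, rank) %>%
  arrange(rank)

# Keeping ranks 1 to 5 leaves only the rows of the five best-selling countries
top5_rows <- ranked %>% filter(rank <= 5)
```

Because `dense_rank()` gives tied totals the same rank, this filter could keep more than five countries when totals are tied.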
tmfmnk
  • Perfect. The first one is quite handy because I can keep the rank, which may be useful later. I did it like this: `trans_merch_df %>% add_count(COUNTRY, wt = NET_SLS_AMT) %>% mutate(rank = dense_rank(desc(n))) %>% filter(rank %in% 1:10) %>% select(-n)` – spcvalente Apr 02 '19 at 09:08

Here's an approach using a join.

library(dplyr)
trans_merch_df %>% 
  # First find the top 5 countries by total sales; count() here is equivalent to
  #    group_by(COUNTRY) %>% summarize(n = sum(SALE))
  count(COUNTRY, wt = SALE, sort = TRUE) %>%    
  top_n(n = 5, wt = n) %>%

  # now add back orig data for those countries
  left_join(trans_merch_df)

#Joining, by = "COUNTRY"
## A tibble: 9 x 3
#  COUNTRY     n  SALE
#  <chr>   <int> <int>
#1 NOR        60    50
#2 NOR        60    10
#3 SWE        43    42
#4 SWE        43     1
#5 POR        15    14
#6 POR        15     1
#7 DEU        10     4
#8 DEU        10     6
#9 POL         8     8
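
If only the original columns and row order are wanted, one possible variation on this join idea is a filtering join. This is a sketch under assumptions, not part of the answer above: the names `top5`, `total` and `result` are illustrative, and `semi_join()` is swapped in for `left_join()` so that the helper column is not carried into the result.

```r
library(dplyr)

# Sample data from the question
trans_merch_df <- tibble::tribble(
  ~COUNTRY, ~SALE,
  "POR", 14,
  "POR", 1,
  "DEU", 4,
  "DEU", 6,
  "POL", 8,
  "ITA", 1,
  "ITA", 1,
  "ITA", 1,
  "SPA", 1,
  "NOR", 50,
  "NOR", 10,
  "SWE", 42,
  "SWE", 1
)

# One row per country with its total sales, keeping the 5 largest totals
top5 <- trans_merch_df %>%
  count(COUNTRY, wt = SALE, name = "total") %>%
  top_n(n = 5, wt = total)

# semi_join() keeps the rows of the left table that have a match in top5,
# without adding any columns from top5
result <- trans_merch_df %>%
  semi_join(top5, by = "COUNTRY")

result
```

The trade-off is that the per-country total is no longer available in the output; the `left_join()` version keeps it as the `n` column.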
Jon Spring
  • Thanks! It works, but that is an SQL-like approach and I was expecting something more "dplyr-y" or "R-y" :) Good to understand this approach though. Thanks! – spcvalente Apr 02 '19 at 09:06