Can you please clarify for me when to use the count() versus the n() function?

This will help me understand why the following two code snippets gave different outputs.

Programming in R

Code 1

# dplyr provides %>%, distinct(), group_by(), count(), filter() and arrange()
library(dplyr)

fueleconomy::vehicles %>% 
  distinct(model, make ) %>%
  group_by( model ) %>%
  count() %>%
  filter( n > 1 ) %>%
  arrange( desc( n ))

Code 1 output

A tibble: 60 x 2
Groups: model [60]

Code 2

fueleconomy::vehicles %>% 
  distinct(model, make ) %>%
  group_by( model ) %>%
  filter( n() > 1 ) %>%
  arrange( model )

Code 2 output

A tibble: 126 x 2
Groups: model [60]

Note: I was expecting the two snippets to give the same output, but they didn't. So I'm confused and would like to understand the main difference between n() and count(). Also, when should one be preferred over the other?
Can both be used together in certain circumstances?

P.S. I'm a beginner with no programming background and I'm self-taught, so please be gentle.

Thank you in advance for your help.

Amaks
`count` is a dplyr verb, so it can be used in a pipeline: `BOD %>% count`. It outputs a data frame. `n()` is not a dplyr verb. It can only be used inside another dplyr verb, such as summarize: `BOD %>% summarize(n = n())`. It outputs a numeric scalar. – G. Grothendieck May 06 '21 at 13:01

2 Answers

For grouped data, dplyr's count() is roughly equivalent to summarize(n = n()). Because the data are summarized, only one row is returned per model. The summary creates a new column n, keeps the grouping variable (model), and discards the other variables (such as make in your case).

When you use filter(n() > 1), no summarizing takes place. n() only supplies the size of the current group, so filter() keeps every row of each model that has more than one row. This approach does not create a new n column, and it does not discard the non-grouping columns.
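
As a minimal sketch of that difference, the three pipelines can be run side by side on a tiny table. The `toy` data frame and the result names are made up for illustration and are not taken from fueleconomy.

```r
library(dplyr)

# Hypothetical toy data (not from fueleconomy): model "200" is sold by two makes.
toy <- tibble(
  make  = c("Audi", "Chrysler", "Volvo"),
  model = c("200",  "200",      "240")
)

# count() on grouped data collapses each group to one row with a column n.
counted <- toy %>% group_by(model) %>% count()

# summarize(n = n()) produces the same counts.
summarized <- toy %>% group_by(model) %>% summarize(n = n())

# filter(n() > 1) keeps the original rows of every group with more than one row.
kept <- toy %>% group_by(model) %>% filter(n() > 1)

nrow(counted)  # one row per model
names(kept)    # make is still present, and no n column was added
```

The two aggregated results differ only in their grouping metadata, while `kept` still has the shape of the input.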

Ben Norris

You cannot compare the two functions in isolation. What matters is where each one sits in the pipeline, and which functions are applied before and after it.

In this case, applying count() gives you one row for each model. The result is an aggregated data frame.

library(dplyr)

count_data <- fueleconomy::vehicles %>% 
  distinct(model, make ) %>%
  group_by( model ) %>%
  count() %>%
  filter( n > 1 ) %>%
  arrange( desc( n ))

count_data

#   model                   n
#   <chr>               <int>
# 1 Coachbuilder Wagon      3
# 2 Conquest                3
# 3 Laser                   3
# 4 Limousine               3
# 5 Truck 2WD               3
# 6 Truck 4WD               3
# 7 200                     2
# 8 240 DL/240 GL Wagon     2
# 9 300E                    2
#10 300SL                   2
# … with 50 more rows

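Note the output. After distinct(model, make), each row is one model/make combination, so n = 3 for 'Coachbuilder Wagon' means that this model appears under 3 different makes, and likewise for 'Conquest' and so on.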

Now compare it with the n() output.

n_Data <- fueleconomy::vehicles %>% 
  distinct(model, make ) %>%
  group_by( model ) %>%
  filter( n() > 1 ) %>%
  arrange( model )

n_Data

#   make                   model              
#   <chr>                  <chr>              
# 1 Audi                   200                
# 2 Chrysler               200                
# 3 Mcevoy Motors          240 DL/240 GL Wagon
# 4 Volvo                  240 DL/240 GL Wagon
# 5 Lambda Control Systems 300E               
# 6 Mercedes-Benz          300E               
# 7 J.K. Motors            300SL              
# 8 Mercedes-Benz          300SL              
# 9 Mercedes-Benz          500SE              
#10 Texas Coach Company    500SE              
# … with 116 more rows

This is not an aggregated data frame: each model still has multiple rows, one per make.
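
If you want the original rows and the counts together, one possible approach is dplyr's add_count(), or n() inside mutate(). The sketch below uses a made-up `toy` table as a stand-in for the distinct(model, make) result. The names `with_n` and `with_n2` are only illustrative.

```r
library(dplyr)

# Hypothetical toy data standing in for the distinct(model, make) output.
toy <- tibble(
  make  = c("Audi", "Chrysler", "Volvo", "Mcevoy Motors", "Saab"),
  model = c("200",  "200",      "240",   "240",           "900")
)

# add_count() appends the group size as column n without collapsing rows,
# so filter() can then use the column n instead of calling n().
with_n <- toy %>%
  add_count(model) %>%
  filter(n > 1) %>%
  arrange(model)

# The same rows and counts, with n() used inside mutate() on grouped data.
with_n2 <- toy %>%
  group_by(model) %>%
  mutate(n = n()) %>%
  filter(n > 1) %>%
  ungroup() %>%
  arrange(model)

with_n
```

Both versions keep the make column and add n, so they combine what the count() pipeline and the n() pipeline each give separately.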

How are these two outputs related? The counts in count_data add up to the number of rows in n_Data:

sum(count_data$n)
#[1] 126

nrow(n_Data)
#[1] 126
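
The same relationship can be checked without the fueleconomy package. This is a small sketch on a made-up `toy` table, and `toy_counts` and `toy_rows` are hypothetical names.

```r
library(dplyr)

# Hypothetical toy data: two models shared by two makes each, one unique model.
toy <- tibble(
  make  = c("Audi", "Chrysler", "Volvo", "Mcevoy Motors", "Saab"),
  model = c("200",  "200",      "240",   "240",           "900")
)

toy_counts <- toy %>% group_by(model) %>% count() %>% filter(n > 1)
toy_rows   <- toy %>% group_by(model) %>% filter(n() > 1)

# Each count says how many rows that model contributes to toy_rows.
sum(toy_counts$n) == nrow(toy_rows)

# Both results cover the same set of models.
n_distinct(toy_rows$model) == nrow(toy_counts)
```

Both comparisons evaluate to TRUE here, which mirrors why the two original outputs report the same number of model groups but different row counts.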
Ronak Shah