0

I have data as follows, including 10 products (a, b, c, ...), and their descriptions (other variables).

I need to report how the summary statistics of other variables (median/proportion) range between products (should be printed as a minimum and maximum for each summary statistic value).

For example:

Which product has the lowest and which the highest median price (need to report only two values, product names do not matter).

Which product has the lowest and highest proportion of "bad" ratings (need to report only two values, product names do not matter).

Is there an easy way to code it? My actual data has 10,000 products and 150 other variables, eyeballing summary tables would kill me.

Data

```{r}
data.frame(
product = rep(letters[1:10], each = 2, times = 500),
price = rnorm(1000, 100, 30),
weight = rnorm(1000, 8, 2),
price_category = rep(c("expensive", "cheap"), each = 4, times = 250),
rating = replicate(1,sample(c("good", "bad"),1000,rep=TRUE)))
```

enter image description here

Dominic Comtois
  • 10,230
  • 1
  • 39
  • 61
st4co4
  • 445
  • 3
  • 10

2 Answers2

2

You can look into dplyr package which will make it easy to answer such questions.

  1. Which product has the lowest and which the highest median price
library(dplyr)
df %>%
  group_by(product) %>%
  summarise(median_price = median(price)) %>%
  slice(which.min(median_price), which.max(median_price))

#  product median_price
#  <chr>          <dbl>
#1 f               91.9
#2 e              107. 
  1. Which product has the lowest and highest proportion of "bad" ratings
df %>%
  group_by(product) %>%
  summarise(prop_of_bad_ratings = mean(rating == 'bad')) %>%
  slice(which.min(prop_of_bad_ratings), which.max(prop_of_bad_ratings))

#  product prop_of_bad_ratings
#  <chr>                 <dbl>
#1 j                      0.44
#2 c                      0.55
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • Thank you! This seems very promising. Only two more things. Is there a way to summarise all continuous variables at once? Secondly, the "rating" is a categorical variable. How to get the proportion of good vs bad ratings for each product, and then report their min and max? – st4co4 Sep 12 '20 at 12:08
  • 1
    By summarise do you mean to take `median` for all ? You can use `df %>% group_by(product) %>% summarise(across(where(is.numeric), median))` for that. For the second question, you can do the same as answer 2 above for `'good'` rating as well and then combine the results if needed. – Ronak Shah Sep 12 '20 at 12:42
1

You can also use data table. The := operator does in-place assignment (in this case creating a new medians column) by grouping on the price column with the by argument. Then you can just find the min and max values. The .N operator gets the count of the number of data values.

library(data.table)
  
dt <- data.table(data.frame(
    product = rep(letters[1:10], each = 2, times = 500),
    price = rnorm(1000, 100, 30),
    weight = rnorm(1000, 8, 2),
    price_category = rep(c("expensive", "cheap"), each = 4, times = 250),
    rating = replicate(1,sample(c("good", "bad"),1000,rep=TRUE))))
  
dt[, medians := median(price), by=product]

# Highest and lowest median price
dt[c(which.min(medians), which.max(medians)), medians]

# Calculate proportions of each product
dt[, prodcount := .N, by=product]
dt[, percent := 100 * (.N / prodcount), by=.(rating, product)]
bad <- dt[rating == 'bad',]
bad[c(which.min(percent), which.max(percent))]
at80
  • 710
  • 4
  • 8