0

I have a tbl_df that has several columns that have multiple values in them. I am looking to use the values in the columns to create several columns. After that, I'm looking to summarize the columns.

One way I can go about it is to create several ifelse within a mutate but that seems inefficient. Is there a better way to go about this? I'm thinking that there is probably a dplyr and/or tidyr based solution.

Example of what I'm looking to do is below. It's only a sampling of the data and columns. It doesn't contain all of the columns that I'm looking to create. The summary table will have some sum and mean based columns.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

df <- tibble::tribble(
  ~type,      ~bb_type,           ~description,
  "B",            NA,                 "ball",
  "S",            NA,                 "foul",
  "X",  "line_drive", "hit_into_play_no_out",
  "S",            NA,      "swinging_strike",
  "S",            NA,                 "foul",
  "X", "ground_ball",        "hit_into_play",
  "S",            NA,      "swinging_strike",
  "X",    "fly_ball",  "hit_into_play_score",
  "B",            NA,                 "ball",
  "S",            NA,                 "foul"
)


df <- df %>% 
  mutate(ground_ball = ifelse(bb_type == "ground_ball", 1, 0),
         fly_ball = if_else(bb_type == "fly_ball", 1, 0),
         X = if_else(type == "X", 1, 0),
# not sure if this is the based way to go about counting columns that start with swinging to sum later
         swinging_strike = grepl("^swinging", description))

df
#> # A tibble: 10 x 7
#>    type  bb_type    description       ground_ball fly_ball     X swinging_strike
#>    <chr> <chr>      <chr>                   <dbl>    <dbl> <dbl> <lgl>          
#>  1 B     <NA>       ball                       NA       NA     0 FALSE          
#>  2 S     <NA>       foul                       NA       NA     0 FALSE          
#>  3 X     line_drive hit_into_play_no…           0        0     1 FALSE          
#>  4 S     <NA>       swinging_strike            NA       NA     0 TRUE           
#>  5 S     <NA>       foul                       NA       NA     0 FALSE          
#>  6 X     ground_ba… hit_into_play               1        0     1 FALSE          
#>  7 S     <NA>       swinging_strike            NA       NA     0 TRUE           
#>  8 X     fly_ball   hit_into_play_sc…           0        1     1 FALSE          
#>  9 B     <NA>       ball                       NA       NA     0 FALSE          
#> 10 S     <NA>       foul                       NA       NA     0 FALSE

summary_df <- df %>% 
  summarize(n = n(),
            fly_ball = sum(fly_ball, na.rm = TRUE),
            ground_ball = sum(ground_ball, na.rm = TRUE))

summary_df
#> # A tibble: 1 x 3
#>       n fly_ball ground_ball
#>   <int>    <dbl>       <dbl>
#> 1    10        1           1

In summary, I'm looking to do the following:

  1. Create new columns for all of the values in bb_type and type that counts them
  2. Create a new column that counts the number of values that start with swinging in the description column. I'd like to see an example that chooses another text string from that column and creates a new column with the count as an additional example. Ex. ball
  3. How would I choose my own name while doing what I'm looking to achieve in 1 and 2? Would I have to simply use dplyr::rename after the fact?
Jazzmatazz
  • 615
  • 7
  • 18

3 Answers3

2

Using dplyr and tidyr you can do something like this. First, you can group by the "bb_type" variable by specifying .drop = FALSE, in order dplyr keep NA values. Then, you can count them and get the sum of all counted values and finally use pivot_wider to get the data being displayed in the orientation you are looking for:

library(dplyr)
library(tidyr)
df %>% group_by(bb_type, .drop = FALSE) %>%
  count() %>% 
  ungroup() %>% mutate(Sum = sum(n)) %>% 
  pivot_wider(.,names_from = bb_type,values_from = n) 

# A tibble: 1 x 5
    Sum fly_ball ground_ball line_drive  `NA`
  <int>    <int>       <int>      <int> <int>
1    10        1           1          1     7

Is it what you are looking for ?

dc37
  • 15,840
  • 4
  • 15
  • 32
  • Do you want to count the number of values in "type" and "description" variables ? – dc37 Feb 08 '20 at 18:13
  • For type I’d like to count the values. How could I do that, while also changing the created column name without having to do a rename after? For description, I’m looking to only count and create specific values like swinging, and foul. – Jazzmatazz Feb 08 '20 at 18:18
  • @Jazzmatazz I think you are adding requirements that are as yet vaguely described. If you want the "description" to be summarized more compactly than it's values, then you need to give specific rules for that action. – IRTFM Feb 08 '20 at 18:20
  • Sorry, I have trouble to understand what you are looking for. Can you edit your question to provide what is your expected output (including the output for description and type column) ? – dc37 Feb 08 '20 at 18:25
  • It would make more sense to load `tidyr` than `tidyverse` since you're already loading `dplyr`. Also, `group_by(bb_type, .drop = FALSE) %>% count() %>% ungroup()` can just be `count(bb_type, .drop = FALSE)`. Finally, on the last line, the `.,` is superfluous, and it can just be `pivot_wider(names_from = bb_type, values_from = n)` – Phil Feb 08 '20 at 18:43
  • @Phil, you're right for `tidyr` my mistake ! Regarding the `.,` I know that is useless but I like to keep it as a reminder of the definition of the dataframe. – dc37 Feb 08 '20 at 18:49
  • @Phil, and dc37. Any thoughts based on my amended question? – Jazzmatazz Feb 08 '20 at 19:10
  • 1
    @Jazzmatazz I don't understand how your question is not answered by what's already been provided. It would be helpful if you could provide what the desired final output would look like given your data example. – Phil Feb 08 '20 at 19:57
2

We can use table with addmargins from base R

addmargins(table(df$bb_type, useNA = 'always'), 1)
#   fly_ball ground_ball  line_drive        <NA>         Sum 
#          1           1           1           7          10 
akrun
  • 874,273
  • 37
  • 540
  • 662
  • @Jazzmatazz Do you need `lapply(df, function(x) addmargins(table(x, useNA = 'always'), 1))` – akrun Feb 08 '20 at 20:04
2

This appears to be a request for a tabulation with a subsequent count of the entries in that tabulation

tb_df <- table(df$bb_type, useNA="always") 

c(Sum=sum(tb_df), tb_df)
        Sum    fly_ball ground_ball  line_drive        <NA> 
         10           1           1           1           7 

If you wanted it as a dataframe you would first turn it into a named list:

data.frame( as.list(  c(Sum=sum(tb_df), tb_df) ) )
  Sum fly_ball ground_ball line_drive NA.
1  10        1           1          1   7

If you wanted this done on all columns then first make a function that handles one column and lapply it to the tbl_df:

tally_col <- function(x){ tb <- table(x, useNA="always") 
 tal <- c(Sum=sum(tb), tb); data.frame( as.list(tal)) }

lapply(df, tally_col)
# ---output---
$type
  Sum B S X NA.
1  10 2 5 3   0

$bb_type
  Sum fly_ball ground_ball line_drive NA.
1  10        1           1          1   7

$description
  Sum ball foul hit_into_play hit_into_play_no_out hit_into_play_score swinging_strike NA.
1  10    2    3             1                    1                   1               2   0
IRTFM
  • 258,963
  • 21
  • 364
  • 487