
I am currently working with a data set of Reddit comments, all taken from Christmas Day, 2017:

library(tidyverse)

load('reddit_xmas_2017.RData')
reddit %>% print

# A tibble: 100,000 x 3
   author        body                                        created_utc        
   <chr>         <chr>                                       <dttm>             
 1 br_shadow     Thank you for this, there is a person writ… 2017-12-25 15:49:08
 2 Ksalol        They are not to quick actually. It's mainl… 2017-12-25 17:42:50
 3 itscool83     tell her you guys should hang out when you… 2017-12-25 18:54:13
 4 Glu7enFree    "Autism is a high honor in the tech savvy … 2017-12-25 07:48:17
 5 Theotheogrea… "You thought a cat was your son?! "         2017-12-25 20:58:08
 6 Shadrac121    Hopfully she takes wat people say in and m… 2017-12-25 22:27:31
 7 1fzUjhemoSB1… Si ce propui sa facem cu toata pielea rama… 2017-12-25 07:41:31
 8 MinisterOfEd… "I don't mean to be impolite, but if you'r… 2017-12-25 19:28:35
 9 AabidS10      i dont have a 720p x265 of it, sorry. i be… 2017-12-25 13:20:32
10 S3RG10        "I'm dying to try Guatemalan sandals and w… 2017-12-25 00:48:46
# … with 99,990 more rows

I am trying to condense this data down so that I have hourly counts of both the words "snow" and "flakes".

I have the following code, which gives me an hour and a count (called "body") of mentions of either "snow" or "flakes" in each comment. The commented-out line with count is where my question lies.

reddit %>%
    mutate(body = str_count(str_to_lower(body), "snow|flakes")) %>%
    filter(body != 0) %>%
    select(created_utc, body) %>%
    separate(created_utc, c("year", "month", "day"), "-") %>%
    separate(day, c("day", "time"), " ") %>%
    separate(time, c("hour", "minute", "second"), ":") %>%
    select(hour, body) %>%
    arrange(hour) %>%
    #count(hour, body, sort = T) %>%
    print(n = 20)

# A tibble: 271 x 2
   hour   body
   <chr> <int>
 1 00        1
 2 00        1
 3 00        1
 4 00        3
 5 00        1
 6 00        1
 7 00        1
 8 00        1
 9 00        1
10 00        1
11 00        1
12 00        2
13 00        1
14 00        1
15 00        1
16 00        1
17 01        1
18 01        2
19 01        1
20 01        1
# … with 251 more rows

Is there a way to sum up the rows that share the same hour using the count function? Or, if I were to use something like summarise instead, how would I do that?

I would like a table with one row per hour (00-23), where body is the total count for that hour, looking like this:

# A tibble: 24 x 2
   hour   body
   <chr> <int>
 1 00        sum(body at hour == 0)
 2 01        sum(body at hour == 1)
 3 02        etc... 
 4 03        etc... 

I am currently only working with tidyverse, stringr, and tidytext, so I do not have the option of using data.table.
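
For reference, here is a minimal sketch of the two approaches asked about, using dplyr only. The `hourly` tibble is a small made-up stand-in for the filtered data above (the values are hypothetical, not taken from the real data set), and the names `by_count` and `by_summarise` are just for illustration. It assumes a dplyr version in which `count()` accepts the `wt` and `name` arguments.

```r
library(dplyr)

# Hypothetical stand-in for the hour/body table shown above
hourly <- tibble(
  hour = c("00", "00", "00", "01", "01"),
  body = c(1L, 3L, 1L, 2L, 1L)
)

# Option 1: count() with a weight sums `body` within each hour
# instead of counting rows; `name` keeps the column called "body"
by_count <- hourly %>%
  count(hour, wt = body, name = "body")

# Option 2: the equivalent group_by() + summarise()
by_summarise <- hourly %>%
  group_by(hour) %>%
  summarise(body = sum(body))

print(by_count)
print(by_summarise)
```

Both options should return one row per hour that is present in the input. An hour with no matching comments would not appear in either result, so the table could have fewer than 24 rows.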

PageSim