I am currently working with a data set of Reddit comments, all taken from Christmas Day, 2017:
load('reddit_xmas_2017.RData')
reddit %>% print
# A tibble: 100,000 x 3
author body created_utc
<chr> <chr> <dttm>
1 br_shadow Thank you for this, there is a person writ… 2017-12-25 15:49:08
2 Ksalol They are not to quick actually. It's mainl… 2017-12-25 17:42:50
3 itscool83 tell her you guys should hang out when you… 2017-12-25 18:54:13
4 Glu7enFree "Autism is a high honor in the tech savvy … 2017-12-25 07:48:17
5 Theotheogrea… "You thought a cat was your son?! " 2017-12-25 20:58:08
6 Shadrac121 Hopfully she takes wat people say in and m… 2017-12-25 22:27:31
7 1fzUjhemoSB1… Si ce propui sa facem cu toata pielea rama… 2017-12-25 07:41:31
8 MinisterOfEd… "I don't mean to be impolite, but if you'r… 2017-12-25 19:28:35
9 AabidS10 i dont have a 720p x265 of it, sorry. i be… 2017-12-25 13:20:32
10 S3RG10 "I'm dying to try Guatemalan sandals and w… 2017-12-25 00:48:46
# … with 99,990 more rows
I am trying to condense this data so that I have hourly counts of mentions of the words "snow" and "flakes".
I have the following code, which gives me the hour and a per-comment count (stored in "body") of mentions of either "snow" or "flakes". The commented-out line with count is where my question lies.
reddit %>%
  mutate(body = str_count(str_to_lower(body), "snow|flakes")) %>%
  filter(body != 0) %>%
  select(created_utc, body) %>%
  separate(created_utc, c("year", "month", "day"), "-") %>%
  separate(day, c("day", "time"), " ") %>%
  separate(time, c("hour", "minute", "second"), ":") %>%
  select(hour, body) %>%
  arrange(hour) %>%
  #count(hour, body, sort = T) %>%
  print(n = 20)
# A tibble: 271 x 2
hour body
<chr> <int>
1 00 1
2 00 1
3 00 1
4 00 3
5 00 1
6 00 1
7 00 1
8 00 1
9 00 1
10 00 1
11 00 1
12 00 2
13 00 1
14 00 1
15 00 1
16 00 1
17 01 1
18 01 2
19 01 1
20 01 1
# … with 251 more rows
Is there a way to sum these rows up by hour using the count function? Or, if I were to use something like summarise, how could I do that?
I hope to get a table that simply has hour (00-23) and body as the summed count for each hour, looking like this:
# A tibble: 24 x 2
hour body
<chr> <int>
1 00 sum(body at hour == 0)
2 01 sum(body at hour == 1)
3 02 etc...
4 03 etc...
I am currently only working with tidyverse, stringr, and tidytext, so I do not have the option of using data.table.
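For reference, a minimal sketch of the two routes I am asking about, run on a made-up five-row tibble rather than the real data. The names `toy`, `by_hour_summarise`, and `by_hour_count` and all the values are hypothetical. It assumes dplyr is loaded and that `count()` accepts the `wt` and `name` arguments, as it does in recent dplyr versions:

```r
library(dplyr)

# Hypothetical stand-in for the filtered hour/body table shown above
toy <- tibble(
  hour = c("00", "00", "00", "01", "01"),
  body = c(1L, 3L, 1L, 2L, 1L)
)

# Route 1: group by hour, then sum the per-comment counts within each hour
by_hour_summarise <- toy %>%
  group_by(hour) %>%
  summarise(body = sum(body))

# Route 2: count() with wt sums the weight column instead of counting rows;
# name sets the output column name (it would otherwise be n)
by_hour_count <- toy %>%
  count(hour, wt = body, name = "body")

print(by_hour_summarise)
print(by_hour_count)
```

If this is right, both routes should give one row per hour, and either one could replace the commented-out count line in my pipeline.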