I'd like to recreate a plot from the Text Mining with R web textbook, but with my own data. It essentially searches for the top terms per year and graphs them (Figure 5.4: http://tidytextmining.com/dtm.html). My data is a bit cleaner than what they started with, but I'm new to R. My data has a "Date" column in 2016-01-01 format (it is of class Date). I only have data from 2016, so I want to do the same thing at a finer granularity (i.e., by month or by day).

library(dplyr)
library(tidyr)
library(ggplot2)

year_term_counts <- inaug_td %>%
    extract(document, "year", "(\\d+)", convert = TRUE) %>%
    complete(year, term, fill = list(count = 0)) %>%
    group_by(year) %>%
    mutate(year_total = sum(count))

year_term_counts %>%
    filter(term %in% c("god", "america", "foreign", "union", "constitution",
                       "freedom")) %>%
    ggplot(aes(year, count / year_total)) +
    geom_point() +
    geom_smooth() +
    facet_wrap(~ term, scales = "free_y") +
    scale_y_continuous(labels = scales::percent_format()) +
    ylab("% frequency of word in inaugural address")

The idea is that I would choose specific words from my text and see how their frequency changes over the months.

Thank you!

Alex
  • Welcome to SO: Have you tried breaking down the `year_term_counts` by function to examine the intermediate steps? Are you building up the results as you expect? It would help us to see some data. – Shawn Mehan Jun 13 '17 at 15:53
  • You should consider using the `month` function in the `lubridate` package to make an entire column with the month in it. – ccapizzano Jun 13 '17 at 15:54
  • I'll check out the month function, thanks! – Alex Jun 13 '17 at 15:58
  • I have found `as.character %>% substr(start=6,stop=7) %>% as.numeric` to be a very effective workaround if the date format is consistently YYYY-MM-DD. – mzuba Jun 13 '17 at 16:14
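
The two suggestions in the comments above can be compared in a short sketch. The `dates` vector is made up for illustration and stands in for the asker's "Date" column; `month()` comes from lubridate, and the second approach is the substring workaround written out in base R.

```r
library(lubridate)

# Hypothetical dates, standing in for the "Date" column (an assumption)
dates <- as.Date(c("2016-01-15", "2016-02-03", "2016-11-30"))

# lubridate: month() returns the month number of each date
month_lubridate <- month(dates)

# Base R workaround: take characters 6-7 of the "YYYY-MM-DD" string
month_substr <- as.numeric(substr(as.character(dates), start = 6, stop = 7))

month_lubridate
month_substr
```

Both approaches should agree as long as the dates print in YYYY-MM-DD form; the lubridate version does not depend on the string format.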

1 Answer

If you want to look at smaller units of time, based on a date column that you already have, I would recommend looking at the floor_date() or round_date() functions from lubridate. The particular chapter of our book you linked to deals with taking a document-term matrix and then tidying it, etc. Have you already gotten your data into a tidy text format? If so, then you could do something like this:

library(dplyr)
library(lubridate)
library(ggplot2)

date_counts <- tidy_text %>%
    mutate(date = floor_date(Date, unit = "7 days")) %>% # use whatever time unit you want here
    count(date, word) %>%
    group_by(date) %>%
    mutate(date_total = sum(n))

date_counts %>%
    filter(word %in% c("PUT YOUR LIST OF WORDS HERE")) %>%
    ggplot(aes(date, n / date_total)) +
    geom_point() +
    geom_smooth() +
    facet_wrap(~ word, scales = "free_y")
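
To see what the bucketing step does, here is a minimal sketch of the same counting pipeline on made-up data, using monthly buckets instead of 7-day ones. The `tidy_text` tibble, its words, and the `share` column are all hypothetical; only `floor_date()`, `count()`, and the grouping logic come from the code above.

```r
library(dplyr)
library(lubridate)

# Hypothetical tidy text data: one row per word occurrence (an assumption,
# standing in for real data with a "Date" column of class Date)
tidy_text <- tibble(
  Date = as.Date(c("2016-01-03", "2016-01-20", "2016-01-20",
                   "2016-02-05", "2016-02-05", "2016-02-28")),
  word = c("budget", "budget", "health", "health", "health", "budget")
)

month_counts <- tidy_text %>%
  # floor_date() maps every date to the first day of its month
  mutate(date = floor_date(Date, unit = "month")) %>%
  count(date, word) %>%
  group_by(date) %>%
  # total words per month, and each word's share of that total
  mutate(date_total = sum(n),
         share = n / date_total) %>%
  ungroup()

month_counts
```

Each row of `month_counts` is one word in one month, so it can be filtered and passed to ggplot() in the same way as `date_counts`.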
Julia Silge
  • Thanks, Julia! I've been reading your new book. I'm new to R, but it's super helpful. – Alex Jun 14 '17 at 12:52