I'm trying to find popular words in a string using R, which is probably easiest to explain with an example.
Taking this as the input (with millions of entries, where each date can appear thousands of times):
            IncorporationDate CompanyName
    3007931 2003-05-12        OUTLANE BUSINESS CONSULTANTS LIMITED
     692999 2013-03-28        AGB SERVICES ANGLIA LIMITED
    2255234 2008-05-22        CIDA INTERNATIONAL LIMITED
     310577 2017-09-19        FA IT SERVICES LIMITED
    2020738 2012-09-03        THE SPARES SHOP LIMITED
    2776144 2006-02-03        ANGELVIEW PROPERTIES LIMITED
    2420435 2017-10-17        SHANE WARD TM LIMITED
    2523165 2014-06-04        THE INDEPENDENT GIN COMPANY LTD
    2594847 2015-05-05        AIA ENGINEERING LTD
    2701395 2015-05-27        LAURA BRIDGES LIMITED
I want to find the top 10 most popular words used in each year, with the result looking something like this:
| Year | Top1    | Top1_Count | Top2 | Top2_Count | ... |
| ---- | ------- | ---------- | ---- | ---------- | --- |
| 2017 | LIMITED | 2          | IT   | 1          | ... |
| ...  |         |            |      |            |     |
The closest I've got so far is:
    words <- data.frame(table(unlist(strsplit(tolower(df$SText), " "))))
but that loses the year data, only giving a full total across the entire data frame.
I've also played around with summarize from dplyr, but haven't found a way to get it to do what I want.
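To make the problem reproducible, here is a minimal base R sketch using a few rows from the sample above. It shows one way the year could be kept attached to each word before counting. The names `word_list`, `years` and `counts_by_year` are just illustrative, and I'm assuming the year should come from `IncorporationDate` and the words from `CompanyName`; I haven't checked how well this scales to millions of rows.

```r
# Small sample taken from the excerpt above (the real data has millions of rows).
df <- data.frame(
  IncorporationDate = as.Date(c("2003-05-12", "2013-03-28",
                                "2017-09-19", "2017-10-17")),
  CompanyName = c("OUTLANE BUSINESS CONSULTANTS LIMITED",
                  "AGB SERVICES ANGLIA LIMITED",
                  "FA IT SERVICES LIMITED",
                  "SHANE WARD TM LIMITED"),
  stringsAsFactors = FALSE
)

# Split each company name on spaces; one list element per row.
word_list <- strsplit(df$CompanyName, " ", fixed = TRUE)

# Repeat each row's year once per word so the year stays attached to every word.
years <- rep(format(df$IncorporationDate, "%Y"), lengths(word_list))

# Long table of counts: one row per (year, word) pair that actually occurs.
counts_by_year <- as.data.frame(
  table(year = years, word = unlist(word_list)),
  stringsAsFactors = FALSE
)
counts_by_year <- counts_by_year[counts_by_year$Freq > 0, ]
```

This gives a long table with columns `year`, `word` and `Freq`, not yet the wide Top1/Top2 layout shown above.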
Edit: using the answer from @maurits-evers I've got a bit further, and found the top 10 words per year using this:
    top_words_by_year <- words_by_year %>%
      group_by(year) %>%
      top_n(n = 10, wt = n)
Now I'm just trying to figure out how to get it into the wide shape I need (one row per year, with a column pair for each rank).
Thanks