0

I'm trying to find popular words in a string using R, which is probably easiest to explain with an example.

Taking this as the input (with millions of entries, where each date can appear thousands of times)

        IncorporationDate                          CompanyName
3007931        2003-05-12 OUTLANE BUSINESS CONSULTANTS LIMITED
692999         2013-03-28          AGB SERVICES ANGLIA LIMITED
2255234        2008-05-22           CIDA INTERNATIONAL LIMITED
310577         2017-09-19               FA IT SERVICES LIMITED
2020738        2012-09-03              THE SPARES SHOP LIMITED
2776144        2006-02-03         ANGELVIEW PROPERTIES LIMITED
2420435        2017-10-17                SHANE WARD TM LIMITED
2523165        2014-06-04      THE INDEPENDENT GIN COMPANY LTD
2594847        2015-05-05                  AIA ENGINEERING LTD
2701395        2015-05-27                LAURA BRIDGES LIMITED

I want to find the top 10 most popular words used in each year, with the result looking something like this:

| Year | Top1    | Top1_Count | Top2 | Top2_Count | ...
| ---- | ------- | ---------- | ---- | ---------- | 
| 2017 | LIMITED | 2          | IT   | 1          |
| ...

The closest I've got so far is:

words <- data.frame(table(unlist(strsplit(tolower(df$SText, " "))))

but that loses the year data, only giving a full total across the entire data frame.

I've also played around with summarize from dplyr, but haven't found a way to get it to do what I want.

edit: using the answer from @maurits-evers I've got a bit further, and found the top 10 using this:

top_words_by_year <- words_by_year %>% group_by(year) %>% top_n(n = 10, wt = n)

just trying to figure out how to get it into the shape I need

Thanks

Ian
  • 3
  • 3
  • 1
    Can you use `dput` to add a representative example of your data, and perhaps add the `dplyr` code you've tried so far? – cmaher Mar 25 '18 at 21:30
  • (FYI, *year* is just another column, not another dimension. Another dimension would be something else after 'rows', 'columns'... e.g. 'slice'/'depth'/'layer'/etc) – smci Mar 25 '18 at 21:35
  • @cmaher I've updated the example to include real data - the full dataset contains many more columns but none of them are relevant to the question edit: The only `dplyr` code I still have as a comment in a script is `year_words %>% group_by(Time, Word)%>%summarise(count=n())`, but that doesn't even match the structure of the data frame, so I'm not sure exactly what I've tried (only that none of it really helped me get closer to my goal) – Ian Mar 25 '18 at 21:39
  • Really your issue was just converting plaintext to dataframe. `read.table()` as @MauritsEvers showed. – smci Mar 25 '18 at 22:24

1 Answers1

1

You could do something like this:

library(tidyverse);
df %>%
    mutate(year = format(as.Date(IncorporationDate, format = "%Y-%m-%d"), "%Y")) %>%
    group_by(year) %>%
    mutate(words = strsplit(as.character(CompanyName), " ")) %>%
    unnest() %>%
    count(year, words);
#  year  words             n
#<chr> <chr>         <int>
#1 2003  BUSINESS          1
#2 2003  CONSULTANTS       1
#3 2003  LIMITED           1
#4 2003  OUTLANE           1
#5 2006  ANGELVIEW         1
#6 2006  LIMITED           1
#7 2006  PROPERTIES        1
#8 2008  CIDA              1
#9 2008  INTERNATIONAL     1
#10 2008  LIMITED           1
## ... with 26 more rows

Explanation: Extract year from IncorporationDate, group by year, split CompanyName into words, unnest, and count the number of words per year.


Sample data

df <- read.table(text =
    "IncorporationDate                          CompanyName
3007931        2003-05-12 'OUTLANE BUSINESS CONSULTANTS LIMITED'
692999         2013-03-28          'AGB SERVICES ANGLIA LIMITED'
2255234        2008-05-22           'CIDA INTERNATIONAL LIMITED'
310577         2017-09-19               'FA IT SERVICES LIMITED'
2020738        2012-09-03              'THE SPARES SHOP LIMITED'
2776144        2006-02-03         'ANGELVIEW PROPERTIES LIMITED'
2420435        2017-10-17                'SHANE WARD TM LIMITED'
2523165        2014-06-04      'THE INDEPENDENT GIN COMPANY LTD'
2594847        2015-05-05                  'AIA ENGINEERING LTD'
2701395        2015-05-27                'LAURA BRIDGES LIMITED'", header = T)
Maurits Evers
  • 49,617
  • 4
  • 47
  • 68
  • Great, thanks, that's got me a bit closer! With that output, I've managed to find the top 10 for each year using `top_words_by_year <- words_by_year %>% group_by(year) %>% top_n(n = 10, wt = n); `, just need to find a way to change the data frame layout – Ian Mar 25 '18 at 22:27