If you're interested in faster performance and/or using tidy data principles, then you can avoid using the tm package altogether. Check out this chapter of the book on how to convert back and forth from tidy data structures to a document-term matrix.
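As a rough illustration of that back-and-forth conversion, here is a minimal sketch using `cast_dtm()` and `tidy()` from tidytext. The object names (`tidy_counts`, `austen_dtm`, `tidy_again`) are made up for this example. Note that `cast_dtm()` builds a `DocumentTermMatrix`, a class defined by tm, so tm needs to be installed for this step even if you never call it directly; `cast_sparse()` avoids that dependency.

```r
library(dplyr)
library(tidytext)
library(janeaustenr)

# Tidy format: one row per (book, word) pair with its count
tidy_counts <- austen_books() %>%
  unnest_tokens(word, text) %>%
  count(book, word)

# Tidy -> DocumentTermMatrix (requires the tm package to be installed)
austen_dtm <- tidy_counts %>%
  cast_dtm(book, word, n)

# DocumentTermMatrix -> tidy again, with columns document, term, count
tidy_again <- tidy(austen_dtm)
```

Only nonzero entries come back from `tidy()`, so the result stays compact even when the matrix is very sparse.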
Here is a guide on how to get started with topic modeling. After your data is in memory (I recommend using readr::read_lines() with text files), you would do something like this:
library(tidyverse)
library(tidytext)
library(stm)
#> stm v1.3.5 successfully loaded. See ?stm for help.
#> Papers, resources, and other materials at structuraltopicmodel.com
library(janeaustenr)
austen_sparse <- austen_books() %>% ## austen_books() is like the output of read_lines()
  unnest_tokens(word, text) %>%
  anti_join(stop_words) %>%
  count(book, word) %>%
  cast_sparse(book, word, n) ## cast_sparse() is what converts to a DTM
#> Joining, by = "word"
topic_model <- stm(austen_sparse, K = 12, verbose = FALSE, init.type = "Spectral")
summary(topic_model)
#> A topic model with 12 topics, 6 documents and a 13914 word dictionary.
#> Topic 1 Top Words:
#> Highest Prob: anne, captain, elliot, lady, wentworth, charles, time
#> FREX: elliot, wentworth, walter, anne, russell, musgrove, louisa
#> Lift: acknowledgement, lyme, benwick, henrietta, musgrove, walter, kellynch
#> Score: elliot, wentworth, walter, russell, musgrove, anne, louisa
#> Topic 2 Top Words:
#> Highest Prob: emma, miss, harriet, weston, knightley, elton, jane
#> FREX: weston, knightley, elton, woodhouse, fairfax, churchill, hartfield
#> Lift: _broke_, elton's, bates, elton, emma's, enscombe, fairfax
#> Score: emma, weston, knightley, elton, woodhouse, fairfax, harriet
#> Topic 3 Top Words:
#> Highest Prob: elinor, marianne, time, dashwood, sister, edward, mother
#> FREX: elinor, marianne, dashwood, jennings, willoughby, brandon, ferrars
#> Lift: 1811, dashwoods, jennings's, palmer, barton, berkeley, brandon
#> Score: elinor, marianne, dashwood, jennings, willoughby, lucy, brandon
#> Topic 4 Top Words:
#> Highest Prob: fanny, crawford, miss, sir, edmund, time, thomas
#> FREX: crawford, edmund, bertram, norris, rushworth, mansfield, julia
#> Lift: _allow_, bertram, crawford, crawford's, norris, rushworth, susan
#> Score: fanny, crawford, edmund, thomas, bertram, norris, rushworth
#> Topic 5 Top Words:
#> Highest Prob: catherine, miss, tilney, time, isabella, thorpe, morland
#> FREX: tilney, catherine, thorpe, morland, isabella, allen, henry
#> Lift: abbeys, average, camilla, causeless, closets, convent, cravats
#> Score: catherine, tilney, thorpe, morland, allen, isabella, eleanor
#> Topic 6 Top Words:
#> Highest Prob: elizabeth, darcy, bennet, miss, jane, bingley, time
#> FREX: darcy, bennet, bingley, wickham, collins, lydia, lizzy
#> Lift: _accident_, lucas, bennet, bingley, bourgh, collins, darcy's
#> Score: darcy, elizabeth, bennet, bingley, wickham, collins, lydia
#> Topic 7 Top Words:
#> Highest Prob: catherine, miss, tilney, time, isabella, thorpe, morland
#> FREX: tilney, catherine, thorpe, morland, isabella, allen, henry
#> Lift: affrighted, andrews, average, blaize, camilla, causeless, closets
#> Score: catherine, tilney, thorpe, morland, allen, isabella, eleanor
#> Topic 8 Top Words:
#> Highest Prob: anne, captain, elliot, lady, wentworth, charles, time
#> FREX: elliot, wentworth, walter, anne, russell, musgrove, louisa
#> Lift: alicia, lyme, musgrove, walter, benwick, henrietta, kellynch
#> Score: elliot, wentworth, walter, russell, musgrove, anne, louisa
#> Topic 9 Top Words:
#> Highest Prob: catherine, miss, tilney, time, isabella, thorpe, morland
#> FREX: tilney, catherine, thorpe, morland, isabella, allen, henry
#> Lift: alps, andrews, blaize, france, gloucestershire, heroic, heroine
#> Score: catherine, tilney, thorpe, morland, allen, isabella, eleanor
#> Topic 10 Top Words:
#> Highest Prob: catherine, miss, tilney, time, isabella, thorpe, morland
#> FREX: tilney, catherine, thorpe, morland, isabella, allen, henry
#> Lift: antiquity, france, gloucestershire, heroic, lid, eleanor, eleanor's
#> Score: catherine, tilney, thorpe, morland, allen, isabella, eleanor
#> Topic 11 Top Words:
#> Highest Prob: anne, captain, elliot, lady, wentworth, charles, time
#> FREX: elliot, wentworth, walter, anne, russell, musgrove, louisa
#> Lift: archibald, lyme, walter, benwick, henrietta, kellynch, musgrove
#> Score: elliot, wentworth, walter, russell, musgrove, anne, louisa
#> Topic 12 Top Words:
#> Highest Prob: catherine, miss, tilney, time, isabella, thorpe, morland
#> FREX: tilney, catherine, thorpe, morland, isabella, allen, anyone's
#> Lift: anyone's, eleanor, eleanor's, heroine, northanger, thorpe's, thorpes
#> Score: catherine, tilney, thorpe, morland, allen, anyone's, isabella
tidy(topic_model)
#> # A tibble: 166,968 x 3
#>    topic term      beta
#>    <int> <chr>    <dbl>
#>  1     1 1     1.18e- 4
#>  2     2 1     1.15e-19
#>  3     3 1     5.51e- 5
#>  4     4 1     1.33e-19
#>  5     5 1     4.20e- 5
#>  6     6 1     2.68e- 5
#>  7     7 1     4.20e- 5
#>  8     8 1     1.18e- 4
#>  9     9 1     4.20e- 5
#> 10    10 1     4.20e- 5
#> # … with 166,958 more rows
Created on 2020-03-25 by the reprex package (v0.3.0)
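The tidied output is usually more useful after some summarizing. The following is a sketch, not part of the reprex above, that refits the same model and then pulls out the five most probable terms per topic and the per-document topic proportions. The names `top_terms` and `doc_topics` are just for illustration, and `slice_max()` assumes dplyr 1.0.0 or later. Depending on your package versions, the `term` column may hold words or integer indices as in the output above.

```r
library(dplyr)
library(tidytext)
library(stm)
library(janeaustenr)

# Same preprocessing as above, with the join column stated explicitly
austen_sparse <- austen_books() %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  count(book, word) %>%
  cast_sparse(book, word, n)

topic_model <- stm(austen_sparse, K = 12, verbose = FALSE, init.type = "Spectral")

# Per-topic word probabilities (beta): keep the five highest per topic
top_terms <- tidy(topic_model) %>%
  group_by(topic) %>%
  slice_max(beta, n = 5, with_ties = FALSE) %>%
  ungroup()

# Per-document topic proportions (gamma), labeled with the book titles
doc_topics <- tidy(topic_model, matrix = "gamma",
                   document_names = rownames(austen_sparse))
```

With only six documents and `K = 12`, several topics end up nearly identical, as the summary shows, so a smaller `K` is probably a better starting point for a corpus this size.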