R: Text Mining, create list of words per document

Question

I am reading in the text from a number of PDFs in a directory. Then I split the text into single words (tokens) using tidytext::unnest_tokens(). How can I add a column to the test tibble that holds the name of the file each word comes from?

library(pdftools)
library(tidyverse)
library(tidytext)

files <- list.files(pattern = "pdf$")
content <- lapply(files, pdf_text)
list <- unlist(content, recursive = TRUE, use.names = TRUE)
df = data.frame(text = list)

test <- df %>% tidytext::unnest_tokens(word, text)

score 2 · Accepted Answer · answered Aug 05 '21 at 23:59

You can try the following. Rather than calling unlist on the text from all the files, pass the vector of file names to map_df from purrr and process each file separately. That way you can add a filename column alongside the word column.

library(pdftools)
library(tidyverse)
library(tidytext)

files <- list.files(pattern = "pdf$")

map_df(files, ~ data.frame(txt = pdf_text(.x)) %>%
         mutate(filename = .x) %>%
         unnest_tokens(word, txt))
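
To check the shape of the result without any PDFs at hand, here is a minimal sketch of the same pattern. The `pages` list and its file names are made up for illustration; it only stands in for what pdf_text() returns (one string per page).

```r
library(dplyr)
library(purrr)
library(tidytext)

# Hypothetical stand-in for pdf_text() output: one character string per page
pages <- list(
  "a.pdf" = c("Hello world", "Second page here"),
  "b.pdf" = c("Another document")
)

# Same idea as above: one data frame per file, tag it with the file name,
# then split the text into one word per row
words <- map_df(names(pages), ~ data.frame(txt = pages[[.x]]) %>%
                  mutate(filename = .x) %>%
                  unnest_tokens(word, txt))

# Number of words found in each file
count(words, filename)
```

Each row of `words` holds a single token together with the file it came from, so grouping or filtering by filename works as usual.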

score 1 · Answer 2 · answered Aug 05 '21 at 23:59

You could do:

files <- list.files(pattern = "pdf$")
# stack() returns the columns `values` (page text) and `ind` (file name)
content <- stack(sapply(files, pdf_text, simplify = FALSE))
content %>% 
   tidytext::unnest_tokens(word, values)
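
For reference, stack() on a named list returns a two-column data frame: `values` holds the text and `ind` holds the list name (here, the file name) as a factor. A small base R sketch, using a made-up `pages` list in place of real pdf_text() output:

```r
# Hypothetical stand-in for sapply(files, pdf_text, simplify = FALSE)
pages <- list(
  "a.pdf" = c("page one", "page two"),
  "b.pdf" = "only page"
)

stacked <- stack(pages)

# Inspect the column names and the per-page file labels
names(stacked)
as.character(stacked$ind)
```

Since the text ends up in `values`, that is the column to hand to unnest_tokens(), and `ind` can be renamed to something like filename afterwards.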

answered Aug 05 '21 at 23:59

Onyambu


score 1 · Answer 3 · answered Aug 06 '21 at 01:26

The plyr package has a nice function for binding a list into a data frame and using the list names as a new column:

library(pdftools)
library(plyr)
library(tidyverse)
library(tidytext)

files <- list.files(pattern = "pdf$")
content <- lapply(files, pdf_text) 
# set list names according to files
names(content) <- files

# bind the named list into one data frame with ldply from plyr; the list names
# become the filename column (unlist() would alter them, e.g. "a.pdf1", "a.pdf2")
plyr::ldply(content, function(pages) data.frame(text = pages), .id = "filename")
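
One detail worth knowing when flattening a named list: unlist() appends an index to the name of every entry that has more than one element, so multi-page PDFs no longer carry their plain file name. A small base R sketch with a made-up `pages` list (standing in for the named pdf_text() results):

```r
# Hypothetical stand-in for the named list of pdf_text() results
pages <- list(
  "a.pdf" = c("page one", "page two"),
  "b.pdf" = "only page"
)

# Flattening keeps names, but multi-element entries get a numeric suffix
flat <- unlist(pages, use.names = TRUE)
names(flat)

# Repeating each list name once per page gives clean file names instead
filename <- rep(names(pages), lengths(pages))
data.frame(filename = filename, text = unname(flat))
```

Building the filename vector with rep() and lengths() keeps one clean label per page, whatever the page count.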
