I am reading in the text from a number of PDFs in a directory.
Then, I split these texts into single words (tokens) using the tidytext::unnest_tokens() function.
Can someone please tell me how I can add an additional column to the test tibble with the name of the file each word comes from?
library(pdftools)
library(tidyverse)
library(tidytext)
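# All PDF files in the working directory
files <- list.files(pattern = "pdf$")
# pdf_text() returns one character string per page for each file
content <- lapply(files, pdf_text)
# Flatten to a single character vector of pages (the file names are lost here)
pages <- unlist(content, recursive = TRUE, use.names = TRUE)
df <- data.frame(text = pages)
# One row per word
test <- df %>% tidytext::unnest_tokens(word, text)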
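
One possible approach, as a minimal sketch rather than a definitive solution: keep the file names attached to the page texts as list names before flattening, so that they can become a column. The helper name tokens_with_file and its read_fun argument are hypothetical and only used for illustration; read_fun defaults to pdftools::pdf_text so that a different reader can be swapped in.

```r
library(pdftools)
library(tidyverse)
library(tidytext)

# Hypothetical helper: returns one row per word, with the source file name
# in a `file` column. `read_fun` is an assumed parameter; it must return a
# character vector (one element per page) for a given file path.
tokens_with_file <- function(files, read_fun = pdftools::pdf_text) {
  files %>%
    set_names() %>%                            # name each element after its file
    map(read_fun) %>%                          # named list: file -> page texts
    enframe(name = "file", value = "text") %>% # tibble with a list column `text`
    unnest(text) %>%                           # one row per page, file name repeated
    unnest_tokens(word, text)                  # one row per word, `file` is kept
}
```

With the files vector from the code above, this would be called as test <- tokens_with_file(files). Because unnest_tokens() keeps the other columns of its input, the file column is carried along for every word.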