I'm importing PDF files into R in order to do some text analysis. I have a number of PDF files, each named after its publication year (one publication per year).
After importing them, I would like to create a TermDocumentMatrix in which the "Docs" labels (i.e. the column names of the tdm) show the year of publication rather than the number of the document. At the moment the tdm labels the documents 1, 2, 3, etc. when I create it.
Any ideas on how to do it? My code is below.
Thanks!
library(pdftools)  # provides pdf_text()
library(tm)        # provides Corpus(), VectorSource(), TermDocumentMatrix()

# create the list of PDF files to be picked up (from the working directory)
files <- list.files(pattern = "pdf$")
# read the PDF files from the list (number of pages in brackets in front)
new_files <- sapply(files, pdf_text)
# create the corpus
new_corp <- Corpus(VectorSource(new_files))
# build the term-document matrix; keep only terms that appear in at least 2 documents
IMF_tdm <- TermDocumentMatrix(new_corp, control = list(removePunctuation = TRUE,
                                                       stopwords = TRUE,
                                                       tolower = TRUE,
                                                       stemming = TRUE,
                                                       removeNumbers = TRUE,
                                                       bounds = list(global = c(2, Inf))))
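
One direction that might work, sketched here on toy data rather than on the actual PDFs: overwrite the document labels once the matrix has been built. The `texts` vector and its file names are made up for illustration, and the sketch assumes that the year can be recovered simply by stripping the extension from each file name with `tools::file_path_sans_ext()`.

```r
library(tm)

# Hypothetical stand-in for the imported PDFs: one text per file,
# named after the file it came from (as sapply() names its result).
texts <- c("2001.pdf" = "growth and inflation outlook",
           "2002.pdf" = "inflation and fiscal policy")

corp <- Corpus(VectorSource(texts))
tdm  <- TermDocumentMatrix(corp)

# Derive the year from each file name by dropping the ".pdf" extension.
years <- tools::file_path_sans_ext(names(texts))

# Replace whatever document labels tm assigned with the years.
colnames(tdm) <- years

Docs(tdm)     # document labels of the matrix
inspect(tdm)  # the Docs columns are now labelled by year
```

With the real data, the same idea would be `colnames(IMF_tdm) <- tools::file_path_sans_ext(files)`, provided the columns of the matrix are in the same order as `files`.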