I'm trying to extract a line of text from the first page of each multi-page PDF in a list of PDFs. I want to get the text into a data frame so I can extract each PDF's author, which appears on the first page and is preceded by the same word in every document.
I found the resource below by Packt Publishing, which gets very close to what I'm trying to do, but when I run the for loop (copied and pasted, with my own object names plugged in), R throws the error shown below the loop.
For loop:
text_df <- data.frame(matrix(ncol=2, nrow=0))
colnames(text_df) <- c("pdf title", "text")
for (i in 1:length(vector)) {
  print(i)
  pdf_text(paste("folder/", vector[i], sep = "")) %>%
    strsplit("\n") -> document_text
  data.frame("pdf title" = gsub(x = vector[i], pattern = ".pdf", replacement = ""),
             "text" = document_text, stringsAsFactors = FALSE) -> document
  colnames(document) <- c("pdf title", "text")
  text_df <- rbind(text_df, document)
}
Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, : arguments imply differing number of rows: 50, 60, 11
Could someone help me understand what this error means, or point me to other resources that accomplish what I'm trying to do? Thank you in advance!
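In case it helps anyone answering, here is a small sketch that seems to reproduce the same kind of error without any PDF files. It assumes that pdf_text() returns one string per page, so that strsplit() yields a list with one character vector per page. The `pages` vector is made-up stand-in data, not output from my real files, and the last step (keeping only the first page) is just a guess at a workaround, not something taken from the resource.

```r
# Stand-in for pdf_text() output: one string per page (hypothetical content)
pages <- c("Author Jane Doe\nline 2\nline 3", "line 1\nline 2")

# strsplit() returns a list with one character vector per page
document_text <- strsplit(pages, "\n")
lengths(document_text)  # number of lines on each page

# Passing the whole list to data.frame() makes each page its own column,
# so pages with different line counts trigger the error
result <- tryCatch(
  data.frame("pdf title" = "example", "text" = document_text,
             stringsAsFactors = FALSE),
  error = function(e) conditionMessage(e)
)
print(result)

# Keeping only the first page gives a single character vector,
# which data.frame() accepts (one row per line of that page)
first_page <- document_text[[1]]
document <- data.frame("pdf title" = "example", "text" = first_page,
                       stringsAsFactors = FALSE)
print(document)
```

If this reading is right, the 50, 60, 11 in my error would be the line counts of the three pages of one PDF.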
Resource: https://www.r-bloggers.com/2018/01/how-to-extract-data-from-a-pdf-file-with-r/