I have a folder of PDFs, for example foo1.pdf, foo2.pdf, and foo3.pdf.

I would like to read these PDFs in RStudio and create a data frame with two columns: the document name and the corresponding text. For example:

Document <- c("foo1","foo2","foo3")
Text <- c("text in foo1", "text in foo2", "text in foo3")
DF <- data.frame(Document, Text)

What I have tried so far without success:

setwd("path to files")
library(pdftools)
files <- list.files(pattern="pdf$", full.names=TRUE)
filestext <- lapply(files, pdf_text)
filestextDF <- as.data.frame(matrix(filestext,ncol =2,byrow = F))
names(filestextDF) <- c("Document", "Text")

How can I achieve this?

Ronak Shah
R noob

1 Answer

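pdf_text returns one string per page, so a multi-page PDF gives several strings rather than one. You can combine the text from each PDF into a single string using paste0 with collapse, and then create a data frame with each file name and its corresponding text.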

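library(pdftools)
files <- list.files(pattern = "pdf$", full.names = TRUE)  # as in the question
filestextDF <- data.frame(
  # file name without directory or extension, e.g. "foo1"
  Document = tools::file_path_sans_ext(basename(files)),
  # pdf_text() returns one string per page; join them into one per file
  Text = sapply(files, function(x) paste0(pdf_text(x), collapse = " "),
                USE.NAMES = FALSE))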
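
To see the collapsing step without any PDFs on disk, here is a minimal base-R sketch. The `pages` list is a made-up stand-in for what `lapply(files, pdf_text)` would return (one character vector per file, one element per page); the page text and the file paths are assumptions for illustration only, not real output.

```r
# Hypothetical file paths, shaped like list.files(full.names = TRUE) output.
files <- c("./foo1.pdf", "./foo2.pdf", "./foo3.pdf")

# Hypothetical stand-in for lapply(files, pdf_text):
# one character vector per file, one element per page.
pages <- list(
  c("page 1 of foo1", "page 2 of foo1"),
  c("only page of foo2"),
  c("page 1 of foo3", "page 2 of foo3", "page 3 of foo3")
)

# Collapse each file's pages into a single string, giving one value per file.
text <- vapply(pages, paste, character(1), collapse = " ")

# One row per document: name without directory or extension, plus its text.
DF <- data.frame(
  Document = tools::file_path_sans_ext(basename(files)),
  Text = text,
  stringsAsFactors = FALSE
)
```

The attempt in the question probably fails for this reason: `matrix(filestext, ncol = 2)` reshapes a list of per-page vectors of different lengths instead of pairing each file name with one string.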
Ronak Shah