Questions tagged [pdftools]

An R package for Text Extraction, Rendering and Converting of PDF Documents

Utilities based on 'libpoppler' for extracting text, fonts, attachments and metadata from a PDF file. Also supports high quality rendering of PDF documents into PNG, JPEG, TIFF format, or into raw bitmap vectors for further processing in R.

97 questions
1 vote, 0 answers

pdf_combine() file not searchable

pdf_combine() is a very useful function in the pdftools package for combining separate PDFs into one document. However, it seems that the combined PDF is NOT searchable with Acrobat Reader, even if the separate PDF files themselves are searchable. Search…
Jason
1 vote, 1 answer

Extract text from multiple PDF-files to a structured data table

I am new to this platform and I hope someone can help me. I have imported some PDF files into RStudio using the pdftools library. Now I want to arrange this text into structured columns. I just can't seem to get the structure right. This is an example of…
JorisK
1 vote, 1 answer

R: extract dates and numbers from PDF

I'm really struggling to extract the proper information (some dates and numbers, to be specific) from several thousand PDF files from the NTSB. These PDFs don't need to be OCRed, and each report is almost identical in length and layout. I…
Andrei Niță
1 vote, 0 answers

How to install poppler 0.73.0 and pdftools in Debian?

I have been tirelessly trying to install a more recent version of poppler on my Debian (9.13 stretch) machine. Even though I'm able to compile it, for some reason installing pdftools ends with errors. I would appreciate any help. Here is what I have…
Andres Mora
1 vote, 1 answer

I have two sets of PDFs in different folders that I want to join into one based on matching names, with the output saved in the folder of the first PDF group

I have two folder directories: directory1 <- "C:/Folder1/" and directory2 <- "C:/Folder2/". Folder 1 contains the files "123456.pdf", "234567.pdf", "345678.pdf", "456789.pdf". Folder 2 contains the files "123456_Jon.pdf", "234567_Mike.pdf",…
user35131
1 vote, 0 answers

pdftools::pdf_text() error reading in file

I am having an issue in R/RStudio reading in a PDF file with the pdftools::pdf_text() function. dat <- pdf_text("Summary Payroll Register BY ENTITY SM HLM ONLY 081321.pdf") Error in normalizePath(path.expand(path), winslash, mustWork) :…
Chris Kiniry
1 vote, 0 answers

Using R to read checkbox values in PDF files

I have a number of PDF files with data in checkbox form. I need to read these checkbox values (selected/not selected), but I am unable to figure out how to do this in R. Any help would be greatly appreciated. A sample PDF is here.
callivdw
1 vote, 1 answer

Cleaning downloaded pdf dataset in R

I have downloaded the PDF file from this site (from the Table tab) and want to clean the dataset in R and convert it to a CSV or Excel file. I am using the pdftools package and have downloaded the other required packages. I want to focus on the data for…
OGC
1 vote, 1 answer

Read Multiple PDFs into a dataframe in R

I have a folder of PDFs, for example foo1.pdf, foo2.pdf, foo3.pdf. I would like to read these PDFs in RStudio and create a data frame with 2 columns for the document name and the corresponding text. For example: Document <- c("foo1","foo2","foo3") …
R noob
1 vote, 1 answer

Read PDF table into R where rows have varying numbers of lines

I'm hoping to read the following PDF into a tidy data frame within R: PDF Table. The table even stretches across 70+ pages. I am adept at reading in tables where each cell has one line, but I'm not sure how to extend that knowledge to cases where…
Trent
1 vote, 1 answer

The text is not recognized from png using Tesseract

I have to pull data from a PDF uploaded at a URL. The PDF is in an image (.png) format, so when using the tesseract package a few of the lines were not recognized. The…
1 vote, 1 answer

Filename too long when using keyword_search to detect pdf?

I am trying to do some text mining of a pdf by searching for certain keywords. This is my code: library(pdftools) library(tidyverse) library(pdfsearch) UC_text <-…
Jane
1 vote, 1 answer

Trying to extract a subset of pages from each pdf in a directory with 70 pdf files

I am using tidyverse, tidytext, and pdftools. I want to parse words in a directory of 70 pdf files. I am using these tools to do this successfully but the code below grabs all the pages instead of the subset I want. I need to skip the first two…
1 vote, 2 answers

pdf_text function not releasing RAM (on Windows)

pdf_text() is not releasing RAM. Each time the function runs, it uses more RAM, and doesn't free it up until the R session is terminated. I am on Windows. Minimal example: # This takes ~60 seconds and uses ~500 MB of RAM, which is then unavailable for…
stevec
0 votes, 0 answers

Fastest way to use R to split long pdf into separate pdfs of n pages each

I have a PDF that is over 6,000 pages long. I would like to split it into separate PDFs that are each 50 pages long (or any other length I choose), and save them to an output folder. I wrote the following code, but it is extremely slow, and took an…
user3710004