Questions tagged [pdftools]

An R package for Text Extraction, Rendering and Converting of PDF Documents

Utilities based on 'libpoppler' for extracting text, fonts, attachments and metadata from a PDF file. Also supports high quality rendering of PDF documents into PNG, JPEG, TIFF format, or into raw bitmap vectors for further processing in R.

97 questions
0
votes
1 answer

Error handling when using pdftools in a loop

I am trying to extract certain tables from multiple PDF files, but not all of the files contain that table. How can I use tryCatch or similar to skip a file and proceed to the next one even if the first file does not contain the certain…
Jane
  • 385
  • 4
  • 11
0
votes
1 answer

Using pdftools in R to extract specific table after a string

I have a couple of PDFs and I wish to extract the shareholders table. How can I specify that only the table appearing after the string 'TWENTY LARGEST SHAREHOLDERS' is extracted? I tried but was not quite sure of the function…
Jane
  • 385
  • 4
  • 11
0
votes
1 answer

R - Show data in a data frame

With the code below I extract data from a PDF file using pdftools: library(pdftools) library(readr) download.file("https://www.stoxx.com/document/Reports/SelectionList/2020/August/sl_sxebmp_202008.pdf","sl_sxebmp_202008.pdf", mode = "wb") txt <-…
CarlosFC
  • 43
  • 6
0
votes
1 answer

I want to convert a PDF to an image, but I only want a single output image which contains all the images and vector graphics only. I do not want text

Please suggest how I can achieve this with PDFBox. I tried the code below: try { PDDocument document = PDDocument.load(new File(inputFilePath)); PDFRenderer pdfRenderer = new PDFRenderer(document); for (int page = 0; page <…
0
votes
1 answer

Syntax Error in R when adding a loop to read multiple pdf pages

Can anyone help me find the mistake in this piece of code? This is what I am getting: "Error: unexpected '}' in " }"". If I run only the chunk under the loop, everything is fine, but I need this to be processed for 50 pages and…
0
votes
0 answers

PDF to DataFrame from multiple pages R

I want to create a full dataframe from a PDF that contains 50 pages. I was able to generate one data frame from a single page by removing the titles, but now I need to generate one dataframe for the entire 50 pages, ignoring the titles. This…
0
votes
2 answers

How to get an output file name exactly the same as the input file name in R: what should the filename formatting be in pdfconverter in R

I am trying to output the first page of a PDF to PNG using the "pdf_convert" function in the pdftools library. I get the PNG, but the output file name has the form "image(page number).png". How can I get an output file name that exactly matches the input file name? PDF name:-…
0
votes
1 answer

Recursively (many subdirs) find PDF files and merge into one PDF file (Linux, bash)

Surprisingly I have seen many help pages on how to do this, from the same directory. Those that are recursively used don't seem to work for me (the tries below), or require complications I don't want to utilize as I don't understand them (even worse…
nate
  • 269
  • 2
  • 11
0
votes
0 answers

Does not comply with PDF/A when signing a document through iText 5.5.5

I am working on converting a PDF to PDF/A. I already did this conversion through a paid PDFTools library, and I submit the result of the conversion to this page, which is responsible for validating whether it complies with the PDF/A standard…
0
votes
1 answer

lapply for pdf file in folder in R

I would like to read all the .pdf files on the desktop, but when I typed the code below, it showed path_mot <- list.files("/Users/wangoe2345/Desktop", "*.pdf") as.list(path_mot) mot <- lapply(path_mot, pdftools::pdf_text) Error in…
Ellen
  • 1
0
votes
0 answers

Locate starting coordinates of a table in R

I am trying to extract information from a portion of a table in R. Example table below... This is just a simple example compared to what I am really dealing with. I am working with a very large table that has a very strange structure and changes…
AyeTown
  • 831
  • 1
  • 5
  • 20
0
votes
1 answer

Why does pdf_text from pdftools read only the first page of each PDF element in my list of PDFs?

I would like to create a dataframe with all the text and the title of each PDF in my list of PDFs. I made one for loop, but when I open the resulting dataframe I see that not all the text from each PDF has been processed into text, but only the last…
flavinsky
  • 309
  • 4
  • 13
0
votes
1 answer

R: cleaning pdf text

I have PDF text that I need converted into "tidy" format, but I'm unsure how to read in the PDF text without compromising the information I need. For example: # install pacman package if you require it if (!require("pacman"))…
dano_
  • 303
  • 1
  • 8
0
votes
1 answer

How to land on the bitstream URL from the href link of an HTML page

I am using rvest R package to scrape a PDF file from this webpage but the final link is exposed (as a bitstream url - whatever it is) after I click on the exposed url by name AC1-96-21-01-2011.pdf. The final pdf file is tucked in here hidden from…
Lazarus Thurston
  • 1,197
  • 15
  • 33
0
votes
0 answers

R for loop with file path character list only runs on first file

I have a for loop in R and a character list including the pdf files I am trying to extract data from using the tabulizer package. pdf_list <- list.files("/path") for (i in 1:length(pdf_list)){ extract_tables(paste(pdf_list[i])) ->df …
maribou912
  • 13
  • 3