Questions tagged [pdftools]

An R package for Text Extraction, Rendering and Converting of PDF Documents

Utilities based on 'libpoppler' for extracting text, fonts, attachments and metadata from a PDF file. Also supports high quality rendering of PDF documents into PNG, JPEG, TIFF format, or into raw bitmap vectors for further processing in R.

97 questions
0 votes, 0 answers

How to read a PDF by blocks rather than by lines in R using "pdftools"?

With the {pdftools} package, we can read a PDF into the R environment. But it reads by lines rather than by blocks, so when there are multiple columns, the result becomes a mess. For example, we would like to have it this way, but it comes out this way. Have tried…
Grec001
0 votes, 0 answers

I am trying to create a corpus using PDF documents

When I run the code, I get the following error: PDF error: Unknown Metadata type: 'XMP' PDF error: Unknown Metadata type: 'XMP' corp <- Corpus(URISource(files), readerControl = list(reader = readPDF)) I have saved all…
0 votes, 1 answer

How can I use R pdftools and stringr to extract the author's name from the first page of multiple PDF files?

I'm trying to extract a line of text from the first page of each multi-page PDF file in a list of PDFs. I'm trying to get the text into a dataframe so I can extract the author of each PDF, which is on the first page and the same word precedes the…
0 votes, 0 answers

Split a PDF file that contains several scanned documents

I have a big PDF file with 100 pages that contains several scanned documents concatenated. I would like to split this big PDF file into smaller ones, each containing one document. Is there a way to detect the start and the end of a document…
raph
0 votes, 0 answers

Writing PDF metadata with R

I would like to alter PDF metadata with R. There are several questions relating to this topic. However, these solutions suggest using pdftk or exiftool, which are not an option for me since I cannot download an .exe file to my machine. In addition,…
Seb
0 votes, 1 answer

How does the "pdf-tools" package override the "dired-find-file" method?

After installing pdf-tools, dired mode opens PDF files with PDFView mode as the major mode. (use-package pdf-tools :ensure t :config (pdf-tools-install t)) How is the pdf-tools package able to accomplish this? The Help for RET key in…
Talespin_Kit
0 votes, 0 answers

Force utf-8 output in R PDF extraction

I've inherited some R code to extract text from PDF documents. The code snippet below is edited for brevity. I'm new at R development and having difficulty finding detailed documentation of the functions I'm using. library(pdftools) in_path =…
Steve
0 votes, 0 answers

Extract table from a PDF with multiple headers (R)

I am trying to get some epidemiological data stored in a publicly available PDF (link). I am just looking at the data on page 9 (right table). What I would like to achieve is to put the data into a table, but since I have many headers, it's…
Daniel AG
0 votes, 1 answer

Read table from PDF with partially filled column using pdftools

I've written a function in R using pdftools to read a table from a PDF. The function gets the job done, but unfortunately the table contains a column for notes, which is only partially filled. As a result, the data in the resulting table is shifted…
0 votes, 0 answers

Use R to convert multiple PDFs to text and put them in a data frame

I have a few hundred PDFs, which I need to convert to text. I do not need to save the text files; instead, I extract certain sentences from the text. I have succeeded in doing so for a single PDF file using pdftools. Now, I need to be…
0 votes, 0 answers

Issues installing pdftools in R

Error installing package pdftools in R server. This is directly related to this one. The first half of the error message I get when I try to install pdftools is as follows: rm -f RcppExports.o bindings.o pdftools.dll …
Tim Wilcox
0 votes, 0 answers

R pdftools: Error in poppler_pdf_text(loadfile(pdf), opw, upw) : PDF parsing failure

I am running into the following error while extracting text from PDF documents (macOS). I have multiple PDF files that I am reading from a folder, parsing, and then writing to a CSV file. This worked fine before, but I can't figure out what…
akang
0 votes, 0 answers

How to randomize PDF page order using pdftools

I am trying to randomize the page order of a 382-page PDF. I've read that the pdftools package may be the way to go, but I'm not sure if it's able to randomize the PDF order. I was thinking of using pdf_subset to split the existing PDF into two and…
jrb
0 votes, 1 answer

Scraping two-column PDF

I am trying to scrape the text of hundreds of PDFs for a project. The PDFs have title pages, headers, footers and two columns. I tried the packages pdftools and tabulizer. However, both have their advantages and disadvantages: the pdf_text() function…
Alexander
0 votes, 1 answer

Converting PDF to text with pdftools in R returning empty string

In the following example, the result is empty for every page in the PDF. library(pdftools) rm(list = ls()) setwd(dirname(rstudioapi::getActiveDocumentContext()$path)) url =…
Aveshen Pillay