Questions tagged [pdftools]

An R package for Text Extraction, Rendering and Converting of PDF Documents

Utilities based on 'libpoppler' for extracting text, fonts, attachments and metadata from a PDF file. Also supports high quality rendering of PDF documents into PNG, JPEG, TIFF format, or into raw bitmap vectors for further processing in R.
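
As a rough illustration of the utilities this description lists, here is a minimal sketch. It assumes the pdftools package is installed; the sample PDF is generated with base R's `pdf()` device purely for the example, and the names `sample_pdf` and `png_file` are hypothetical.

```r
library(pdftools)

# Build a one-page sample PDF with base R so the sketch needs no external file.
sample_pdf <- tempfile(fileext = ".pdf")
pdf(sample_pdf)
plot.new()
text(0.5, 0.5, "Hello pdftools")
dev.off()

# Metadata (page count, PDF version, dates, ...) as a list.
info <- pdf_info(sample_pdf)

# Text extraction: a character vector with one element per page.
pages <- pdf_text(sample_pdf)

# Fonts used in the document, as a data frame.
fonts <- pdf_fonts(sample_pdf)

# Render page 1 into a raw bitmap array for further processing in R.
bitmap <- pdf_render_page(sample_pdf, page = 1, dpi = 72)

# Convert page 1 to a PNG file; pdf_convert() returns the output file name(s).
png_file <- pdf_convert(
  sample_pdf,
  format = "png",
  pages = 1,
  filenames = tempfile(fileext = ".png"),
  dpi = 72,
  verbose = FALSE
)
```

Scanned PDFs contain no text layer, so `pdf_text()` returns empty strings for them; several of the questions below use `pdf_ocr_text()` for that case, which additionally requires the tesseract package.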

97 questions
2
votes
1 answer

Multi column PDF text without tables and footnotes

I'm dealing with PDFs in my research and I wrote an R scraper for some text data. Everything works fine and I can read the data…
Martin
2
votes
0 answers

R not allowing to write documents, Error in cpp_pdf_select(input, output, pages, password)

I'm trying to use the following code, which is supposed to order the pages of a PDF: ordenar <- function(archivo) { library(pdftools) nombre <- sapply(strsplit(sapply(strsplit(pdftools::pdf_text(archivo), …
Fer
2
votes
1 answer

Extract Text from a pdf only English text Canadian Legislation R

I'm trying to extract data from a Canadian Act for a project (in this case, the Food and Drugs Act), and import it into R. I want to break it up into two parts: first, the table of contents (pic 1); second, the information in the act (pic 2). But I do…
2
votes
1 answer

how to change tesseract config to recognize § and apply with pdftools::pdf_ocr_text in R?

I am using pdftools in R to extract text from both scanned and text-based PDF files. One problem is with the § character, which is not recognized by tesseract. I looked at the following links: CRAN tesseract package vignette SO link of a similar…
maop
2
votes
1 answer

Extract text well from a PDF with two columns in R

I am trying to extract the text of companies' annual reports. Most of them use a two-column layout, so I don't know how to extract it correctly: with the pdftools package in R, I extract the first line of the first column…
David Perea
2
votes
1 answer

How to extract specific parts of messy PDFs in R?

I need to extract specific parts of a large corpus of PDF documents. The PDFs are large and messy reports containing all kinds of digital, alphabetic and other information. The files are of different lengths but have unified content and sections…
Niki
2
votes
1 answer

Reading a Fixed-Width Multi-Line File in R

I have data from a PDF file that I am reading into R. library(pdftools) library(readr) library(stringr) library(dplyr) results <- pdf_text("health_data.pdf") %>% readr::read_lines() When I read it in with this method, a character vector is…
daneshjai
2
votes
2 answers

How to Remove "|" Without Leaving Space from the List in R

I am using the pdf tool to extract data from a scanned file by converting it to PNG first. Since the pdf tool reads from the PNG, some punctuation marks show up for no reason. I can remove most of them except for "|". Here is my data: c("|…
2
votes
0 answers

Read flowdiagram as sequential text in R

I have a flow diagram as a PDF. I want to extract the text as a sequential array/vector in R. Is there an efficient way to do this? As an example, I am looking at whether we can have a vector 1. Start App 2. Speech Input 3. HTTP POST Request .. ...
NinjaR
2
votes
0 answers

subscript out of bounds. Extracting PDF

I am extracting text from a PDF, removing punctuation, and looking at key repeated words and how often they appear. library(pdftools) library(tm) setwd("S:/Shared Folders/Impact Investing/Investment/Scripts/PDF") files <- list.files(pattern =…
Will
1
vote
1 answer

R pdftools returning different units for PDF text coordinates

In the package pdftools, there are two functions pdf_data() (which works on pre-OCR'd PDF files) and pdf_ocr_data() (which will OCR a PDF file regardless of whether it is already OCR'd or not). pdf_data() results in a list of tibbles, each with 6…
pseudorandom
1
vote
1 answer

Issue with staplr package in R: set_fields returns the error 'Error: All unnamed arguments must be length 1'

I have been trying to generate a pdf file in R using the staplr package. However I have been running into issues whilst trying to run the example code. I keep getting an error (All unnamed arguments must be length 1) when trying to use the…
1
vote
1 answer

How to convert raw lines to df

I need to read a df from a PDF file, and here is an example table. So far I was able to read the data as raw lines with the following chunk: library(pdftools) library(tidyverse) pdf_file <- pdf_text("exm.pdf") raw_df <- pdf_file %>% read_lines()…
Satya Pamidi
1
vote
1 answer

Load only the names of many pdfs and make data frame

I need to obtain the names of a large set of PDF files (36000 files), but only the names, without loading the whole objects. Finally, I want to make a data frame like this: The link of 21 example…
1
vote
1 answer

separating multiple columns into more columns

The text from a pdf I scraped is jumbled up in different elements. Not to mention, it deleted data when it was converted to a data frame. It's really hard to tell where the text should have been split since it seems like I got it correct in the…
bandcar