Highest Voted 'pdftools' Questions

1

vote

0 answers

pdf_combine() file not searchable

pdf_combine() is a very useful function in the pdftools package for combining separate PDFs into one document. However, it seems that the combined PDF is NOT searchable with Acrobat Reader, even though the separate PDF files are themselves searchable. Search…

r pdf-generation searchable pdftools

asked Feb 03 '22 at 15:21

Jason

11
1

1

vote

1 answer

Extract text from multiple PDF-files to a structured data table

I am new to this platform and I hope someone can help me. I have imported some PDF files into RStudio using the pdftools library. Now I want to split this text into structured columns, but I just can't seem to get the structure right. This is an example of…

r pdf datatable stringr pdftools

asked Jan 27 '22 at 19:53

JorisK

11
1

1

vote

1 answer

R: extract dates and numbers from PDF

I'm really struggling to extract the proper information (some dates and numbers, to be specific) from several thousand PDF files from the NTSB; these PDFs don't need to be OCRed, and each report is almost identical in length and layout. I…

r stringr readr pdftools

asked Jan 20 '22 at 10:40

Andrei Niță

517
1
3
14

1

vote

0 answers

How to install poppler 0.73.0 and pdftools in Debian?

I have been tirelessly trying to install a more recent version of poppler on my Debian (9.13 stretch) machine. Even though I'm able to compile it, for some reason installing pdftools ends with errors. I would appreciate any help. Here is what I have…

r poppler pdftools

asked Oct 22 '21 at 02:49

Andres Mora

1,040
8
16

1

vote

1 answer

I have two sets of PDFs in different folders that I want to join into one based on matching names, with output in the folder of the first PDF group

I have two folder directories: directory1 <- "C:/Folder1/" and directory2 <- "C:/Folder2/". Folder 1 contains the files "123456.pdf", "234567.pdf", "345678.pdf", "456789.pdf". Folder 2 contains the files "123456_Jon.pdf", "234567_Mike.pdf",…

r pdf pdftools

asked Aug 20 '21 at 21:50

user35131

1,105
6
18
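A minimal sketch of one way to approach the question above, assuming each file in the second folder starts with the same ID as its partner in the first folder, followed by an underscore (as in "123456_Jon.pdf"). The helper names `match_by_id` and `combine_matched` and the "_combined.pdf" output suffix are hypothetical, since the excerpt is cut off before the desired output name is given.

```r
# Pair files from two folders by their leading ID.
# Assumption: second-folder names look like "<id>_<anything>.pdf".
match_by_id <- function(files1, files2) {
  id1 <- tools::file_path_sans_ext(basename(files1))
  id2 <- sub("_.*$", "", tools::file_path_sans_ext(basename(files2)))
  idx <- match(id1, id2)          # first matching partner, NA if none
  keep <- !is.na(idx)
  data.frame(
    id = id1[keep],
    first = files1[keep],
    second = files2[idx[keep]],
    stringsAsFactors = FALSE
  )
}

# Combine each matched pair and write the result into the first folder.
# The "_combined.pdf" suffix is a made-up naming choice.
combine_matched <- function(dir1, dir2) {
  pairs <- match_by_id(
    list.files(dir1, pattern = "\\.pdf$", full.names = TRUE),
    list.files(dir2, pattern = "\\.pdf$", full.names = TRUE)
  )
  for (i in seq_len(nrow(pairs))) {
    out <- file.path(dir1, paste0(pairs$id[i], "_combined.pdf"))
    pdftools::pdf_combine(c(pairs$first[i], pairs$second[i]), output = out)
  }
  invisible(pairs)
}
```

Calling `combine_matched("C:/Folder1/", "C:/Folder2/")` would then write one combined file per matched ID. Files without a partner are skipped, and running it twice would also pick up the previously written "_combined.pdf" files as inputs, so a different output folder or suffix filter may be preferable.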

1

vote

0 answers

pdftools::pdf_text() error reading in file

I am having an issue in R/RStudio reading in a PDF file using the pdftools::pdf_text() function. dat <- pdf_text("Summary Payroll Register BY ENTITY SM HLM ONLY 081321.pdf") Error in normalizePath(path.expand(path), winslash, mustWork) :…

r pdftools

asked Aug 12 '21 at 13:37

Chris Kiniry

499
3
13

1

vote

0 answers

Using R to read checkbox values in PDF files

I have a number of PDF files with data in checkbox form. I need to read these checkbox values (selected/not selected), but I am unable to figure out how to do this in R. Any help would be greatly appreciated. A sample PDF is here.

r pdf pdftools

asked Apr 02 '21 at 20:43

callivdw

11
4

1

vote

1 answer

Cleaning downloaded pdf dataset in R

I have downloaded the PDF file from this site (from the Table tab) and want to clean the dataset in R and convert it to a CSV or Excel file. I am using the pdftools package and have downloaded the other required packages. I want to focus on the data for…

r pdftools

asked Jan 16 '21 at 11:50

OGC

244
3
13

1

vote

1 answer

Read Multiple PDFs into a dataframe in R

I have a folder of PDFs for example foo1.pdf, foo2.pdf, foo3.pdf. I would like to read these pdfs in Rstudio and create a dataframe with 2 columns for the document name and the corresponding text. For example: Document <- c("foo1","foo2","foo3") …

r pdftools

asked Nov 04 '20 at 02:33

R noob

495
3
20
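A minimal sketch of one way to build such a data frame, assuming one row per PDF with all pages pasted into a single string. The function name `pdfs_to_df` and the column name `Text` are hypothetical (only `Document` appears in the excerpt).

```r
# Read every PDF in a folder into a two-column data frame.
# pdf_text() returns one string per page; pages are joined with newlines.
pdfs_to_df <- function(dir) {
  files <- list.files(dir, pattern = "\\.pdf$", full.names = TRUE)
  text <- vapply(
    files,
    function(f) paste(pdftools::pdf_text(f), collapse = "\n"),
    character(1)
  )
  data.frame(
    Document = tools::file_path_sans_ext(basename(files)),
    Text = unname(text),
    stringsAsFactors = FALSE
  )
}
```

For a folder holding foo1.pdf, foo2.pdf and foo3.pdf, `pdfs_to_df("path/to/folder")` should give three rows with `Document` values "foo1", "foo2" and "foo3".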

1

vote

1 answer

Read PDF table into R where rows have varying numbers of lines

I'm hoping to read the following PDF into a tidy data frame within R: PDF Table. The table even stretches across 70+ pages. I am adept at reading in tables where each cell has one line, but I'm not sure how to extend that knowledge to cases where…

r pdf pdftools

asked Sep 10 '20 at 18:11

Trent

771
5
19

1

vote

1 answer

The text is not recognized from png using Tesseract

I have to pull data from a PDF uploaded at a URL. The PDF is in an image (.png) format, so when using the tesseract package a few of the lines were not recognized. The…

image-processing ocr tesseract pdftools propensity-score-matching

asked Apr 06 '20 at 07:13

Ami

197
1
12

1

vote

1 answer

Filename too long when using keyword_search to detect pdf?

I am trying to do some text mining of a pdf by searching for certain keywords. This is my code: library(pdftools) library(tidyverse) library(pdfsearch) UC_text <-…

r text-mining pdftools

asked Feb 15 '20 at 01:32

Jane

385
4
11

1

vote

1 answer

Trying to extract a subset of pages from each pdf in a directory with 70 pdf files

I am using tidyverse, tidytext, and pdftools. I want to parse words in a directory of 70 pdf files. I am using these tools to do this successfully but the code below grabs all the pages instead of the subset I want. I need to skip the first two…

r pdf tidyverse tidytext pdftools

asked Oct 18 '19 at 19:26

Craig Byron

23
7
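A minimal sketch of dropping leading pages before further parsing, assuming the truncated "skip the first two…" refers to the first two pages of each file. The helper names `drop_first_pages`, `read_pdf_body` and `read_dir_bodies` are hypothetical.

```r
# Drop the first n_skip elements of a per-page character vector.
# Works for short documents too: fewer pages than n_skip gives character(0).
drop_first_pages <- function(pages, n_skip = 2) {
  pages[seq_along(pages) > n_skip]
}

# pdf_text() returns one string per page, so subsetting happens after reading.
read_pdf_body <- function(path, n_skip = 2) {
  drop_first_pages(pdftools::pdf_text(path), n_skip)
}

# Apply to every PDF in a directory; returns a list named by file name.
read_dir_bodies <- function(dir, n_skip = 2) {
  files <- list.files(dir, pattern = "\\.pdf$", full.names = TRUE)
  bodies <- lapply(files, read_pdf_body, n_skip = n_skip)
  names(bodies) <- basename(files)
  bodies
}
```

The resulting list can then be turned into a tibble and passed on to tidytext tokenization in the usual way.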

1

vote

2 answers

pdf_text function not releasing RAM (on Windows)

pdf_text() is not releasing RAM. Each time the function runs, it uses more RAM, and doesn't free it up until the R session is terminated. I am on Windows. Minimal example # This takes ~60 seconds and uses ~500 MB of RAM, which is then unavailable for…

r pdftools

asked Jun 22 '19 at 13:04

stevec

41,291
27
223
311

0

votes

0 answers

Fastest way to use R to split long pdf into separate pdfs of n pages each

I have a PDF that is over 6,000 pages long. I would like to split it into separate PDFs that are each 50 pages long (or any other length I choose), and save them to an output folder. I wrote the following code, but it is extremely slow, and took an…

r pdf pdftools

asked Aug 16 '23 at 18:33

user3710004

511
1
6
15
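A minimal sketch of chunked splitting, assuming `pdf_subset()` (re-exported by pdftools from qpdf) is called once per chunk rather than once per page. The helper names `page_chunks` and `split_pdf` and the "part_0001.pdf" naming scheme are hypothetical, and no claim is made that this is the fastest approach.

```r
# Group page numbers 1..n_pages into consecutive chunks of at most chunk_size.
page_chunks <- function(n_pages, chunk_size = 50) {
  pages <- seq_len(n_pages)
  unname(split(pages, ceiling(pages / chunk_size)))
}

# Write each chunk of pages to its own file in out_dir.
split_pdf <- function(input, out_dir, chunk_size = 50) {
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
  n_pages <- pdftools::pdf_info(input)$pages
  chunks <- page_chunks(n_pages, chunk_size)
  for (i in seq_along(chunks)) {
    out <- file.path(out_dir, sprintf("part_%04d.pdf", i))
    pdftools::pdf_subset(input, pages = chunks[[i]], output = out)
  }
  invisible(length(chunks))
}
```

For a 6,000-page document and the default chunk size, `split_pdf("big.pdf", "output")` would make 120 calls to `pdf_subset()`, one per output file.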
