Questions tagged [pdftools]

An R package for Text Extraction, Rendering and Converting of PDF Documents

Utilities based on 'libpoppler' for extracting text, fonts, attachments and metadata from a PDF file. Also supports high quality rendering of PDF documents into PNG, JPEG, TIFF format, or into raw bitmap vectors for further processing in R.

97 questions
0
votes
1 answer

How to change tesseract's Page Segmentation Method (PSM) using R?

I would like to read a scanned PDF document into R using tesseract. In general, this already works quite well, but I have problems when the documents have a table structure. After some time of research I found out that there is a parameter to set…
RKF
0
votes
1 answer

Do I need to use RSelenium to download these PDFs?

I am trying to use rvest and pdftools to go through this page and download the PDFs. I'm having trouble using a CSS selector to do this, and I'm wondering if this might take a webdriver? Also, is it easy enough to use a webdriver to do this in R - as a…
paulson
0
votes
1 answer

How do I combine some vector elements in the same vector using R?

I extracted a table from a PDF using pdftools in R. The table in the PDF has multi-line text in its columns. I replaced runs of more than 2 spaces with "|" so that it's easier. But the problem I'm running into is that because of the multi-line and…
user1828605
0
votes
1 answer

How to systematically extract data from a textbook

{edited} Hi everyone! I'm attempting to systematically extract data from a textbook (PDF). Because this task doesn't easily translate to a reproducible example, I'm providing 2 pages from the book as an example here. These two pages contain a list of…
0
votes
1 answer

How to convert all pages of a PDF into a single-page PDF document in R

I have tried exploring pdftools. It does have a pdf_combine() function, which stitches multiple PDFs into one. However, it doesn't help combine multiple pages of a PDF document into one page.
0
votes
0 answers

How to group and aggregate a data.table based on a range of a variable in r

I have this output from the pdftools pdf_data() for a page of the financial statements of a town. Unfortunately, in rare cases, the capture of a line y is slightly off, as shown below. I would like to be able to group on y including cases where y is…
David Lucey
0
votes
0 answers

How to save a .pdf file with the correct filename if special characters are used in pdftools::pdf_subset(), R

I hope someone can help me. I use pdf_subset() from the pdftools package to select some pages from a .pdf file and save them in a new .pdf file. However, there is a problem: my path/filename contains special characters (Polish letters) which are replaced by…
0
votes
1 answer

Reading PDF portfolio in R

Is it possible to read/convert PDF portfolios in R? I usually use pdftools, however, I get an error: library(pdftools) #> Using poppler version 0.73.0 link <-…
ava
0
votes
3 answers

creating a loop for "load" and "save" processes

I have a data.frame (dim: 100 x 1) containing a list of url links, each url looks something like this: https:blah-blah-blah.com/item/123/index.do . The list (the list is a data.frame called my_list with 100 rows and a single column named col and is…
stats_noob
0
votes
1 answer

How to replace a column value if it starts with the character "N" in R

How to replace if a value of the column (GID) starts with char "N" to ColB if the ColB is empty in a Dataframe in R programming code: DataFile <- extract_tables("new.pdf",pages = c(87), method = "stream", output =…
kumar
0
votes
0 answers

How to merge specific columns with the next column without hardcoding in R programming

How to merge columns named "X" with the next column without hardcoding in R programming. X should be merged into Day.7, X.1 should be merged into Day.8, and X.2 and X.3 should be merged into Day.9. Code: library(data.table) library(tabulizer) pdf_file…
kumar
0
votes
2 answers

How to remove column labels if the name of the label starts with "G" in R programming

How to remove column labels if the name of the label starts with "G". Code: library(pdftools) library(data.table) library(tabulizer) pdf_file <- "new.pdf" out2 <- extract_tables(pdf_file, pages =c(89), output =…
kumar
0
votes
1 answer

How to rename a column header as per the next column in R programming

How to rename column headers that have "X", "X.1" or "X.3" values so that they take the next column's header. Code: library(pdftools) library(data.table) library(tabulizer) pdf_file <- "new.pdf" out2 <- extract_tables(pdf_file,…
kumar
0
votes
1 answer

Scraping PDF in R with Nested Information

I am attempting to scrape a rather difficult PDF in R using both pdftools::pdf_text and tabulizer::extract_tables. However, in my situation, neither of these seems to be too helpful based on the nature of the PDF. The PDF contains "nested"…
mikeytop
0
votes
1 answer

Ways to extract images from a PDF using R

Is there a way to extract images from a PDF using R and save them into a folder? There are a lot of similar questions regarding other programming languages and there is apparently a way to do this in Python, was wondering if the same work can be…
Bahi8482