Highest Voted 'pdftools' Questions

2

votes

1 answer

Multi column PDF text without tables and footnotes

I'm dealing with PDFs in my research and I wrote an R scraper for some text data. Everything works fine and I can read the data…

r text-mining pdftools

asked Sep 10 '21 at 10:20

Martin

312
2
15

2

votes

0 answers

R not allowing to write documents, Error in cpp_pdf_select(input, output, pages, password)

I'm trying to use the following code, which is supposed to order the pages of a PDF: ordenar <- function(archivo) { library(pdftools) nombre <- sapply(strsplit(sapply(strsplit(pdftools::pdf_text(archivo), …

r pdftools

asked Jun 21 '21 at 22:48

Fer

43
3

2

votes

1 answer

Extract Text from a pdf only English text Canadian Legislation R

I'm trying to extract data from a Canadian Act for a project (in this case, the Food and Drugs Act) and import it into R. I want to break it up into two parts: first, the table of contents (pic 1); second, the information in the act (pic 2). But I do…

r pdftotext tabulizer pdftools

asked Feb 27 '21 at 03:52

Alex Betsos

71
6

2

votes

1 answer

how to change tesseract config to recognize § and apply with pdftools::pdf_ocr_text in R?

I am using pdftools in R to extract text from both scanned and text-based PDF files. One problem is with the § character, which is not recognized by tesseract. I looked at the following links: CRAN tesseract package vignette, SO link of a similar…

r ocr tesseract pdftools

asked Dec 01 '20 at 15:33

maop

194
14

2

votes

1 answer

Extract text well from a PDF with two columns in R

I am trying to extract the text of companies' annual reports. Most of them use a two-column layout, so I don't know how to extract the text correctly: in R, with the pdftools package, I extract the first line of the first column…

r pdf text-mining pdftools

asked Sep 18 '20 at 12:03

David Perea

139
3
12

2

votes

1 answer

How to extract specific parts of messy PDFs in R?

I need to extract specific parts of a large corpus of PDF documents. The PDFs are large and messy reports containing all kinds of numeric, alphabetic and other information. The files are of different lengths but have unified content and sections…

r pdf text nlp pdftools

asked Aug 06 '20 at 14:19

Niki

23
3

2

votes

1 answer

Reading a Fixed-Width Multi-Line File in R

I have data from a PDF file that I am reading into R. library(pdftools) library(readr) library(stringr) library(dplyr) results <- pdf_text("health_data.pdf") %>% readr::read_lines() When I read it in with this method, a character vector is…

r readr pdftools

asked Apr 26 '20 at 22:16

daneshjai

858
3
10
17

2

votes

2 answers

How to Remove "|" Without Leaving Space from the List in R

I am using pdftools to extract data from a scanned file by converting it to PNG first. Because pdftools reads from the PNG, some stray punctuation marks show up for no reason. I can remove most of them except for "|". Here is my data: c("|…

r tesseract pdftools

asked Mar 23 '20 at 23:16

Guolin Zhang

23
4

2

votes

0 answers

Read flowdiagram as sequential text in R

I have a flow diagram as a PDF. I want to extract the text as a sequential array/vector in R. Is there an efficient way to do this? As an example, I am looking at whether we can have a vector: 1. Start App 2. Speech Input 3. HTTP POST Request .. ...

r dplyr pdftools

asked Dec 12 '19 at 07:29

NinjaR

621
6
22

2

votes

0 answers

subscript out of bounds. Extracting PDF

I am extracting text from a PDF, removing punctuation, and looking at key repeated words and how often they appear. library(pdftools) library(tm) setwd("S:/Shared Folders/Impact Investing/Investment/Scripts/PDF") files <- list.files(pattern =…

r pdftools

asked Sep 26 '19 at 07:05

Will

35
5

1

vote

1 answer

R pdftools returning different units for PDF text coordinates

In the package pdftools, there are two functions pdf_data() (which works on pre-OCR'd PDF files) and pdf_ocr_data() (which will OCR a PDF file regardless of whether it is already OCR'd or not). pdf_data() results in a list of tibbles, each with 6…

r pdf coordinates ocr pdftools

asked Jun 13 '23 at 16:56

pseudorandom

142
1
1
10

1

vote

1 answer

Issue with staplr package in R: set_fields returns the error 'Error: All unnamed arguments must be length 1'

I have been trying to generate a pdf file in R using the staplr package. However I have been running into issues whilst trying to run the example code. I keep getting an error (All unnamed arguments must be length 1) when trying to use the…

r installation pdftools

asked Jan 25 '23 at 01:40

mouldyeclair

11
2

1

vote

1 answer

How to convert raw lines to df

I need to read a df from a PDF file, and here is an example table. So far I was able to read the data as raw lines with the following chunk: library(pdftools) library(tidyverse) pdf_file <- pdf_text("exm.pdf") raw_df <- pdf_file %>% read_lines()…

r pdf text-mining pdftools

asked Oct 19 '22 at 09:20

Satya Pamidi

143
8

1

vote

1 answer

Load only the names of many pdfs and make data frame

I need to obtain the names of a large set of PDF files (36,000 files), but only the names, without loading the full objects. Finally, I want to make a data frame like this: The link to 21 example…

r tidyverse pdftools

asked Aug 26 '22 at 20:37

Miguel Angel Acosta Chinchilla

162
8

1

vote

1 answer

separating multiple columns into more columns

The text from a pdf I scraped is jumbled up in different elements. Not to mention, it deleted data when it was converted to a data frame. It's really hard to tell where the text should have been split since it seems like I got it correct in the…

r dataframe split multiple-columns pdftools

asked Jun 06 '22 at 00:03

bandcar

649
4
11
