Questions tagged [pdftotext]

Pdftotext converts Portable Document Format (PDF) files to plain text.

pdftotext is a command-line utility for converting PDF files to plain text files, i.e. extracting raw text from PDF files.

pdftotext is freely available and included by default with many Linux distributions, and is also available for Windows as part of the Xpdf Windows port. Poppler, which is derived from Xpdf, also includes an implementation of pdftotext, which is shipped as part of the poppler-utils package on most major Linux distributions.

However, there are also other CLI-based PDF text extraction tools with a similar or identical name. While they (for the most part) work in the same way, they may give different results. So, only use this tag for CLI-based pdftotext tools and variants, and make sure to state your specific version and environment.

Do not use this tag if you use a different extraction tool, e.g. a GUI-based PDF-to-text converter, an online PDF-to-text converter, or another (commercial) tool.
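
As an illustration of the points above, here is a minimal sketch of calling pdftotext from Python and collecting the version text to include in a question. The helper names (build_pdftotext_command, extract_text, describe_pdftotext) are hypothetical and not part of any library. The flags used (-enc, -layout, -raw, -f, -l, -v) and the "-" output name for stdout are, to the best of my knowledge, accepted by both the Xpdf and Poppler implementations, but check `man pdftotext` on your own system since variants differ.

```python
import shutil
import subprocess
from typing import List, Optional


def build_pdftotext_command(
    pdf_path: str,
    txt_path: str = "-",
    layout: bool = False,
    raw: bool = False,
    first_page: Optional[int] = None,
    last_page: Optional[int] = None,
    encoding: str = "UTF-8",
) -> List[str]:
    """Build an argument list for pdftotext (hypothetical helper).

    A txt_path of "-" asks pdftotext to write the text to stdout.
    """
    command = ["pdftotext", "-enc", encoding]
    if layout:
        command.append("-layout")  # try to keep the physical page layout
    if raw:
        command.append("-raw")     # keep content-stream order
    if first_page is not None:
        command += ["-f", str(first_page)]
    if last_page is not None:
        command += ["-l", str(last_page)]
    command += [pdf_path, txt_path]
    return command


def extract_text(pdf_path: str, **options) -> str:
    """Run pdftotext on one file and return the extracted text."""
    command = build_pdftotext_command(pdf_path, "-", **options)
    result = subprocess.run(
        command, capture_output=True, encoding="utf-8", check=True
    )
    return result.stdout


def describe_pdftotext() -> Optional[str]:
    """Return the tool's self-reported version text, or None if not installed.

    The exit status is deliberately not checked, on the assumption that
    implementations may differ in what they return for -v.
    """
    if shutil.which("pdftotext") is None:
        return None
    result = subprocess.run(
        ["pdftotext", "-v"], capture_output=True, text=True
    )
    return (result.stdout + result.stderr).strip()
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces. Pasting the output of describe_pdftotext() into a question, along with the operating system, shows which pdftotext variant produced the result.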

367 questions
0
votes
1 answer

pdftotext output different on Windows 7 PC and Linux server, why?

I am using the same version of xpdf on both machines. However, the .txt file created on the Windows 7 PC is different from that created on the Ubuntu 12.04 Linux server. The Windows 7 .txt file is well formed with numerous line breaks that makes…
user2014597
  • 31
  • 1
  • 4
0
votes
1 answer

pdftotext with external URLs (PHP)

I want to make PDFs from external URLs searchable. I'm using pdftotext from XPDF. It's working fine with PDFs already on my webspace, but I keep getting an error message when trying to use external PDFs instead. Specifically I get: "Error: Couldn't…
Matt
  • 15
  • 1
  • 5
0
votes
1 answer

PHP - Convert PDF to Text (No access to exec/shell_exec)

The case: Server doesn't support exec/shell_exec (so pdftotext is excluded). Other libraries don't accept the PDF. Pdftotext works (tested on the files locally). Here are some excerpts from the (PDF) code: 5 0…
Simon
  • 5,464
  • 6
  • 49
  • 85
0
votes
1 answer

itextsharp PdfTextExtractor Spelling Words Wrong

There's a PDF in our database in binary. I streamed it out and saved it as a PDF file and tested with both sources and ended up with the same result: the PdfTextExtractor spells some words wrong. For example, there is a word, "confirmed" in the PDF.…
StronglyTyped
  • 2,134
  • 5
  • 28
  • 48
0
votes
2 answers

shell_exec() statement to pdftotext entire directory?

I'm at a loss as to how I could build a loop to pdftotext an entire directory through a shell_exec() statement. Something like : $pdfs = glob("*.pdf"); foreach($pdfs as $pdfs) { shell_exec('pdftotext '.$pdfs.' '.$pdfs'.txt'); } But I'm unsure…
RCNeil
  • 8,581
  • 12
  • 43
  • 61
-1
votes
0 answers

How can I search for a keyword or text across multiple PDFs saved in S3 using Laravel and Vue.js?

I have an application which is built in Vue.js and Laravel 8. I am handling multiple users with different PDF documents. I have a search page for these users. My requirement is this: when a keyword is searched in a field, I want it to be checked…
aseel
  • 431
  • 5
  • 8
-1
votes
1 answer

Can't import pdftotext in Python on my M1 Mac

I can't import pdftotext on my new M1 Mac. The steps I took are: (1) install Python 3.10; (2) install command line developer tools; (3) pip3 install pdftotext from terminal; (4) open IDLE, type import pdftotext. I get this error: Traceback (most recent call…
Antonio
  • 21
  • 6
-1
votes
1 answer

Strange 1 byte character result with pdftotext from .pdf to .txt

I have this weird result when transferring a single pdf with no content to a .txt file. I am using this PHP code in a foreach for all the files found in the dir. It works ridiculously well with the -raw option if there is text available in the…
KJS
  • 1,176
  • 1
  • 13
  • 29
-1
votes
1 answer

Extracting text from PDF - Rstudio

Using the pdftools library, I was able to extract only 3 pages of a PDF file that has 30 pages. What can be the issue? How do I extract text from all the pages? The first 3 pages contain normal text and many other pages contain tabular columns.
-1
votes
1 answer

Tex to Word Pipeline With Reference File

I would like to convert my Overleaf template to a Word document for my collaborators to edit directly outside of Overleaf. I am aware of Pandoc to convert the text file to Word: pandoc -o Test.docx Test.tex. However, my tex document uses references…
Cody Glickman
  • 514
  • 1
  • 8
  • 30
-1
votes
1 answer

Python code to read excel document and verify if info on scanned paperwork is among the list then separate items in different files

I find myself in a situation where multiple sheets of paper are printed out, containing information that must be verified against an Excel document I receive through e-mail. My role is to check if the received sheets are among the Excel list by checking…
-1
votes
1 answer

TypeError while converting pdf to txt file

I have written a function that converts each PDF in a directory into text, and I want to get the converted text from the PDFs as txt files. I am getting a "TypeError: expected str, bytes or os.PathLike object, not tuple" error in my code. Can anyone…
Swordsman
  • 143
  • 1
  • 2
  • 14
-1
votes
2 answers

Extracting text from a PDF file using Python 2.7 on Windows 7

I have been using PyPDF2 to extract the text included in this PDF file (generated with pdfTeX-1.40.0) using Python 2.7. It works fine, but now I have to extract text from the same PDF generated with LibreOffice 4.3 and the result is this (not whole): ˜ !…
Budlog
  • 79
  • 10
-1
votes
1 answer

R/R Studio: Iterate Folder of PDFs and Convert to R Objects

I'm using RStudio Version 1.0.153. I have a folder of approximately 30 PDFs. I would like to convert them to respective objects in R as character strings. I already have the pdftools package and it successfully converts to objects, I'm just looking…
MeeraWhy
  • 93
  • 6
-1
votes
1 answer

Cannot select PDF from top to bottom

I'm using pdftotext to extract info from a pdf. Currently using the -raw option. I do have a few problems with the PDFs I'm working with. If I select the text from top to bottom it selects in the following fashion. PDF content: A B C It selects A…
eatorres
  • 169
  • 1
  • 15