Questions tagged [pdftotext]

Pdftotext converts Portable Document Format (PDF) files to plain text.

pdftotext is a command-line utility for converting PDF files to plain text files, i.e. extracting raw text from PDF-encapsulated files.

pdftotext is freely available and included by default with many Linux distributions, and is also available for Windows as part of the Xpdf Windows port. Poppler, which is derived from Xpdf, also includes an implementation of pdftotext, which is shipped in the poppler-utils package on most major Linux distributions.

However, there are also other CLI-based PDF text extraction tools with a similar or identical name. While they (for the most part) work in the same way, they may give different results. So, only use this tag for CLI-based pdftotext tools and variants, and make sure to state your specific version and environment.

Do not use this tag if you use a different extraction tool, e.g. a GUI-based PDF-to-text converter, an online PDF-to-text converter, or another (commercial) tool.
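As a rough illustration of the kind of CLI usage this tag covers, here is a minimal Python sketch that calls the command-line tool and reports which pdftotext is installed. It assumes a `pdftotext` binary is on the PATH; the helper names (`build_pdftotext_cmd`, `extract_text`, `pdftotext_version`) are made up for this example. The `-layout`, `-enc` and `-v` options exist in both the Xpdf and Poppler variants, but output can still differ between them.

```python
import subprocess


def build_pdftotext_cmd(pdf_path, layout=True):
    """Build the argument list for a pdftotext call (hypothetical helper)."""
    cmd = ["pdftotext", "-enc", "UTF-8"]  # request UTF-8 output
    if layout:
        cmd.append("-layout")  # keep the physical layout of the page
    # "-" as the output file sends the text to stdout instead of a .txt file.
    cmd += [pdf_path, "-"]
    return cmd


def extract_text(pdf_path, layout=True):
    """Run pdftotext and return the extracted text as a str."""
    # Passing a list (no shell) means paths with spaces need no quoting.
    result = subprocess.run(
        build_pdftotext_cmd(pdf_path, layout),
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def pdftotext_version():
    """Return the first line printed by `pdftotext -v`."""
    # The version banner usually goes to stderr, and some builds exit
    # with a non-zero status for -v, so check=True is not used here.
    result = subprocess.run(["pdftotext", "-v"], capture_output=True)
    banner = (result.stderr + result.stdout).decode("utf-8", errors="replace")
    lines = banner.splitlines()
    return lines[0] if lines else ""
```

Including the output of `pdftotext_version()` in a question makes it clear whether the Xpdf or the Poppler variant is in use.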

367 questions

1 answer

Where is file needed for PDFTOTEXT output in UTF-8 format?

I want to use the XPDF-based PDFTOTEXT command-line tool to look at PDF files, hoping to get UTF-8 output. I have seen others on StackOverflow getting it -- questions 4039930, 3809761 and 13618330 show that others have been able to use it. When I…

utf-8 pdftotext

asked Nov 21 '13 at 17:09

J.Merrill

0 answers

pdftotext get font information (font-family, style, size)

I'm using "pdftotext -bbox file.pdf" to convert a pdf file into HTML. Here's a sample line from the output: foo. Is there a way to get font information for every word…

text-extraction pdftotext poppler pdf-scraping xpdf

asked May 06 '18 at 11:23

James Kroning

1 answer

How to execute xpdf (pdftotext.exe) on shared drive?

I'm trying to parse PDF to text via PHP and XPDF (pdftotext.exe). On my localhost everything works well, but when I try to move everything to the server, I run into trouble. First of all I checked some settings on the server and safe_mode is off,…

php cmd exec pdftotext xpdf

asked Jan 28 '16 at 14:04

Luboš Suk

1 answer

How to use pdftotext library with "-layout" option in Python

I am using the Python library pdftotext to scrape the text of a PDF file. That works great but I need the "-layout" option that the command line tool offers with pdftotext -layout pdf_file.pdf. Not sure if that's possible without having to…

python pdftotext

asked Apr 15 '21 at 09:53

Alexandre

3 answers

calling pdftotext from python script not working when I change from local machine to my webhosting

I wrote a small python script to parse/extract info from a PDF. I tested it on my local machine, I have python 2.6.2 and pdftotext version 0.12.4. I am trying to run this on my webhosting server (dreamhost). It has python version 2.5.2 and pdftotext…

python scripting subprocess dreamhost pdftotext

asked Jan 29 '11 at 13:29

Chaitanya

3 answers

Read PDF in Python and convert to text in PDF

I have used this code to convert pdf to text. input1 = '//Home//Sai Krishna Dubagunta.pdf' output = '//Home//Me.txt' os.system(("pdftotext %s %s") %( input1, output)) I have created the Home directory and pasted the source file in it. The output…

python pdftotext

asked May 23 '14 at 04:55

Krishna

2 answers

Is it possible to extract a pdf with its white spaces in Python?

I have been attempting to extract a pdf with Python after a tool was created to extract it using java and pdfbox. While the Java implementation was successful for the same pdf, I have been struggling to do the same in python since both pdfminer and…

python pypdf pdftotext

asked Jun 16 '13 at 04:38

Oeufcoque Penteano

2 answers

PHP Explode with an Unicode character as separator

XPDF's pdftotext converts PDF to text and outputs it at command-line level. If needed, it inserts page breaks between the pages as specified in TextOutputDev.cc: eopLen = uMap->mapUnicode(0x0c, eop, sizeof(eop)); This Unicode symbol is encoding…

php unicode explode pdftotext xpdf

asked Sep 02 '12 at 09:36

sluijs

1 answer

ways to separate passages in pdf using gap?

I have some PDFs with 2-3 passages on every page. Every passage is separated by a line gap, but when reading with pymupdf, I cannot see any machine-printable separator between passages. Is there any other way, other library can do…

pdf pdfminer pdftotext pymupdf pdfium

asked Sep 02 '22 at 09:24

Saivenkataraju

3 answers

Textract: failed with exit code 127 // windows 10 // pdftotext

When I try to run my program (after deploying it with pyinstaller) for reading and converting a PDF file and entering it into a Google Sheet, I get the error shown in the image below. However, I cannot seem to figure out what the problem…

python pyinstaller file-not-found pypdf pdftotext

asked Aug 11 '20 at 11:49

Thomas Broek

1 answer

I'm having difficulty installing pdftotext for python

I am trying to install pdftotext, but I keep receiving the same error even after installing the visual tools. This happens for both pip install and I am just trying to find it in my directory... Terminal output below: C:\Users\garec\Downloads>pip3…

python pdftotext poppler

asked Jul 18 '20 at 22:15

Alfonso Garcia

2 answers

iTextSharp.LGPLv2.Core get text from PDF into a string

Recently our project upgraded to the new iTextSharp.LGPLv2.Core v1.6.5. I had a method which extracted text from the PDF file. Back then I used this: if (File.Exists(pdf1Path)) { var pdfReader = new PdfReader(pdf1Path); …

c# .net pdf itext pdftotext

asked Jun 18 '20 at 11:55

Apuna12

0 answers

Unable to Import Poppler even after installing in conda

I am trying to use the PDF rendering package Poppler, and I found an Anaconda installation for it here (https://anaconda.org/conda-forge/poppler). I can see the Poppler package installed in my conda env when I do conda list. However when I…

python pdftotext poppler

asked Apr 28 '20 at 19:40

Baktaawar

3 answers

Installing Poppler for PDF text extraction

I am trying to follow this blog to extract text from an invoice PDF file. My text extraction requires extracting specific fields of the invoice.…

python pdftotext poppler

asked Apr 23 '20 at 16:18

Baktaawar

3 answers

Unable to import pdftotext after installing with conda and poppler, Windows 10

I'm trying to use pdftotext, but it won't import. I'm running Windows 10 (64 bit) on a Lenovo IdeaPad S340, a work laptop. Following the directions here and here (which were super helpful), I: Installed Microsoft Visual C++ Build Tools. Installed…

python anaconda python-import importerror pdftotext

asked Jan 29 '20 at 03:05

Kaleb Coberly
