Questions tagged [pdftotext]

Pdftotext converts Portable Document Format (PDF) files to plain text.

pdftotext is a command-line utility for converting PDF files to plain text files, i.e. extracting the raw text from a PDF.

pdftotext is freely available and included by default with many Linux distributions; it is also available for Windows as part of the Xpdf Windows port. Poppler, which is derived from Xpdf, also includes an implementation of pdftotext, which is shipped as part of the poppler-utils package on most major Linux distributions.

However, there are other CLI-based PDF text extraction tools with a similar or identical name. While they work in the same way for the most part, they may give different results. So only use this tag for CLI-based pdftotext tools and variants, and make sure to state your specific version and environment.

Do not use this tag if you use a different extraction tool, e.g. a GUI-based PDF-to-text converter, an online PDF-to-text converter, or another (commercial) tool.
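As an illustration of the usage described above, here is a minimal Python sketch of calling the command-line tool and recording its version for a question. The helper names (build_command, tool_version, extract_text) are made up for this example. It assumes a pdftotext binary on the PATH that accepts the -enc, -layout and -v options and "-" as the output file, which the Xpdf and Poppler builds do; other variants may differ.

```python
import shutil
import subprocess


def build_command(pdf_path, layout=False, encoding="UTF-8"):
    """Build an argument list for pdftotext that writes the text to stdout."""
    cmd = ["pdftotext", "-enc", encoding]
    if layout:
        cmd.append("-layout")  # try to keep the physical page layout
    # "-" as the output file name sends the text to stdout
    cmd += [pdf_path, "-"]
    return cmd


def tool_version():
    """Return the version banner, or None if pdftotext is not on the PATH."""
    if shutil.which("pdftotext") is None:
        return None
    # The banner usually goes to stderr; fall back to stdout just in case
    result = subprocess.run(["pdftotext", "-v"], capture_output=True, text=True)
    return (result.stderr or result.stdout).strip()


def extract_text(pdf_path, layout=False):
    """Run pdftotext and return the extracted text as a str."""
    result = subprocess.run(
        build_command(pdf_path, layout), capture_output=True, check=True
    )
    return result.stdout.decode("utf-8")
```

Passing the arguments as a list rather than a single shell string means file names containing spaces need no extra quoting. The output of tool_version() is the kind of version detail worth including when asking a question under this tag.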

367 questions
6
votes
1 answer

Where is file needed for PDFTOTEXT output in UTF-8 format?

I want to use the XPDF-based PDFTOTEXT command-line tool to look at PDF files, hoping to get UTF-8 output. I have seen others on StackOverflow getting it -- questions 4039930, 3809761 and 13618330 show that others have been able to use it. When I…
J.Merrill
  • 1,233
  • 14
  • 27
5
votes
0 answers

pdftotext get font information (font-family, style, size)

I'm using "pdftotext -bbox file.pdf" to convert a pdf file into HTML. Here's a sample line from the output: foo Is there a way to get font information for every word…
5
votes
1 answer

How to execute xpdf (pdftotext.exe) on shared drive?

I'm trying to parse PDF to text via PHP and XPDF (pdftotext.exe). On my localhost everything works well, but when I try to move everything to the server, I run into trouble. First of all I checked some settings on the server and safe_mode is off,…
Luboš Suk
  • 1,526
  • 14
  • 38
4
votes
1 answer

How to use pdftotext library with "-layout" option in Python

I am using the Python library pdftotext to scrape the text of a PDF file. That works great but I need the "-layout" option that the command line tool offers with pdftotext -layout pdf_file.pdf. Not sure if that's possible without having to…
Alexandre
  • 105
  • 1
  • 6
4
votes
3 answers

calling pdftotext from python script not working when I change from local machine to my webhosting

I wrote a small python script to parse/extract info from a PDF. I tested it on my local machine, I have python 2.6.2 and pdftotext version 0.12.4. I am trying to run this on my webhosting server (dreamhost). It has python version 2.5.2 and pdftotext…
Chaitanya
  • 5,203
  • 8
  • 36
  • 61
4
votes
3 answers

Read PDF in Python and convert to text in PDF

I have used this code to convert pdf to text. input1 = '//Home//Sai Krishna Dubagunta.pdf' output = '//Home//Me.txt' os.system(("pdftotext %s %s") %( input1, output)) I have created the Home directory and pasted the source file in it. The output…
Krishna
  • 673
  • 3
  • 6
  • 21
4
votes
2 answers

Is it possible to extract a pdf with its white spaces in Python?

I have been attempting to extract a pdf with Python after a tool was created to extract it using java and pdfbox. While the Java implementation was successful for the same pdf, I have been struggling to do the same in python since both pdfminer and…
4
votes
2 answers

PHP Explode with an Unicode character as separator

XPDFs pdftotext converts pdf to text and outputs it at command line level. If needed it inserts PageBreaks between the pages as specified in TextOutputDev.cc: eopLen = uMap->mapUnicode(0x0c, eop, sizeof(eop)); This Unicode symbol is encoding…
sluijs
  • 4,146
  • 4
  • 30
  • 36
3
votes
1 answer

ways to separate passages in pdf using gap?

I have some pdf's with 2-3 passages for every page. every passage is separated by some line gap, but while reading with pymupdf, I cannot see any machine printable separator between passages. is there any other way, other library can do…
3
votes
3 answers

Textract: failed with exit code 127 // windows 10 // pdftotext

When I'm trying to run my (after deploying with pyinstaller) program for reading and converting a PDF file and entering it into a google sheet. I get the error shown in the image below. However I can not seem to figure out what the problem…
Thomas Broek
  • 59
  • 2
  • 7
3
votes
1 answer

I'm having difficulty installing pdftotext for python

I am trying to install pdftotext, but I keep receiving the same error even after installing the visual tools. This happens for both pip install and I am just trying to find it in my directory... Terminal output below: C:\Users\garec\Downloads>pip3…
3
votes
2 answers

iTextSharp.LGPLv2.Core get text from PDF into a string

recently our project upgraded to a new iTextSharp.LGPLv2.Core v1.6.5. I had a method which extracted a text from the PDF file. Back then I used this: if (File.Exists(pdf1Path)) { var pdfReader = new PdfReader(pdf1Path); …
Apuna12
  • 375
  • 2
  • 6
  • 23
3
votes
0 answers

Unable to Import Poppler even after installing in conda

I am trying to use pdf rendering package Poppler and I found an Anaconda Installation for the same here https://anaconda.org/conda-forge/poppler I can see the Poppler package installed in my conda env when I do conda list However when I…
Baktaawar
  • 7,086
  • 24
  • 81
  • 149
3
votes
3 answers

Installing Poppler for PDF text extraction

I am trying to follow this blog in trying to extract text from an invoice pdf file. My text extraction requires extraction specific fields of the invoice.…
Baktaawar
  • 7,086
  • 24
  • 81
  • 149
3
votes
3 answers

Unable to import pdftotext after installing with conda and poppler, Windows 10

I'm trying to use pdftotext, but it won't import. I'm running Windows 10 (64 bit) on a Lenovo IdeaPad S340, a work laptop. Following the directions here and here (which were super helpful), I: Installed Microsoft Visual C++ Build Tools. Installed…
Kaleb Coberly
  • 420
  • 1
  • 4
  • 19