Questions tagged [pdftotext]

Pdftotext converts Portable Document Format (PDF) files to plain text.

pdftotext is a command-line utility for converting PDF files to plain text files—i.e. extracting raw text from PDF-encapsulated files.

pdftotext is freely available and included by default with many Linux distributions, and is also available for Windows as part of the Xpdf Windows port. Poppler, which is derived from Xpdf, also includes an implementation of pdftotext, which is shipped in the poppler-utils package on most major Linux distributions.

However, there are also other CLI-based PDF text extraction tools with a similar or identical name. While they work in the same way for the most part, they may give different results. So, only use this tag for CLI-based pdftotext tools and variants, and make sure to state your specific version and environment.

Do not use this tag if you use a different extraction tool, e.g. a GUI-based PDF-to-text converter, an online PDF-to-text converter, or another (commercial) tool.
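For readers unfamiliar with the tool, here is a minimal sketch of driving the command-line pdftotext from a Python script. It assumes a pdftotext binary (Xpdf or Poppler) is on the PATH and that it writes UTF-8; the helper names `build_pdftotext_command` and `extract_text` are hypothetical and only for illustration. The `-layout` option and the `-` output target (standard output) are the only tool features used.

```python
import subprocess


def build_pdftotext_command(pdf_path, text_path="-", keep_layout=False):
    """Build the argument list for a pdftotext call (hypothetical helper).

    text_path="-" asks pdftotext to write the text to standard output.
    """
    command = ["pdftotext"]
    if keep_layout:
        # -layout tries to preserve the physical layout of the page.
        command.append("-layout")
    command += [str(pdf_path), str(text_path)]
    return command


def extract_text(pdf_path, keep_layout=False):
    """Run pdftotext and return the extracted text as a string.

    Assumes the binary is on the PATH and emits UTF-8 (an assumption;
    other builds or settings may differ).
    """
    completed = subprocess.run(
        build_pdftotext_command(pdf_path, "-", keep_layout),
        capture_output=True,
        encoding="utf-8",
        errors="replace",
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return completed.stdout
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces. Because results can differ between implementations, it helps to include the output of `pdftotext -v` when asking a question.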

367 questions

votes

1 answer

Installing pdftotext library on heroku

pdftotext library is a requirement in requirements.txt. While trying to push to heroku, I get the following error: remote: Running setup.py install for pdftotext: started remote: Running setup.py install for pdftotext: finished…

python heroku pdftotext

asked Jan 23 '19 at 12:12

Joel G Mathew

votes

3 answers

How can I determine which arguments a Python function takes?

Running the following code: pdf = pdftotext.PDF(f,layout='raw') produced this error: 'layout' is an invalid keyword argument for this function Is there a way to list which arguments this, and any, function would take?
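As a rough illustration of the general technique the question asks about, the standard library's `inspect.signature` can list a callable's parameters. The function `open_pdf` below is a hypothetical stand-in, not the real `pdftotext.PDF` constructor; callables implemented in C may not expose a signature at all, which the sketch handles by returning `None`.

```python
import inspect


def open_pdf(path, layout=False, password=""):
    """Hypothetical stand-in; NOT the real pdftotext.PDF constructor."""
    return (path, layout, password)


def list_parameters(func):
    """Return the parameter names of func, or None if they are unavailable.

    inspect.signature raises ValueError or TypeError for some callables
    (for example, many implemented in C), so both are caught here.
    """
    try:
        signature = inspect.signature(func)
    except (TypeError, ValueError):
        return None
    return list(signature.parameters)


print(list_parameters(open_pdf))
```

When no signature is available, `help(func)` or the library's documentation is usually the fallback.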

python function arguments pdftotext

asked Nov 12 '18 at 07:04

Jack

votes

3 answers

cannot install pdftotext on windows because of poppler

I am trying to install pdftotext on windows: pip install pdftotext It failed originally because of lack of MS visual studio (now installed) and now it fails with a poppler problem. I have downloaded poppler and it is installed in C:\Program Files…

qt pip pdftotext poppler

asked Sep 14 '18 at 17:07

Psionman

votes

2 answers

Including pdftotext from poppler on AWS NodeJS Lambda function

I am using the node module pdf-to-text for my Nodejs lambda function, but I was getting a "spawn pdftotext ENOENT" error. I tried launching an AWS EC2 instance and compiling poppler there using this script. I managed to get a tar.gz file on S3 which…

node.js amazon-web-services aws-lambda spawn pdftotext

asked Sep 07 '16 at 20:26

user3321096

votes

1 answer

Running pdftotext from Python

I am trying to convert a PDF document to a text document using the pdftotext software. I need to call this application in the command prompt from a Python script to convert the file. I have the following code: import os import subprocess path = "C:\\Users\\..."…

python windows subprocess pdftotext

asked Oct 23 '15 at 08:28

annamalai muthuraman

votes

3 answers

Replace only single occurrence of \n or \r in NSString

I am reading text from a PDF to NSString. I replace all the spaces using the code below NSString *pdfString = convertPDF(path); pdfString=[pdfString stringByTrimmingCharactersInSet:[NSCharacterSet whitespaceAndNewlineCharacterSet]]; …

ios objective-c cocoa pdftotext

asked May 06 '15 at 12:56

ankit_rck

votes

4 answers

how to couple xargs with pdftotext converter to search inside multiple pdf files

I am making a script which is supposed to search inside all the PDF files in a directory. I have found one converter named "pdftotext" which enables me to use grep on PDF files, but I am able to run it only with one file. When I want to run it over…

linux unix scripting xargs pdftotext

asked Mar 24 '15 at 12:05

user2809888

votes

0 answers

Extract text with style and format using TIKA from a PDF

I have a pdf file which contains section headings and its details, using Apache TIKA how do I extract text with its style and format?

apache apache-tika pdftotext

asked Feb 16 '15 at 14:10

Suresh Gorakala

votes

2 answers

PDFMiner - Get text lines

I'm converting PDF files to text with the PDFMiner Python library, using the code snippet provided in this SO answer. The problem is that the PDF is three column formatted, and I need to read each line. However, the text I get is unordered:…

python pdfminer pdftotext

asked Aug 06 '13 at 07:36

davids

votes

3 answers

Extract text content from PDF

I have been extracting text from PDFs using pdftotext. I have also done this with Ghostscript. Recently, a utility provider changed their PDFs so a portion of it is not being extracted by these methods. Specifically, I'm missing the due date and…

pdf ghostscript pdftotext

asked Feb 20 '13 at 17:26

Ben Walker

votes

3 answers

Convert PDF to text without pdftotext?

I have to convert PDFs to text and currently I am using pdftotext.exe. This messes up the resulting text sometimes and so I can't use that. Is there another free tool that I can call from another program? I'd prefer a command line tool.

pdf pdftotext

asked Jan 17 '12 at 08:40

EOB

votes

0 answers

Textract - windows10 - shell error - failed with exit code 127

The below code works fine for txt file but doesn't work with pdf files. import textract text = textract.process(r'C:\Users\Python_files\accounts.txt') However, I cannot seem to figure out what the problem is in the below code snippet: import…

python pypdf file-not-found pdfminer pdftotext

asked Apr 27 '23 at 06:32

Yukthi Bhat

votes

1 answer

How to solve (cid:x) pdfplumber python text extraction

I've been working with the pdfplumber library to extract text from pdf documents and it's been fine, however in the documents I'm working on now, I just get spaces and lots of (cid:x) instead of text. Any solution? Thanks with…

python pypdf pdftotext pdfplumber

asked Nov 12 '22 at 22:03

foliveir

votes

3 answers

Issue with ligatures when converting PDF to text

I am running into an issue when trying to convert a PDF to text where the ligatures 'fi' 'ff' 'fl' are being converted to an empty space. I have read through quite a few similar threads on the issue but have not found a solution that works. This…

python pdf pdftotext pdfplumber

asked Sep 14 '22 at 19:48

Garrett

votes

2 answers

error : Microsoft Visual C++ 14.0 is required while installing pdftotext

I am trying to install the pdftotext library on a Miniconda environment. After using pip install pdftotext, I am getting an error : Microsoft Visual C++ 14.0 is required I already have Visual Studio Build Tools 2019 (16.11.8) installed but I still…

python visual-c++ pip pdftotext

asked Dec 16 '21 at 09:22

Samuel Ducloux
