Questions tagged [pdftotext]

Pdftotext converts Portable Document Format (PDF) files to plain text.

pdftotext is a command-line utility for converting PDF files to plain text files, i.e. extracting raw text from PDF-encapsulated files.

pdftotext is freely available and included by default with many Linux distributions; it is also available for Windows as part of the Xpdf Windows port. Poppler, which is derived from Xpdf, also includes an implementation of pdftotext, which ships as part of the poppler-utils package on most major Linux distributions.

However, there are also other CLI-based PDF text extraction tools with a similar or identical name. While they work in the same way for the most part, they may give different results. So, only use this tag for CLI-based pdftotext tools and variants, and make sure to state your specific version and environment.

Do not use this tag if you use a different extraction tool, e.g. a GUI-based PDF-to-text converter, an online PDF-to-text converter, or another (commercial) tool.
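
Since questions under this tag should state the specific version and environment, the following is a minimal Python sketch of how one might call the command-line tool and capture its version banner. The function names (build_pdftotext_cmd, pdftotext_version, extract_text) are hypothetical helpers, not part of any library; the sketch assumes a pdftotext binary from Xpdf or Poppler is on the PATH and relies only on the -layout, -f, -l, -enc and -v options, plus "-" as the output file to write to stdout.

```python
import shutil
import subprocess


def build_pdftotext_cmd(pdf_path, layout=False, first_page=None, last_page=None):
    """Build an argument list that makes pdftotext write UTF-8 text to stdout."""
    cmd = ["pdftotext", "-enc", "UTF-8"]
    if layout:
        cmd.append("-layout")  # try to preserve the physical page layout
    if first_page is not None:
        cmd += ["-f", str(first_page)]  # first page to convert
    if last_page is not None:
        cmd += ["-l", str(last_page)]  # last page to convert
    cmd += [pdf_path, "-"]  # "-" as the output file means stdout
    return cmd


def pdftotext_version():
    """Return the version banner, or None if pdftotext is not on the PATH."""
    if shutil.which("pdftotext") is None:
        return None
    result = subprocess.run(["pdftotext", "-v"], capture_output=True, text=True)
    # The banner is commonly printed to stderr; fall back to stdout if empty.
    return (result.stderr or result.stdout).strip()


def extract_text(pdf_path, **options):
    """Run pdftotext on pdf_path and return the extracted text as a string."""
    result = subprocess.run(
        build_pdftotext_cmd(pdf_path, **options),
        capture_output=True,
        encoding="utf-8",
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout
```

Passing the arguments as a list rather than a shell string avoids quoting problems with paths that contain spaces. Including the output of pdftotext_version() in a question makes it clear whether the Xpdf or the Poppler variant is in use.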

367 questions
3
votes
1 answer

Installing pdftotext library on heroku

pdftotext library is a requirement in requirements.txt. While trying to push to heroku, I get the following error: remote: Running setup.py install for pdftotext: started remote: Running setup.py install for pdftotext: finished…
Joel G Mathew
  • 7,561
  • 15
  • 54
  • 86
3
votes
3 answers

How can I determine which arguments a Python function takes?

Running the following code: pdf = pdftotext.PDF(f,layout='raw') produced this error: 'layout' is an invalid keyword argument for this function Is there a way to list which arguments this, and any, function would take?
Jack
  • 313
  • 5
  • 22
3
votes
3 answers

cannot install pdftotext on windows because of poppler

I am trying to install pdftotext on windows: pip install pdftotext It failed originally because of lack of MS visual studio (now installed) and now it fails with a poppler problem. I have downloaded poppler and it is installed in C:\Program Files…
Psionman
  • 3,084
  • 1
  • 32
  • 65
3
votes
2 answers

Including pdftotext from poppler on AWS NodeJS Lambda function

I am using the node module pdf-to-text for my Nodejs lambda function, but I was getting a "spawn pdftotext ENOENT" error. I tried launching an AWS EC2 instance and compiling poppler there using this script. I managed to get a tar.gz file on S3 which…
3
votes
1 answer

Running pdftotext from Python

I am trying to convert a pdf document to a text document using the pdftotext software. I need to call this application in the command prompt from a python script to convert the file. I have the following code: import os import subprocess path = "C:\\Users\\..."…
3
votes
3 answers

Replace only single occurrence of \n or \r in NSString

I am reading text from a PDF to NSString. I replace all the spaces using the code below NSString *pdfString = convertPDF(path); pdfString=[pdfString stringByTrimmingCharactersInSet:[NSCharacterSet whitespaceAndNewlineCharacterSet]]; …
ankit_rck
  • 1,796
  • 2
  • 14
  • 24
3
votes
4 answers

how to couple xargs with pdftotext converter to search inside multiple pdf files

I am making a script which is supposed to search inside all the pdf files in a directory. I have found a converter named "pdftotext" which enables me to use grep on pdf files, but I am able to run it only with one file. When I want to run it over…
user2809888
3
votes
0 answers

Extract text with style and format using TIKA from a PDF

I have a pdf file which contains section headings and its details, using Apache TIKA how do I extract text with its style and format?
3
votes
2 answers

PDFMiner - Get text lines

I'm converting PDF files to text with the PDFMiner Python library, using the code snippet provided in this SO answer. The problem is that the PDF is three column formatted, and I need to read each line. However, the text I get is unordered:…
davids
  • 6,259
  • 3
  • 29
  • 50
3
votes
3 answers

Extract text content from PDF

I have been extracting text from PDFs using pdftotext. I have also done this with Ghostscript. Recently, a utility provider changed their PDFs so a portion of it is not being extracted by these methods. Specifically, I'm missing the due date and…
Ben Walker
  • 2,037
  • 5
  • 34
  • 56
2
votes
3 answers

Convert PDF to text without pdftotext?

I have to convert PDFs to text and currently I am using pdftotext.exe. This messes up the resulting text sometimes and so I can't use that. Is there another free tool that I can call from another program? I'd prefer a command line tool.
EOB
  • 2,975
  • 17
  • 43
  • 70
2
votes
0 answers

Textract - windows10 - shell error - failed with exit code 127

The below code works fine for txt file but doesn't work with pdf files. import textract text = textract.process(r'C:\Users\Python_files\accounts.txt') However, I cannot seem to figure out what the problem is in the below code snippet: import…
2
votes
1 answer

How to solve (cid:x) pdfplumber python text extraction

I've been working with the pdfplumber library to extract text from pdf documents and it's been fine, however in the documents I'm working on now, I just get spaces and lots of (cid:x) instead of text. Any solution? Thanks with…
foliveir
  • 59
  • 5
2
votes
3 answers

Issue with ligatures when converting PDF to text

I am running into an issue when trying to convert a PDF to text where the ligatures 'fi' 'ff' 'fl' are being converted to an empty space. I have read through quite a few similar threads on the issue but have not found a solution that works. This…
Garrett
  • 21
  • 2
2
votes
2 answers

error : Microsoft Visual C++ 14.0 is required while installing pdftotext

I am trying to install the pdftotext library on a Miniconda environment. After using pip install pdftotext, I am getting an error : Microsoft Visual C++ 14.0 is required I already have Visual Studio Build Tools 2019 (16.11.8) installed but I still…