Questions tagged [pdftotext]

Pdftotext converts Portable Document Format (PDF) files to plain text.

pdftotext is a command-line utility for converting PDF files to plain text files, i.e. extracting raw text from PDF-encapsulated files.

pdftotext is freely available and included by default with many Linux distributions, and is also available for Windows as part of the Xpdf Windows port. Poppler, which is derived from Xpdf, also includes an implementation of pdftotext, which is shipped as part of the poppler-utils package on most major Linux distributions.

However, there are also other CLI-based PDF text extraction tools with a similar or identical name. While they work in the same way for the most part, they may give different results. So, only use this tag for CLI-based pdftotext tools and variants, and make sure to state your specific version and environment.

Do not use this tag if you use a different extraction tool, e.g. a GUI-based PDF-to-text converter, an online PDF-to-text converter, or another (commercial) tool.
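As an illustration of the command-line usage described above, here is a minimal Python sketch that calls the pdftotext binary and captures its output. It assumes a pdftotext executable (Xpdf or Poppler) is on PATH. The helper names build_pdftotext_command, extract_text and pdftotext_version are hypothetical and not part of any library. The sketch relies on the -layout and -enc options, on "-" as the output file to write to standard output, and on -v to print the version banner. Exact output and exit codes may differ between variants, which is why the version is worth reporting in a question.

```python
import shutil
import subprocess


def build_pdftotext_command(pdf_path, layout=False):
    """Build an argument list that sends the extracted text to stdout."""
    command = ["pdftotext", "-enc", "UTF-8"]
    if layout:
        # -layout asks pdftotext to keep the physical page layout.
        command.append("-layout")
    # A trailing "-" as the output file means "write to stdout";
    # without it, pdftotext writes a .txt file next to the PDF.
    command += [pdf_path, "-"]
    return command


def extract_text(pdf_path, layout=False):
    """Run pdftotext and return the extracted text as a string."""
    if shutil.which("pdftotext") is None:
        raise FileNotFoundError("pdftotext was not found on PATH")
    result = subprocess.run(
        build_pdftotext_command(pdf_path, layout),
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def pdftotext_version():
    """Return the version banner printed by 'pdftotext -v'."""
    # The banner is usually written to stderr; fall back to stdout.
    result = subprocess.run(["pdftotext", "-v"], capture_output=True)
    banner = result.stderr or result.stdout
    return banner.decode("utf-8", errors="replace").strip()
```

Calling extract_text("mypdf.pdf", layout=True) would then return the text that "pdftotext -layout mypdf.pdf -" prints, and pdftotext_version() gives the string to quote when describing your environment.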

367 questions
2
votes
1 answer

Convert PDF to XLS

I want to convert PDF file into CSV or XLS. I tried doing this by using python tabula: #!/bin/bash #!/usr/bin/env python3 import tabula # Read pdf into list of DataFrame df = tabula.read_pdf("File1.pdf", pages='all') # convert PDF into CSV…
linux01
  • 41
  • 2
  • 7
2
votes
0 answers

Flutter web - getting text from pdf file

I am trying to extract text from a PDF in a Flutter web application. Plugins are available for Android and iOS, but I could not find any plugin for the web.
2
votes
3 answers

Remove header and footer from pdftotext module in Python

I am using the pdftotext Python package to extract text from a PDF; however, I need to remove headers and footers from the text file to extract only the content. There could be two ways to solve this: using regular expressions on the text file, or using some…
Raghav Gupta
  • 454
  • 3
  • 12
2
votes
0 answers

pdftotext returns blank to variable and "Syntax error: Document stream is empty"

My code is: text=`pdftotext -layout $Input` ;; echo "$text" but it returns an empty string and the error "Syntax Error: Document stream is empty".
okayish
  • 21
  • 3
2
votes
1 answer

Extract Text from a pdf only English text Canadian Legislation R

I'm trying to extract data from a Canadian Act for a project (in this case, the Food and Drugs Act), and import it into R. I want to break it up into 2 parts. 1st the table of contents (pic 1). Second, the information in the act (pic 2). But I do…
2
votes
0 answers

python pdfplumber error converting pdf to jpg FailedToExecuteCommand `"gswin64c.exe"

I am trying to convert a PDF to an image using pdfplumber in Python (IDE: Jupyter). I have tried the following code: with pdfplumber.open("path to pdf") as pdf: first_page = pdf.pages[0] im = first_page.to_image() I have downloaded the dependencies…
Shyam
  • 357
  • 1
  • 9
2
votes
0 answers

Tika with Python to parse PDF text is not able to handle vertical text. Any ideas?

I am trying to extract text from a PDF using Python Tika library. The library is picking up text in the sequence I want. However, it is not able to handle vertically aligned text. For example, the word, is read as: V al ue s There are many…
user3865019
  • 45
  • 1
  • 6
2
votes
1 answer

How to run multiple files together on Refextract

I'm a novice in python and I need to extract references from scientific literature. Following is the code I'm using from refextract import extract_references_from_file import pandas as pd references =…
Vinoj Raj
  • 25
  • 5
2
votes
3 answers

Exception has occurred: ImportError DLL load failed while importing pdftotext: The specified module could not be found

I am new to Python and am currently having trouble importing some libraries. I have installed pdftotext via pip install pdftotext and conda install -c conda-forge poppler after following the instructions from this link: Unable to install pdftotext on…
Dominic
  • 341
  • 4
  • 15
2
votes
1 answer

PackagesNotFoundError: The following packages pykg-config are not available from current channels:

I'm trying to install some new packages pykg-config to get access to functions necessary for a university assignment. When I try to install, I get the following: Solving environment: failed with initial frozen solve. Retrying with flexible…
Dominic
  • 341
  • 4
  • 15
2
votes
1 answer

Regex to find paragraph that contains a sentence in a multi-line text

I have text extracted from a PDF that looks like this: ======================================== TITLE subtitle Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the…
2
votes
1 answer

Can't build a standalone .exe with the module pdftotext

I'm trying to convert my Python script, which uses the pdftotext module, into a standalone .exe. When I test the .exe in my Anaconda environment it works correctly, but when I test it on another device it gives me this error: File "main.py", line 3,…
Michele
  • 95
  • 8
2
votes
0 answers

Easy way to extract/validate data from OCR JSON result based on rules/selectors

My goal is to extract information from several different types of Invoices and transform that input into standard output. For now, all the Invoices are in PDF format (original digital pdfs, not printed!), so I don't think I need OCR but maybe in the…
2
votes
1 answer

PDFtotext - whitespace showing as aacute on commandline

I am extracting text using python from a textfile created from pdf using pdftotext. It is one of 2000 files and in this particular one, a line of keywords ends in EU. The remainder of the line is blank to the naked eye and so is the following…
2
votes
0 answers

Python - pdftotext keep formatting in a table like layout

I have a PDF document with the content below (simplified): pdftotext mypdf.pdf -layout generates: Contact myemail@domain.com Now I have created a Python script that can take "column-like" input and parse the file accordingly.…
oliverbj
  • 5,771
  • 27
  • 83
  • 178