Questions tagged [pdftotext]

Pdftotext converts Portable Document Format (PDF) files to plain text.

pdftotext is a command-line utility that converts PDF files to plain text files, i.e. it extracts the raw text from PDF documents.

pdftotext is freely available and included by default with many Linux distributions, and is also available for Windows as part of the Xpdf Windows port. Poppler, which is derived from Xpdf, also includes an implementation of pdftotext, which ships as part of the poppler-utils package on most major Linux distributions.
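As a rough illustration of how the command-line tool is typically driven from a script, here is a minimal Python sketch. The helper names `build_pdftotext_command` and `extract_text` are hypothetical, invented for this example. It assumes a `pdftotext` binary (Xpdf or Poppler) is on the PATH and relies only on the `-layout`, `-f`, `-l` and `-enc` options and on `-` as the output file name.

```python
import subprocess
from typing import List, Optional


def build_pdftotext_command(pdf_path: str, layout: bool = False,
                            first_page: Optional[int] = None,
                            last_page: Optional[int] = None) -> List[str]:
    """Build an argument list for the pdftotext CLI (hypothetical helper)."""
    cmd = ["pdftotext", "-enc", "UTF-8"]  # ask for UTF-8 output explicitly
    if layout:
        cmd.append("-layout")  # try to preserve the physical page layout
    if first_page is not None:
        cmd += ["-f", str(first_page)]  # first page to convert
    if last_page is not None:
        cmd += ["-l", str(last_page)]  # last page to convert
    # "-" as the output file sends the text to stdout; without it,
    # pdftotext writes a .txt file next to the PDF instead.
    cmd += [pdf_path, "-"]
    return cmd


def extract_text(pdf_path: str, **options) -> str:
    """Run pdftotext and return the extracted text; raises if the tool fails."""
    result = subprocess.run(
        build_pdftotext_command(pdf_path, **options),
        capture_output=True, encoding="utf-8", errors="replace", check=True,
    )
    return result.stdout
```

For example, `extract_text("report.pdf", layout=True)` would return the text of a (hypothetical) `report.pdf` as one string. Results may differ between the Xpdf and Poppler builds of the tool.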

However, there are also other CLI-based PDF text extraction tools with a similar or identical name. While they work in the same way for the most part, they may give different results. So, only use this tag for CLI-based pdftotext tools and variants, and make sure to state your specific version and environment.
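One way to collect the version and environment details for a question is sketched below. The function name `describe_pdftotext_environment` is hypothetical. The sketch assumes only that the installed tool accepts `-v` to print its version banner, which may go to stderr, so both streams are captured.

```python
import platform
import shutil
import subprocess


def describe_pdftotext_environment() -> str:
    """Return a short report: OS, binary location, and the tool's version banner."""
    lines = [f"OS: {platform.platform()}"]
    binary = shutil.which("pdftotext")
    if binary is None:
        lines.append("pdftotext: not found on PATH")
        return "\n".join(lines)
    lines.append(f"pdftotext binary: {binary}")
    # -v prints version and copyright information; the exit code is not
    # checked because it can be non-zero for this option.
    result = subprocess.run([binary, "-v"], capture_output=True, text=True)
    banner = (result.stdout + result.stderr).strip()
    lines.append(banner or "pdftotext -v printed nothing")
    return "\n".join(lines)
```

Pasting the output of `print(describe_pdftotext_environment())` into a question usually shows whether the binary comes from Xpdf or from Poppler.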

Do not use this tag if you use a different extraction tool, e.g. a GUI-based PDF-to-text converter, an online PDF-to-text converter, or another (commercial) tool.

367 questions

1 answer

Convert PDF to XLS

I want to convert a PDF file into CSV or XLS. I tried doing this using Python tabula: #!/bin/bash #!/usr/bin/env python3 import tabula # Read pdf into list of DataFrame df = tabula.read_pdf("File1.pdf", pages='all') # convert PDF into CSV…

python pdf python-3.7 pdftotext tabula

asked Oct 20 '21 at 11:41

linux01

0 answers

Flutter web - getting text from pdf file

I am trying to extract text from a PDF in a Flutter web application. Plugins are available for Android and iOS, but I could not find any plugin for the web.

flutter pdf flutter-web pdftotext

asked Oct 10 '21 at 02:45

Paras Mittal

3 answers

Remove header and footer from pdftotext module in Python

I am using the pdftotext Python package to extract text from a PDF; however, I need to remove headers and footers from the text file to extract only the content. There could be two ways to solve this: using regular expressions on the text file, or using some…

python ocr text-extraction pdftotext

asked May 13 '21 at 08:35

Raghav Gupta

0 answers

pdftotext returns blank to variable and "Syntax error: Document stream is empty"

My code is: text=`pdftotext -layout $Input` ;; echo "$text" but it returns an empty string and the error "Syntax Error: Document stream is empty".

shell pdftotext

asked Apr 17 '21 at 12:35

okayish

1 answer

Extract Text from a pdf only English text Canadian Legislation R

I'm trying to extract data from a Canadian Act for a project (in this case, the Food and Drugs Act), and import it into R. I want to break it up into 2 parts. 1st the table of contents (pic 1). Second, the information in the act (pic 2). But I do…

r pdftotext tabulizer pdftools

asked Feb 27 '21 at 03:52

Alex Betsos

0 answers

python pdfplumber error converting pdf to jpg FailedToExecuteCommand `"gswin64c.exe"

I am trying to convert a PDF to an image using pdfplumber in Python (IDE: Jupyter). I have tried the following code: with pdfplumber.open("path to pdf") as pdf: first_page = pdf.pages[0] im = first_page.to_image() I have downloaded the dependencies…

python-3.x pdfminer pdftotext tabula

asked Sep 11 '20 at 09:37

Shyam

0 answers

Tika with Python to parse PDF text is not able to handle vertical text. Any ideas?

I am trying to extract text from a PDF using Python Tika library. The library is picking up text in the sequence I want. However, it is not able to handle vertically aligned text. For example, the word, is read as: V al ue s There are many…

python apache-tika pdftotext vertical-text

asked Aug 27 '20 at 09:51

user3865019

1 answer

How to run multiple files together on Refextract

I'm a novice in Python and I need to extract references from scientific literature. The following is the code I'm using: from refextract import extract_references_from_file import pandas as pd references =…

python python-3.x reference pdftotext

asked Aug 15 '20 at 16:24

Vinoj Raj

3 answers

Exception has occurred: ImportError DLL load failed while importing pdftotext: The specified module could not be found

I am new to Python and currently having trouble importing some libraries. I have installed pdftotext via pip install pdftotext and conda install -c conda-forge poppler after following the instructions from this link Unable to install pdftotext on…

python django pdftotext

asked Jul 01 '20 at 07:11

Dominic

1 answer

PackagesNotFoundError: The following packages pykg-config are not available from current channels:

I'm trying to install some new packages pykg-config to get access to functions necessary for a university assignment. When I try to install, I get the following: Solving environment: failed with initial frozen solve. Retrying with flexible…

python anaconda pdftotext

asked Jun 29 '20 at 10:20

Dominic

1 answer

Regex to find paragraph that contains a sentence in a multi-line text

I have text extracted from a PDF that looks like this: ======================================== TITLE subtitle Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the…

python regex pdftotext

asked May 25 '20 at 20:25

Bruno Neves

1 answer

Can't build a standalone .exe with the module pdftotext

I'm trying to convert my Python script, which uses the pdftotext module, into a standalone .exe. When I test the .exe app in my Anaconda env it works correctly, but when I test it on another device it gives me this error: File "main.py", line 3,…

anaconda pyinstaller pdftotext poppler

asked Mar 31 '20 at 17:52

Michele

0 answers

Easy way to extract/validate data from OCR JSON result based on rules/selectors

My goal is to extract information from several different types of Invoices and transform that input into standard output. For now, all the Invoices are in PDF format (original digital pdfs, not printed!), so I don't think I need OCR but maybe in the…

ocr google-cloud-vision azure-cognitive-services pdftotext amazon-textract

asked Jul 22 '19 at 10:55

João Antunes

1 answer

PDFtotext - whitespace showing as aacute on commandline

I am extracting text using python from a textfile created from pdf using pdftotext. It is one of 2000 files and in this particular one, a line of keywords ends in EU. The remainder of the line is blank to the naked eye and so is the following…

python character-encoding removing-whitespace pdftotext

asked Apr 16 '11 at 23:10

jobucks

0 answers

Python - pdftotext keep formatting in a table like layout

I have a PDF document with below content (simplified): pdftotext mypdf.pdf -layout generates: Contact myemail@domain.com Now I have created a Python script, that can take "column-like" input, and parse the file accordingly.…

python python-3.x pdftotext

asked Jun 24 '19 at 18:16

oliverbj
