
I am following this blog post to extract text from an invoice PDF file. I need to extract specific fields of the invoice.

https://kaijento.github.io/2017/03/27/pdf-scraping-gwinnetttaxcommissioner.publicaccessnow.com/#pdftotext

I have tried pdfminer and textract, but both return the text jumbled, which makes it difficult to pull out individual fields afterwards.

I came across the Poppler package, which can be downloaded here:

https://poppler.freedesktop.org/releases.html

It looks like it is a .tar file and not a Python package.

I am not sure how to extract the package from this .tar file and use it in Python.

Any suggestions on how to install this on my Mac and then use it programmatically in Python to run a batch of PDF files through it and extract data?

Baktaawar

3 Answers


Use `subprocess` to call the `pdftotext` program from the Xpdf tools. You can find MS Windows versions of those tools at https://www.xpdfreader.com/download.html. Get the "Xpdf command line tools".

I use it like this (Python 3.7):

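import subprocess as sp

def pdftotext(path):
    """
    Generate a text rendering of a PDF file as a single string.
    """
    # '-layout' keeps the physical layout of the page; '-' writes to stdout.
    args = ['pdftotext', '-layout', path, '-']
    # text=True (Python 3.7+) decodes stdout to str; check=True raises
    # CalledProcessError if pdftotext exits with a non-zero status.
    cp = sp.run(
        args, stdout=sp.PIPE, stderr=sp.DEVNULL,
        check=True, text=True
    )
    return cp.stdout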
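
Since the question asks for specific invoice fields, here is a rough sketch of how the layout-preserved text could be searched with regular expressions. The sample text, the field labels ("Invoice Number", "Invoice Date", "Total Due") and the names `FIELD_PATTERNS` and `extract_fields` are all made up for illustration; real invoices will need their own patterns.

```python
import re

# Hypothetical sample of what `pdftotext -layout` might print for an invoice;
# the labels and values are made up for illustration.
SAMPLE_TEXT = """\
ACME Corp                                   Invoice Number: INV-0042
                                            Invoice Date:   2020-04-23

Description                 Qty      Price
Widget                        2      10.00

                            Total Due:      20.00
"""

# Hypothetical field labels; adjust to match the wording on the real invoices.
FIELD_PATTERNS = {
    "invoice_number": re.compile(r"Invoice Number:\s*(\S+)"),
    "invoice_date": re.compile(r"Invoice Date:\s*(\d{4}-\d{2}-\d{2})"),
    "total_due": re.compile(r"Total Due:\s*([\d.,]+)"),
}

def extract_fields(text):
    """Return a dict of field name -> matched string (None if not found)."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        fields[name] = match.group(1) if match else None
    return fields

if __name__ == "__main__":
    print(extract_fields(SAMPLE_TEXT))
```

Because `-layout` keeps a label and its value on the same line, a simple per-field pattern is often enough. Fields that are missing come back as `None` rather than raising.
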
Roland Smith
  • Any idea how to get that installed on macOS or Linux? Do you think the same can be done via Poppler? Thanks – Baktaawar Apr 23 '20 at 18:22
  • @Baktaawar On Linux and macOS, install the `poppler-utils` package. [Available](https://pkgs.org/download/poppler-utils) on Linux and *BSD. – Roland Smith Apr 23 '20 at 19:07
  • Which one of those? And I know macOS is Linux based, but will the Linux one work on Mac? I tried installing xpdfreader from here: https://www.xpdfreader.com/download.html I downloaded the Mac one and unpacked it with tar xvzj. But when I run ./configure it says "No such directory" – Baktaawar Apr 23 '20 at 19:13
  • Any idea how we can install this xpdfreader on Mac? It doesn't seem to have a ./configure file – Baktaawar Apr 24 '20 at 03:33
  • Not every piece of software uses autoconf. It could be using another build system like `cmake`, `waf` or about a dozen others. – Roland Smith Apr 24 '20 at 05:38
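
For the macOS part of the question: Homebrew packages Poppler as `poppler` (`brew install poppler`), and on Debian/Ubuntu the command line tools are in `poppler-utils` (`sudo apt-get install poppler-utils`). Both put a `pdftotext` binary on the PATH. Below is a minimal sketch, assuming that binary is installed, that checks for it and then runs every PDF in a folder through it. The helper names `build_command`, `find_pdfs` and `convert_folder` are my own, not part of any library.

```python
import shutil
import subprocess
from pathlib import Path

def build_command(pdf_path):
    """Build the pdftotext argument list; '-' sends the text to stdout."""
    return ["pdftotext", "-layout", str(pdf_path), "-"]

def find_pdfs(folder):
    """Return the PDF files directly inside `folder`, sorted by path."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )

def convert_folder(folder):
    """Run pdftotext on every PDF in `folder`; return {file name: text}."""
    # Fail early with a readable message if the binary is not installed.
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH; install Poppler or Xpdf first")
    results = {}
    for pdf_path in find_pdfs(folder):
        completed = subprocess.run(
            build_command(pdf_path),
            stdout=subprocess.PIPE, stderr=subprocess.DEVNULL,
            check=True, text=True,
        )
        results[pdf_path.name] = completed.stdout
    return results
```

Calling `convert_folder("invoices")` (a hypothetical folder name) would give a dictionary mapping each file name to its extracted text.
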

You can try the Poppler bindings for Python here: https://pypi.org/project/python-poppler-qt5/

Vidyadhar Rao
  • This is new, I didn't know about it. Thanks for sending. I don't see functions to extract text using this. Do you know how we can use the above .tar file to extract some stuff? – Baktaawar Apr 23 '20 at 17:00
  • I have installed this one from conda. Do you think it is the same thing? https://anaconda.org/conda-forge/poppler – Baktaawar Apr 27 '20 at 22:04
  • conda or pip should work. Alternatively, if you want to install from source code, extract the files from the .tar file and run: `python setup.py install` – Vidyadhar Rao Apr 28 '20 at 09:21
  • I installed it from https://anaconda.org/conda-forge/poppler with conda install poppler, and I can see it in the conda env. But when I do import poppler it says there is no such module. Is that the right name? – Baktaawar Apr 28 '20 at 19:29
  • Check the code for reference here: https://github.com/jalan/pdftotext – Vidyadhar Rao May 01 '20 at 16:43
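
On the `import poppler` error: as far as I can tell, the conda `poppler` package is the Poppler library itself rather than a Python module, so there is nothing named `poppler` to import. The `pdftotext` project linked in the last comment is a separate Python package that is built against Poppler and imported as `pdftotext`. A minimal sketch of using it, assuming it is installed (`pip install pdftotext`); `join_pages` and `pdf_to_text` are my own helper names:

```python
def join_pages(pages, separator="\n\n"):
    """Join an iterable of page strings into one document string."""
    return separator.join(pages)

def pdf_to_text(path):
    """Read `path` with the pdftotext package and return all pages as one string."""
    # Imported here so the rest of the module loads even if the package is missing.
    import pdftotext  # third-party; needs the Poppler C++ library to build

    with open(path, "rb") as f:
        pdf = pdftotext.PDF(f)  # behaves like a sequence of page strings
        return join_pages(pdf)
```

Each element of the `pdftotext.PDF` object is the text of one page, so `len(pdf)` gives the page count and `pdf[0]` the first page.
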

Steps to install Poppler on Ubuntu:

sudo apt-get install libpoppler-cpp-dev

pip install --use-pep517 .
Akoffice