Questions tagged [pdfminer]

A Python-based tool for extracting information from PDF documents.

PDFMiner is a tool for extracting information from PDF documents. Unlike other PDF-related tools, it focuses entirely on getting and analyzing text data. PDFMiner allows one to obtain the exact location of text on a page, as well as other information such as fonts or lines. It includes a PDF converter that can transform PDF files into other text formats (such as HTML). It has an extensible PDF parser that can be used for purposes other than text analysis.
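
As a rough illustration of the "exact location of text" point, the sketch below lists each text box with its bounding box. It assumes the maintained fork pdfminer.six is installed (the original PDFMiner exposes a lower-level API); the function names iter_text_boxes and format_text_box are hypothetical helpers, not part of the library.

```python
from typing import Iterator, Tuple

# (x0, y0, x1, y1) in PDF points, origin at the bottom-left of the page.
BBox = Tuple[float, float, float, float]


def iter_text_boxes(pdf_path: str) -> Iterator[Tuple[int, BBox, str]]:
    """Yield (page_number, bbox, text) for each text container in a PDF.

    Assumes pdfminer.six; the import is done lazily so that the
    formatting helper below can be used without the package installed.
    """
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    for page_number, page_layout in enumerate(extract_pages(pdf_path), start=1):
        for element in page_layout:
            # Only text containers carry text; figures, lines, etc. are skipped.
            if isinstance(element, LTTextContainer):
                yield page_number, element.bbox, element.get_text()


def format_text_box(page_number: int, bbox: BBox, text: str) -> str:
    """Render one text box as a single line (hypothetical helper)."""
    x0, y0, x1, y1 = bbox
    snippet = " ".join(text.split())  # collapse newlines and repeated spaces
    return f"page {page_number} ({x0:.1f}, {y0:.1f})-({x1:.1f}, {y1:.1f}): {snippet}"


# Possible usage, with "sample.pdf" standing in for a real file:
#     for box in iter_text_boxes("sample.pdf"):
#         print(format_text_box(*box))
```

Keeping the formatting separate from the extraction means the output format can be changed, or checked, without needing a PDF at hand.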

Features

  • Written entirely in Python (version 2.4 or newer).
  • Parse, analyze, and convert PDF documents.
  • PDF-1.7 specification support. (well, almost)
  • CJK languages and vertical writing scripts support.
  • Various font types (Type1, TrueType, Type3, and CID) support.
  • Basic encryption (RC4) support.
  • PDF to HTML conversion (with a sample converter web app).
  • Outline (TOC) extraction.
  • Tagged contents extraction.
  • Reconstruct the original layout by grouping text chunks.

PDFMiner is about 20 times slower than other C/C++-based counterparts such as XPdf.


492 questions
-1
votes
1 answer

How to avoid PDF files with incorrect password error using PDFminer

I want to gather all PDF files from my computer and extract the text from each one. Both functions that I currently have do that; however, some PDF files are giving me this error: raise PDFPasswordIncorrect…
Cald0002
  • 13
  • 4
-1
votes
1 answer

TypeError while converting pdf to txt file

I have written a function that converts each PDF in a directory into text, and I want to save the converted text from the PDFs as txt files. I am getting a "TypeError: expected str, bytes or os.PathLike object, not tuple" error in my code. Can anyone…
Swordsman
  • 143
  • 1
  • 2
  • 14
-1
votes
1 answer

How can I use regex in my pdfminer code to extract text between two headings?

I have several PDFs that I want to extract data from. I have managed to use the code below to extract all the data from the PDF; however, now I want to extract text between two different headings. I believe using regex is the best way to do this as…
Jlingz14
  • 47
  • 6
-1
votes
2 answers

Extracting text from a PDF file using Python 2.7 on Windows 7

I have been using PyPDF2 to extract the text included in this PDF file (generated with pdfTeX-1.40.0) using Python 2.7. It works fine, but now I have to extract text from the same PDF generated with LibreOffice 4.3, and the result is this (not whole): ˜ !…
Budlog
  • 79
  • 10
-1
votes
1 answer

pdfminer.six installation: works okay in cmd prompt, but returns syntax error in the shell

I used pip install pdfminer.six in the command prompt, and the installation was successful. When I run pdf2txt.py C:\Python27\pdfminer\samples\simple1.pdf in the command prompt, the command was successful and returned this: c:\Python27>pdf2txt.py…
Bec
  • 17
  • 2
-1
votes
1 answer

Reading a pdf file using python

I have a PDF form that has been converted into a normal PDF document (using print2pdf software). I intend to extract the data from it; is there any way of doing so? I am currently using pdfminer, but it tends not to extract the data entered by…
misguided
  • 3,699
  • 21
  • 54
  • 96
-1
votes
1 answer

How to access an existing(!) matrix which partly contains invalid syntax?

I use pdfminer to convert PDF text into txt. pdfminer goes through the PDF file and reads it out line by line. Each line is assigned to a matrix variable. The problem is that, for some reason, in rare cases the matrix looks like, e.g., x = [[Г,…
-1
votes
1 answer

PDFQuery + files on server

I'm trying to search for a text string, say "can be", in a document which is located at 'https://developer.apple.com/library/ios/documentation/ides/conceptual/AppDistributionGuide/AppDistributionGuide.pdf'. For this purpose I'm using PDFQuery. Initially…
-2
votes
1 answer

How to convert pdf to HTML using python pdfminer?

Is there any code snippet that will work? I have tried this for converting pdf to html from pdfminer.pdfinterp import PDFResourceManager from pdfminer.pdfpage import PDFPage from pdfminer.converter import HTMLConverter, TextConverter from…
jerin
  • 1
  • 1
-3
votes
1 answer

Extract URLs, bookmarks, markups and comments from a PDF using PyPDF2 or PDFMiner

I tried to extract URLs, comments, or bookmarks from the PDF using PyPDF2 or pdfminer. I can't see /Annots or URI even if there are URLs or bookmarks present in the PDF.
user222213
  • 111
  • 1
  • 2
  • 12
-3
votes
1 answer

How to group the PowerPoint images programmatically

I'm trying to extract the images from a PDF using the pdfminer module. I want to extract the graph image as a single image, but the module does not return the whole graph image; instead it returns the separated images. I have converted the pdf to…
mani
  • 15
  • 1
  • 5
-4
votes
2 answers

I want to extract text from a PDF to a .text file using PDFminer. I have found the code but I have no idea how to use it

This is the code I found somewhere here. I have no idea how to use it. Can someone walk me through this and help me convert a sample pdf? from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter from pdfminer.converter import…
iMiner
  • 43
  • 1
  • 1
  • 6