Questions tagged [python-pdfreader]

Python API to parse PDF documents, extract texts (plain and formatted), images, XObjects, Forms and other data. Provides direct access to all object attributes and object history. Follows PDF 1.7 specification.

See pdfreader - Tutorials and Examples

32 questions
0 votes, 1 answer

Is there a way to measure the margins of a PDF using Python?

I've been using different Python packages to parse PDFs, but I'm wondering whether it's possible to measure the margins of a particular line in the document. Ideally the measurement would be in CSS-style pixels. It doesn't need…
Yehuda
0 votes, 1 answer

How to use Python Fitz to detect hyphens when using search_for?

I'm new to the Fitz library and am working on a project where I need to find a string in a PDF page. I'm running into a case where the text on the page that I'm searching on is hyphenated. I am aware of the TEXT_DEHYPHENATE flag that I can use in…
Kevin Wu
0 votes, 0 answers

Python PdfReader: Getting error when sequentially reading PDFs in a folder: Errno 2 (No such file or directory): 'filename.pdf'

I'm trying to put together code that will read through a folder of PDFs one by one to scrape relevant information such as part names, numbers, materials, and final treatments. The (presumably) problematic part of the code is: for fp in…
Tyler
0 votes, 0 answers

Can't read PDF files using Camelot

import camelot from google.colab import files uploaded = files.upload() file = "foo.pdf" tables = camelot.read_pdf(file) print("Total tables extracted:", tables.n) tables = camelot.read_pdf(file) print("Total tables extracted:",…
0 votes, 1 answer

Is there a way to read the contents of a PDF or Word document in Python while keeping its structure (level and depth of bulleted lists)?

I want to generate HTML code from a PDF or Word document. The document contains bulleted lists, and some bulleted lists contain other bulleted lists. I want to transform those bulleted lists into HTML, but when I extract the content of the…
0 votes, 1 answer

Comparing keywords with PDF files

Here is the program that loads the files by folder name and extracts data. Now I want to compare the data with the keywords that I used in the program below. But it gives me: pdfReader = pdfFileObj.loadPage(0) AttributeError:…
0 votes, 1 answer

Fields "Created" and "Modified" in Document Properties (PDF) were not displayed

I have merged many PDFs into a single PDF. I added metadata that includes the two fields "Created" and "Modified", but these fields still do not display any information. Here's my source code: import…
0 votes, 2 answers

Extract text from a PDF file in an S3 bucket with Python

I have files in multiple formats in my AWS S3 bucket, such as pdf, doc, rtf, odt, and png, and I need to extract text from them. I have managed to get the list of contents with their paths. Now, depending on the file type, I will use different libraries to extract text…
user14956888
0 votes, 1 answer

Can't use PyPDF2 to open my PDF file in Jupyter Notebook

I tried opening a PDF file that I downloaded, with the PyPDF2 module already installed, like this: import PyPDF2 pdfFileObj = open('ssopenpyxl-readthedocs-io-en-latest.pdf', 'rb') pdfReader = PyPDF2.PdfFileReader(pdfFileObj) pdfReader.numPages and…
0 votes, 1 answer

pdfplumber gives fp.seek(pos) AttributeError: 'dict' object has no attribute 'seek'

So this is my code: def main(): import combinedparser as cp from tkinter.filedialog import askopenfilenames files = askopenfilenames() print(files) #this gives the right files as a list of strings composed of path+filename …
0 votes, 0 answers

How can we create a blank PDF using PyPDF2?

import PyPDF2 writer = PyPDF2.PdfFileWriter() writer.addBlankPage(219, 297) with open (r"C:\\Users\\Aditya\\.spyder-py3\\scripting in python\\sample pdf with python\\mergedpdf.pdf","wb") as file: writer.write(file) file.close() unable to…
0 votes, 3 answers

Django open pdf on certain page number

I am trying to create a PDF analysis web app and I am stuck. I want to allow the user to open a certain page of a PDF that has over 300 pages. So, can anyone tell me how to use Django to open the PDF in a new tab on a specific page? EDIT…
0 votes, 1 answer

How to read data from a bank statement PDF in Python?

I have to read data from a bank statement PDF that contains text and a table. I have tried some solutions provided on Stack Overflow, but I get errors with most of them. Of the many I tried, the following code worked for me, but I am not getting the expected…
0 votes, 1 answer

How to store a PDF in a MySQL database without generating a PDF file in Python

So basically I have base64-encoded PDF data in a MySQL database, and I want to manipulate that data (update the form fields of the PDF data). After that, without creating or writing a PDF file, I want to store the updated data into a…
Chaitanya Bhojne
0 votes, 1 answer

Need help importing data from pdfplumber into a .csv file

I used pdfplumber to extract text from PDFs, but when I tried to write the data using to_csv, it threw an error. I need help importing the data into a .csv file. import pdfplumber import pandas as pd import numpy as np import os import re from collections…
Murthy P