Extracting text and other data from a PDF document, regardless of the libraries used to achieve this.
Questions tagged [pdf-extraction]
148 questions
0
votes
1 answer
How to retrieve ALL pages from PDF after button click and then insert it into a text editor PyPDF2
Getting stuck trying to get the entire range of pages to extract from a pdf before inserting it into a text box using PyPDF2. Only successful with individual pages (page = reader.pages[0]).
from tkinter import *
from tkinter import ttk
from tkinter…

Steve
- 65
- 4
0
votes
1 answer
Extract specific pages from a PDF file and save it with a specific name given on a excel using VBA or Python or VBA & Python
Let me mention below the steps and constrains.
I am using PDF Xchange editor
There is an excel document with PDF location (cell A1), PDF file name (cell A2), page range to extract/split which has start page (A3) and end page (A4) and finally that…
0
votes
1 answer
I want to use camelot for table extraction but its giving error
import camelot
tables = camelot.read_pdf(r"F:\testing\sbi_9.pdf", pages="all")
I have also downloaded GhostScript and still showing an error.
DeprecationError: PdfFileReader is deprecated and was removed in PyPDF2 3.0.0. Use PdfReader…

Raj
- 1
- 1
0
votes
0 answers
Mapping Headings with their corresponding Paragraphs in a pdf file using python
Mapping Headings with their corresponding Paragraphs in a pdf file using python.
I got a pdf with headings and paragraphs and i want to map all the headings with their respective paragraph.I want this because i have a list of keywords that i need to…

Jai jazz
- 1
- 4
0
votes
0 answers
Encoded PDF File Parsing
I am trying to parse PDF file to text. That file can be downloaded from official goverment site, but I spent hours trying to decode it. Adobe Extractor came close, but not really sure, If I can configure it to parse it…

SomeGuy
- 97
- 10
0
votes
0 answers
Do you have to parse text in order to extract PDF file or can you extract PDF file without Parsing first to read specific text words?
Currently we are utilizing the Solimar system to send PDF invoice files daily (that contains different customers) into a specific folder (via a batch job -> one big PDF file). The invoices are also printed locally on a daily basis and delivered to…

Rifi J
- 11
- 5
0
votes
0 answers
Tabula-py - Pdf Extraction
while extracting table from pdf using tabula..last 3 rows are not extracting..can anyone let me know where I'm going wrong?
I used read_pdf and give the path,pages=all,multiple_table=True and stream=True as parameters
0
votes
0 answers
Extract similar value and it's following details from different format of PDF using python library or any ML model
I need to extract patient name and their age and table contents of test details [For example Test name,unit,value] from different format of PDF.
I uploaded 2 images for reference from different format of PDF with similar value.
How can I extract and…

Mathi
- 45
- 6
0
votes
1 answer
Write python script to remove specified text form PDF files
I am struggling to remove text from a pdf file. I know this can be performed manually with PDF editors but I have a few PDF files to modify. The code I have so far is able to recognise all the text in a pdf file but dpes not remove th text when it…

Ryno Smith
- 1
- 2
0
votes
0 answers
Extract the center-aligned lines from a PDF document using itext7 in C#.Net application
anybody help me extract the center-aligned lines from a PDF document using itext7 in .Net Core application.
I have written the following extraction code so far, but cannot get the lines that are center aligned. Is there any way, please help
private…

Emran Hossain
- 1
- 2
0
votes
0 answers
Trouble Reading PDFPY2
Open the PDF file
pdf_file = open(file, 'rb')
Create a PDF reader object
pdf_reader = PyPDF2.PdfFileReader(pdf_file)
Get the number of pages in the PDF file
pages = pdf_reader.numPages
Initialize a variable to store the extracted text
text = ''
Loop…

Pammvi Group
- 1
- 2
0
votes
0 answers
How to extract text with respect to the heading?
Actually, I am trying to extract section of pdf with respect to the heading like in the sample pdf file we select the heading ABSTRACT so as output we need the text from The game to penalty area. . I am trying the below code, but I am getting error.…

Laxmi
- 21
- 8
0
votes
0 answers
PDF to XML using adobe c#
I have adobe license version that (adobe acrobat) can help me to export PDF to XML.
Can i do same thing with c#? (I tried other options like spreadsheet, Images and other but i want to do Convert it into XML)
Thank you

Darshit Gandhi
- 121
- 1
- 7
0
votes
1 answer
How to correctly format this pdfplumber extract_table() output to DataFrame?
I have searched stack overflow on how to extract table information from a pdf without horizontal lines, and I am almost successful, however this brings me to my next problem. How to correctly output the data for use in a DataFrame.
The pdf tables in…

GT1992
- 79
- 6
0
votes
0 answers
extracting data from multiple pdfs and putting that data into an excel table
I am taking data extracted from multiple pdfs that were merged into one pdf.
The data is based on clinical measurements taken from a sample at different time points. Some time points have certain measurement values while others are missing.
So far,…

kpook
- 1