Extracting text and other data from a PDF document, regardless of the libraries used to achieve this.
Questions tagged [pdf-extraction]
148 questions
1
vote
2 answers
Remove whitespace from PDF Document
I am using Camelot-py to read and extract attributes from several PDFs. I use table_areas to extract some of the attributes and I am facing difficulties in setting the correct areas, due to the deviation in X or Y co-ordinates between some of the…

A.A. F
- 349
- 5
- 16
1
vote
1 answer
Node.js - Problem to extract text from PDF file using Google Cloud Vision API
I'm new to cloud environments and programming in general, and I'm struggling to use the Google Vision API to extract text from a PDF file located in a remote bucket.
I've found it really difficult to get meaningful content related to this subject in…

Otávio Augusto
- 27
- 4
1
vote
1 answer
Extracting specific segments from PDF document
I have a few research papers in PDF format and I want to extract just the introduction, background, etc. from each paper. Also, I can only use Python. Can someone please help?

Cheryl
- 27
- 1
- 12
1
vote
0 answers
getting java.lang.ClassNotFoundException: org.apache.pdfbox.exceptions.CryptographyException when using Lucene-PDFbox jar
When I run this code, I get the following exception. It runs fine with only the PDFBox jar; I get this exception only with the Lucene-PDFBox jar.
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import…

monty
- 108
- 1
- 2
- 7
1
vote
1 answer
Extracting .pdf table
I wrote a chunk of R code that gets the .pdf table I am interested in, but there must be a better way. So I don't have a problem importing the data from the PDF; I am looking for a BETTER way than the following to extract the tables I am…

Helena
- 87
- 9
1
vote
0 answers
iOS Swift PDFDocument, Turkish Characters Broken Export
Problem: Some Turkish characters are broken in the string exported from the PDF.
Sample.pdf // Original content: “İzmir, çanakkale, kaş, ırmak, bağlıca, çin”
Example:
let document = PDFDocument.init("sample.pdf")
print(document?.string) // Output : zmir anakkale kaş, rmak, b…

redsponge
- 11
- 1
1
vote
3 answers
AttributeError: 'PDFPage' object has no attribute 'extractText'
I am trying to extract the content from a PDF in order to create an Excel sheet out of it.
What I tried
import pdfquery
pdf = pdfquery.PDFQuery('C:\\Users\\Santosh\\Downloads\\2017-San-Jamar-
Price-List-US-Z120913E-RevA.pdf')
page =…

Santosh
- 103
- 2
- 4
- 13
1
vote
3 answers
iTextSharp extracts wrapped cell contents into new lines - how do you identify to which column a given wrapped piece of data belongs now?
I am using iTextSharp to extract data from pdfs.
I stumbled across the following problem, depicted by the scenario below:
I created a sample excel file to illustrate. Here is what it looks like:
I convert it to a pdf, using one of the many free…

Veverke
- 9,208
- 4
- 51
- 95
1
vote
0 answers
PDFMiner incorrectly stacks list data?
I am trying to extract information out of a PDF using PDFMiner in a consistent manner so I can do further analysis, but I can't figure out how to correctly extract tabular data. PDFMiner seems to extract columns before rows. Has anyone solved this…

Yaegz
- 669
- 6
- 15
1
vote
1 answer
How can I get max fontsize of a pdf using pdfbox
I use PDFBox to extract some information from a PDF, but how can I extract every object's information? If one of them contains a stream, how can I decode the stream to display it?
Can I get the maximum font size using PDFBox? I think if I can…

dock
- 11
- 2
1
vote
1 answer
Can't get the texts' real fonts with itext?
I have been trying to extract text from a PDF, and thanks to iText I can extract the whole text. However, I am trying to detect the headings' fonts, and using this info I plan to extract only the text between two specific headings. For example in a…

mlee_jordan
- 772
- 4
- 18
- 50
1
vote
0 answers
gem install of pdf-extract on Macports / Mac OS X Yosemite
I am attempting to install pdf-extract on Mac OS X Yosemite. I assume it's better not to use the /usr/bin/ruby that comes with Yosemite, so I'm using the Macports version, /opt/local/bin/ruby (ver2.1.3).
The installation appears to go fine:
sudo…

nathanielng
- 1,645
- 1
- 19
- 30
0
votes
1 answer
Pdf parse to text using java
I have the same problem extracting Arabic text from a PDF file.
Can anyone help if they have found a solution?
I have tried many times with PDFBox but with no result.

Ouni Chafika
- 9
- 2
0
votes
0 answers
Facing issue in extracting Tables from PDF with tabula
I am trying to extract multiple tables from the PDF, but it throws a Command '['java', '-Dfile.encoding=UTF8', ERROR
Link to the PDF:
https://www.paypalobjects.com/marketing/web/US/en/merchant_fees/US-merchant-fees-24-July-2023.pdf
PDF has 42…

user21766269
- 19
- 2
0
votes
0 answers
How to extract header, paragraph, table structure from pdf using azure form recognizer in python
I would like to extract data such as headers, paragraphs, tables, page numbers, and page footers from a PDF into a dataframe using Azure Form Recognizer in Python.
Please find the expected output below.
I have tried using the layout model but the from the…

Niranjanp
- 301
- 2
- 5
- 15