Questions tagged [pdf-extraction]

Extracting text and other data from a PDF document, regardless of the libraries used to achieve this.

148 questions
1 vote, 2 answers

Remove whitespace from PDF Document

I am using Camelot-py to read and extract attributes from several PDFs. I use table_areas to extract some of the attributes and I am facing difficulties in setting the correct areas, due to the deviation in X or Y co-ordinates between some of the…
A.A. F
1 vote, 1 answer

Node.js - Problem extracting text from a PDF file using Google Cloud Vision API

I'm new to cloud environments and programming in general, and I'm struggling to use the Google Vision API to extract text from a PDF file located in a remote bucket. I've found it really difficult to get meaningful content related to this subject in…
1 vote, 1 answer

Extracting specific segments from PDF document

I have a few research papers in PDF format and I want to extract just the introduction, background, etc. from each paper. Also, I can only use Python. Can someone please help?
Cheryl
1 vote, 0 answers

getting java.lang.ClassNotFoundException: org.apache.pdfbox.exceptions.CryptographyException when using Lucene-PDFbox jar

When I run this code, I get the following exception. It runs fine with only the PDFBox jar; I get this exception only with the Lucene-PDFBox jar. import org.apache.lucene.document.Document; import org.apache.lucene.document.Field; import…
monty
1 vote, 1 answer

Extracting .pdf table

I wrote a chunk of R code that gets the .pdf table I am interested in, but there must be a better way. So I have no problem importing the data from the PDF; I am looking for a BETTER way than the following to extract the tables I am…
Helena
1 vote, 0 answers

iOS Swift PDFDocument, Turkish Characters Broken Export

Problem: Turkish characters are broken in the string exported from some PDFs. Sample.pdf // Original content: “İzmir, çanakkale, kaş, ırmak, bağlıca, çin” Example: let document = PDFDocument.init("sample.pdf") print(document?.string) // Output: zmir anakkale kaş, rmak, b…
redsponge
1 vote, 3 answers

AttributeError: 'PDFPage' object has no attribute 'extractText'

I am trying to extract the content from a PDF in order to create an Excel sheet out of it. What I tried: import pdfquery pdf = pdfquery.PDFQuery('C:\\Users\\Santosh\\Downloads\\2017-San-Jamar- Price-List-US-Z120913E-RevA.pdf') page =…
Santosh
1 vote, 3 answers

iTextSharp extracts wrapped cell contents into new lines - how do you identify to which column a given wrapped piece of data belongs now?

I am using iTextSharp to extract data from PDFs. I stumbled across the following problem, depicted by the scenario below: I created a sample Excel file to illustrate. Here is what it looks like: I convert it to a PDF, using one of the many free…
Veverke
1 vote, 0 answers

PDFMiner incorrectly stacks list data?

I am trying to extract information out of a PDF using PDFMiner in a consistent manner so I can do further analysis, but I can't figure out how to correctly extract tabular data. PDFMiner seems to extract columns before rows. Has anyone solved this…
Yaegz
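
For the columns-before-rows behaviour described in this question, one common workaround is to ignore the order in which fragments are emitted and regroup them by position instead. The sketch below is only an illustration under assumptions: it does not call PDFMiner at all, the `Fragment` type and the `group_into_rows` helper are hypothetical names, and it assumes each text fragment has already been obtained together with its x and y coordinates (with y growing upward, as in PDF user space). The `y_tolerance` value is a guess that would need tuning per document.

```python
from collections import namedtuple

# Hypothetical fragment type: a piece of text plus the coordinates of its origin.
# How these are obtained from a PDF library is outside the scope of this sketch.
Fragment = namedtuple("Fragment", ["x", "y", "text"])


def group_into_rows(fragments, y_tolerance=2.0):
    """Regroup text fragments into table rows by their y-coordinate.

    Fragments whose y differs from the first fragment of the current row by at
    most y_tolerance are treated as the same row. Rows come back top to bottom
    (largest y first) and cells within a row left to right (smallest x first).
    """
    rows = []  # each entry is [reference_y, [fragments in that row]]
    for frag in sorted(fragments, key=lambda f: -f.y):
        if rows and abs(rows[-1][0] - frag.y) <= y_tolerance:
            rows[-1][1].append(frag)
        else:
            rows.append([frag.y, [frag]])
    return [[f.text for f in sorted(cells, key=lambda f: f.x)] for _, cells in rows]


# Fragments listed column by column, the way the question describes the output.
sample = [
    Fragment(50, 700.0, "Name"),
    Fragment(50, 685.5, "Alice"),
    Fragment(50, 670.0, "Bob"),
    Fragment(200, 700.4, "Age"),
    Fragment(200, 685.0, "30"),
    Fragment(200, 670.3, "25"),
]

for row in group_into_rows(sample):
    print(row)
```

Grouping on a tolerance rather than exact equality matters because baselines in the same visual row often differ by a fraction of a point. This approach will not handle wrapped cells or rows of very different heights without further work.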
1 vote, 1 answer

How can I get the max font size of a PDF using PDFBox?

I use PDFBox to extract some information from a PDF, but how can I extract every object's information? If one of them contains a stream, how can I decode the stream to display it? Can I get the maximum font size with PDFBox? I think if I can…
dock
1 vote, 1 answer

Can't get the texts' real fonts with itext?

I have been trying to extract text from a PDF, and thanks to iText I can extract the whole text. However, I am trying to detect the headings' fonts, and by using this info I am planning to extract only the text between two specific headings. For example in a…
mlee_jordan
1 vote, 0 answers

gem install of pdf-extract on Macports / Mac OS X Yosemite

I am attempting to install pdf-extract on Mac OS X Yosemite. I assume it's better not to use the /usr/bin/ruby that comes with Yosemite, so I'm using the Macports version, /opt/local/bin/ruby (ver2.1.3). The installation appears to go fine: sudo…
nathanielng
0 votes, 1 answer

Parsing a PDF to text using Java

I have the same problem extracting Arabic text from a PDF file. Can anyone help if they have a solution? I have tried many times with PDFBox but got no result.
0 votes, 0 answers

Facing issue in extracting Tables from PDF with tabula

I am trying to extract multiple tables from a PDF, which throws the error Command '['java', '-Dfile.encoding=UTF8', ERROR. Link to the PDF: https://www.paypalobjects.com/marketing/web/US/en/merchant_fees/US-merchant-fees-24-July-2023.pdf The PDF has 42…
0 votes, 0 answers

How to extract header, paragraph, table structure from pdf using azure form recognizer in python

I would like to extract data such as headers, paragraphs, tables, page numbers, and page footers from a PDF into a DataFrame using Azure Form Recognizer in Python. Please find the expected output below. I have tried using the layout model but the from the…
Niranjanp