Questions tagged [pdf-extraction]

Extracting text and other data from a PDF document, regardless of the libraries used to achieve this.

148 questions
4
votes
1 answer

How to get background color of a Text in PyMuPDF

I am trying to identify possible table headers in a table inside a PDF using the background and foreground color of the text. With PyMuPDF text extraction, I was able to get the foreground color. I am wondering if there is a way to get background…
Suvin K S
  • 229
  • 2
  • 8
4
votes
3 answers

Error while image extraction from PDF in python

I am trying to extract images of all formats from a PDF. I did some googling and found this page on StackOverflow. I tried this code but I am getting this error: I am using Python 3.x and here is the code I am using. I tried to go through comments…
john
  • 85
  • 2
  • 10
4
votes
1 answer

Extracting Text from a PDF with CID fonts

I'm writing a web app that extracts a line at the top of each page in a PDF. The PDFs come from different versions of a product and could go through a number of PDF printers, also in different versions and also different settings. So far using…
Red
  • 3,030
  • 3
  • 22
  • 39
4
votes
1 answer

Scrapy crawl data inside pdf file

I would like to know how to crawl data inside a PDF file using Scrapy. Which module should I use, and what is the best and most effective way? Could you please point me to some sample tutorials on this? Thanks!
Dev Pandu
  • 121
  • 2
  • 12
4
votes
4 answers

iText - Get Font size and family of a text segment

I'm currently trying to automatically extract important keywords from a PDF file. I am able to get the text information out of the PDF document. But now I need to know, which font size and font family these keywords have. The following code I…
Prine
  • 12,192
  • 8
  • 40
  • 59
3
votes
1 answer

Extract data from pdf invoices of varying formats

The objective is to extract data out of invoices in PDF format. PDF data format: selectable text (not scanned images) consisting of lines of text, name-value pairs, and tables (of varying lengths). Invoice data includes: invoice_no, invoice_date,…
3
votes
2 answers

Tabula-py omitting pages from a PDF document I am trying to extract

I am trying to extract tables from a multi-page PDF with tabula-py, and while the tables on some of the pages of the PDF are extracted perfectly, some pages are omitted entirely. The omissions seem to be random and don't follow any visible visual…
Sannita
  • 131
  • 1
  • 4
3
votes
1 answer

Huge white space after header in PDF using Flying Saucer

I am trying to export an HTML page into a PDF using Flying Saucer. For some reason, the pages have a large white space after the header (id = "divTemplateHeaderPage1") divisions. The jsFiddle link to my HTML code that is being used by PDF renderer:…
Sparks
  • 115
  • 1
  • 9
3
votes
3 answers

Counting the pages in a PDF file

I know of several tools/libraries that can do this, but I want to know if it is possible by just opening the file as a text file and looking for a keyword.
Chry Cheng
  • 3,378
  • 5
  • 47
  • 79
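
The page-counting question above asks whether scanning the raw file for a keyword can work. A minimal sketch of that idea follows, assuming the PDF stores its page objects uncompressed (no object streams) and has no stale objects left over from incremental updates; when either assumption fails, the count will be wrong. The function name `count_pages_heuristic` and the sample bytes are hypothetical and only for illustration.

```python
import re

# Matches a page object's "/Type /Page" entry but not "/Type /Pages",
# which marks a page-tree node rather than an individual page.
PAGE_OBJECT = re.compile(rb"/Type\s*/Page(?![A-Za-z])")


def count_pages_heuristic(pdf_bytes: bytes) -> int:
    """Count page objects by scanning raw bytes (a heuristic, not a parser)."""
    return len(PAGE_OBJECT.findall(pdf_bytes))


# Hypothetical, hand-written fragment imitating uncompressed PDF objects.
sample = (
    b"1 0 obj << /Type /Catalog /Pages 2 0 R >> endobj\n"
    b"2 0 obj << /Type /Pages /Kids [3 0 R 4 0 R] /Count 2 >> endobj\n"
    b"3 0 obj << /Type /Page /Parent 2 0 R >> endobj\n"
    b"4 0 obj << /Type/Page /Parent 2 0 R >> endobj\n"
)

print(count_pages_heuristic(sample))
```

For a real file, the bytes would come from `open(path, "rb").read()`. A PDF library is the more reliable choice whenever the file may use compressed object streams.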
2
votes
1 answer

RTL (Arabic) ligatures problem when extracting text from PDF

When extracting Arabic text from a PDF file using libraries like PyMuPDF or PDFMiner, the words are returned in backward order, which is normal behavior for RTL languages, and you need to use the bidi algorithm to be able to display it correctly…
Naourass Derouichi
  • 773
  • 3
  • 12
  • 38
2
votes
2 answers

How is the text from this pdf encoded?

I have some PDFs with data about machine parts and I am trying to extract sizes. I extracted the text from a PDF via pypdfium2. import pypdfium2 as pdfium pdf = pdfium.PdfDocument("myfile.pdf") page=pdf[1] textpage = page.get_textpage() Most of the…
HrkBrkkl
  • 613
  • 5
  • 22
2
votes
1 answer

Use pdfplumber to extract paragraphs

I'm using pdfplumber to extract text from a PDF. I'm able to extract lines of text, but I'm having trouble extracting a paragraph. Here's the current code I have. Example of text I want to extract: Paragraph Title Lorem ipsum dolor sit amet,…
Solana Liu
  • 45
  • 1
  • 1
  • 6
2
votes
0 answers

Extracting PDF tables with camelot-py (lattice): split_text does not work

When extracting a table using camelot, the text of two columns that are close together is merged into one, even though all lines are detected correctly. I am using the lattice flavor, as the table in the PDF has lines. I set split_text = True but it…
Tomper
  • 78
  • 7
2
votes
1 answer

Extracting comments/annotations from PDF sequentially - Python

I am trying to extract comments from a PDF using Python. These are the two pieces of code that I have tested: One using PyPDF2: import PyPDF2 src = 'xxxx.pdf' input1 = PyPDF2.PdfFileReader(open(src, "rb")) nPages = input1.getNumPages() df_comments…
Debadri Dutta
  • 1,183
  • 1
  • 13
  • 39
2
votes
1 answer

Camelot Cannot extract entire table

I'm using Camelot to extract table information from a PDF that I have converted from scanned to searchable using ocrmypdf (500 dpi). Camelot seems to be able to identify the table and extract most of the data within the table but it seems to be unable…