Highest Voted 'pdf-extraction' Questions

4 votes

1 answer

How to get background color of a Text in PyMuPDF

I am trying to see whether I can identify possible table headers in a table inside a PDF using the background and foreground color of the text. With PyMuPDF text extraction, I was able to get the foreground color. I am wondering if there is a way to get the background…

python pdf-extraction pymupdf

asked Sep 26 '19 at 06:30

Suvin K S

229
2
8

4 votes

3 answers

Error while image extraction from PDF in python

I am trying to extract images of all formats from a PDF. I did some googling and found this page on StackOverflow. I tried this code but I am getting this error: I am using Python 3.x and here is the code I am using. I tried to go through comments…

python python-imaging-library pypdf pdf-extraction

asked Dec 09 '17 at 17:01

john

85
2
10

4 votes

1 answer

Extracting Text from a PDF with CID fonts

I'm writing a web app that extracts a line at the top of each page in a PDF. The PDFs come from different versions of a product and could go through a number of PDF printers, also in different versions and with different settings. So far using…

pdf fonts itext pdfsharp pdf-extraction

asked Oct 29 '15 at 11:59

Red

3,030
3
22
39

4 votes

1 answer

Scrapy crawl data inside pdf file

I would like to know how to crawl data inside a PDF file using Scrapy. Which module should I use, and what is the best and most effective way? Could you please point me to some sample tutorials on this? Thanks!

python python-2.7 pdf scrapy pdf-extraction

asked Jul 08 '15 at 09:10

Dev Pandu

121
2
12

4 votes

4 answers

iText - Get Font size and family of a text segment

I'm currently trying to automatically extract important keywords from a PDF file. I am able to get the text information out of the PDF document. But now I need to know which font size and font family these keywords have. The following code I…

java pdf itext text-extraction pdf-extraction

asked Jun 04 '12 at 09:48

Prine

12,192
8
40
59

3 votes

1 answer

Extract data from pdf invoices of varying formats

The objective is to extract data from invoices in PDF format. PDF data format: selectable text (not scanned images) consisting of lines of text, name-value pairs, and tables (of varying lengths). Invoice data includes: invoice_no, invoice_date,…

pdf data-extraction pdf-extraction

asked May 15 '20 at 19:21

Amit Bhagat

61
7

3 votes

2 answers

Tabula-py omitting pages from a PDF document I am trying to extract

I am trying to extract tables from a multi-page PDF with tabula-py, and while the tables on some of the pages of the PDF are extracted perfectly, some pages are omitted entirely. The omissions seem to be random and don't follow any visible visual…

python pdf tabula pdf-extraction

asked Jul 29 '18 at 23:46

Sannita

131
1
4

3 votes

1 answer

Huge white space after header in PDF using Flying Saucer

I am trying to export an HTML page into a PDF using Flying Saucer. For some reason, the pages have a large white space after the header (id = "divTemplateHeaderPage1") divisions. The jsFiddle link to my HTML code that is being used by PDF renderer:…

java html itext flying-saucer pdf-extraction

asked Dec 16 '17 at 12:14

Sparks

115
1
9

3 votes

3 answers

Counting the pages in a PDF file

I know of several tools/libraries that can do this, but I want to know if this is possible by just opening the file as a text file and looking for a keyword.

pdf pdf-extraction

asked Oct 05 '10 at 06:45

Chry Cheng

3,378
5
47
79

2 votes

1 answer

RTL (Arabic) ligatures problem when extracting text from PDF

When extracting Arabic text from a PDF file using libraries like PyMuPDF or PDFMiner, the words are returned in backward order, which is normal behavior for RTL languages, and you need to use the bidi algorithm to be able to display it correctly…

python pdfminer pymupdf bidi pdf-extraction

asked Jan 30 '23 at 03:41

Naourass Derouichi

773
3
12
38

2 votes

2 answers

How is the text from this pdf encoded?

I have some PDFs with data about machine parts and I am trying to extract sizes. I extracted the text from a PDF via pypdfium2. import pypdfium2 as pdfium pdf = pdfium.PdfDocument("myfile.pdf") page=pdf[1] textpage = page.get_textpage() Most of the…

python encoding pdf-extraction

asked Nov 22 '22 at 15:09

HrkBrkkl

613
5
22

2 votes

1 answer

Use pdfplumber to extract paragraphs

I'm using pdfplumber to extract text from a pdf. I'm able to extract lines of text, but I'm having trouble extracting a paragraph. Here's the current code I have. Example of text I want to extract: Paragraph Title Lorem ipsum dolor sit amet,…

python pdf-extraction pdfplumber

asked Feb 15 '22 at 00:28

Solana Liu

45
1
1
6

2 votes

0 answers

Extracting PDF tables with camelot-py (lattice): split_text does not work

When extracting a table using camelot, the text of two columns that are close together is merged into one, even though all lines are detected correctly. I am using the lattice flavor, as the table in the PDF has lines. I set split_text = True but it…

python python-camelot pdf-extraction

asked Oct 15 '21 at 12:08

Tomper

78
7

2 votes

1 answer

Extracting comments/annotations from PDF sequentially - Python

I am trying to extract comments from a PDF using Python. These are the two pieces of code that I have tested: One using PyPDF2: import PyPDF2 src = 'xxxx.pdf' input1 = PyPDF2.PdfFileReader(open(src, "rb")) nPages = input1.getNumPages() df_comments…

python pypdf pdf-extraction

asked Jul 06 '21 at 07:33

Debadri Dutta

1,183
1
13
39

2 votes

1 answer

Camelot Cannot extract entire table

I'm using Camelot to extract table information from a PDF that I have converted from scanned to searchable using ocrmypdf (500 dpi). Camelot seems to be able to identify the table and extract most of the data within the table but it seems to be unable…

python pdf-extraction python-camelot pdftables ocrmypdf

asked Jun 26 '21 at 14:58

Douglas Griffin

21
1
