Questions tagged [pdf-parsing]

Deals with extracting useful information from PDF content (for example, text or images)

PDF (Portable Document Format) is a binary format for digital documents. This tag covers parsing these documents, that is, extracting text, images, or other data from them, or converting them to simpler formats (such as plain text).

Because the PDF format is complex (cf. the specification, ISO 32000-1), PDF parsers are often incomplete (they cannot extract all information from all documents) and can be exposed to security risks.

For example, pdf-parser is a command-line program that parses and analyses PDF documents. It provides features to extract raw data from PDF documents, like compressed images.

Python-related options:

You can extract tables directly using camelot ("PDF Table Extraction for Humans").
You can process the PDF directly using tabula.
You can convert the PDF to text using pdftotext, then parse the text with Python.
You can use an external tool to convert the PDF file to Excel or CSV, then open the Excel/CSV file with a suitable Python module.
You can also convert the PDF to an image file, then use recent OCR software (which can reconstruct tables automatically from the picture) to get the data, for example pdf2image with pytesseract.
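
As a rough illustration of the "pdftotext, then parse the text with Python" option, here is a minimal sketch. It assumes the pdftotext command-line tool (from Poppler) is installed and on the PATH. The helper names pdf_to_text and split_columns are made up for this example, and the "two or more spaces separate columns" rule is only a heuristic that will not suit every document.

```python
import re
import subprocess


def pdf_to_text(pdf_path):
    """Run the pdftotext CLI and return the extracted text.

    Assumption: pdftotext (Poppler) is installed and on PATH.
    "-layout" keeps the physical column layout; "-" writes to stdout.
    """
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def split_columns(text):
    """Split layout-preserved text into rows of cells.

    Heuristic (an assumption, not a general rule): runs of two or
    more whitespace characters separate columns.
    """
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between table rows
        rows.append(re.split(r"\s{2,}", line))
    return rows


# Hand-written sample that mimics "pdftotext -layout" output.
sample = (
    "Item        Qty    Price\n"
    "Widget A     2     3.50\n"
    "\n"
    "Gadget      10    12.00\n"
)
rows = split_columns(sample)
for row in rows:
    print(row)
```

For a real file, something like split_columns(pdf_to_text("input.pdf")) would be the entry point, where "input.pdf" is a placeholder path. Tables with empty cells or single-space column gaps need a more robust approach, such as the table-extraction libraries listed above.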

Related Questions:

177 questions

vote

1 answer

Class 'Smalot\PdfParser\Parser' not found

I am trying to use the PdfParser library to parse a PDF file, but I have some issues with class inclusion. I read the documentation but it doesn't work. I use Windows and XAMPP. I created a directory in /xampp/htdocs/pdf_import. I installed Composer…

php pdf-parsing

asked Dec 24 '14 at 10:15

bit

vote

0 answers

Parsing pdf with itext?

I am having trouble getting consistent results using the iText parser. This is the code: public void parsePdf(String pdf) throws IOException { PdfReader reader = new PdfReader(pdf); Rectangle rect = new Rectangle(370,280, 380, 613); …

java itext pdf-parsing

asked Oct 24 '14 at 13:30

caniaskyouaquestion

vote

2 answers

Reading PDF Literal String parsing dilemma

I have the following contents in the same PDF page, in different ObjectX: First: [(some text)] TJ ET Q [(some other text)] TJ ET Q Very simple and basic so far... The second: [( H T M L E x a m p l e)] TJ ET Q [( S o m e s p e c i a l c h a r…

java pdf encoding character-encoding pdf-parsing

asked Oct 14 '14 at 00:47

TacB0sS

vote

3 answers

Which is best PDF parser?

I want to parse the tabular information from a .pdf file and display it in a DataGridView in C#. What choices do I have?

c# .net winforms pdf pdf-parsing

asked Mar 18 '10 at 09:42

Harikrishna

vote

0 answers

PHP: Parsed PDF-File full of Control-Characters

I've got a problem parsing this pdf-file: http://www.transperfect.com/sites/default/files/imported/pdf/Tokyo_Client_Services_Representative.pdf After I encoded the FlateDecode decoded pdf-file the output is something like this: Usually it's easy to…

php pdf control-characters pdf-parsing

asked May 02 '14 at 12:59

user3596202

vote

1 answer

PDF transformation matrix has a scaling of 50 units

I'm trying to highlight some text with a glyph width of 1000 (which corresponds to 1 unit of text space) and font size of 1; the transformation matrix is [50 0 0 50 0 0]. The result is text that is too big. But this is not the case. The text that is…

pdf pdf-parsing

asked Feb 12 '14 at 11:48

Diego A. Rincon

vote

2 answers

Font information of text in PDF using PDFBox

I am new to the Apache PDFBox library. I want to map font information to the PDF paragraphs. I have already gone through the question "How to extract font styles of text contents using pdfbox?", but it doesn't give information about which paragraph is written…

java pdfbox text-extraction pdf-parsing

asked Nov 21 '13 at 07:32

Gaurav Singh

vote

0 answers

How to resolve pdf parsing error

scala code : val file = new File(path + name) val raf = new RandomAccessFile(file, "r") val channel = raf.getChannel() val buf = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size()) val pdffile = new PDFFile(buf) …

scala pdf pdf-parsing

asked Oct 22 '13 at 12:03

Rishi

vote

1 answer

Trying to annotate a PDF with XREF streams

I have this sample PDF file: Original file, which I tried to attach a text annotation to, which resulted in this: Annotated file. However, Preview on Mac OS X still shows the document without the new annotation, whereas Adobe Reader cannot even open the…

pdf pdf-generation pdf-parsing

asked Feb 20 '13 at 17:33

Hasib Samad

vote

1 answer

Mixing XRef Tables and XRef Streams

Is it true that you cannot have both ordinary XRef tables and XRef streams in a PDF file? I thought this is what is called a "hybrid PDF document"! Any idea?

pdf pdf-generation pdf-parsing

asked Feb 19 '13 at 16:15

Hasib Samad

vote

2 answers

parse pdf from url on java. can i use jsoup?

I have the URL http://pasca.undiksha.ac.id/e-journal/index.php/jurnal_bahasa/article/view/500 (it does not point directly to a PDF, but redirects to a PDF file). I want to parse this PDF file and get its text. I tried using jsoup: ` String url =…

java pdf jsoup pdf-parsing

asked Jan 29 '13 at 08:11

rey1024

vote

0 answers

Exporting embedded Adobe PDF Reader text

I have an embedded Adobe PDF Reader in my Windows application. When I open a certain PDF file, what I need to do is manually select some text in that PDF and transfer it over to a textbox. I haven't done much work with PDF embedded components. But I can see…

pdf text components pdf-parsing

asked Jun 14 '12 at 22:23

user1457387

votes

1 answer

Document AI to perform automatic research in large amount of data from pdf files

I need to add a feature for my app to allow my clients to extract text from image texts and parse them to usable data like json format and store them to then be able to perform better data research. Those image-texts are big pdf files (~150-500…

json parsing ocr cloud-document-ai pdf-parsing

asked Jul 13 '23 at 15:32

prime

votes

0 answers

How to convert, or read a .doc file with PHPWord?

I've crawled this and other websites and found no solutions to this: I'm trying to read the text from a .doc file using PHPOffice/PHPWord and all the code I've tried has failed. I can read .docx files just fine, it's just 97-03 Word documents that…

php docx phpword phpoffice pdf-parsing

asked Jul 10 '23 at 08:35

Grimcall

votes

0 answers

How to make PDFMiner Six detect bullet points (including alphanumeric bullets) when parsing documents?

I am currently using PDFMiner.six to parse documents for me, but would like it to be able to detect bullet points (including alphanumeric bullets like "a.", "i.", "1."). For now, it only treats them as characters, but I was wondering if I am missing…

python pdf package detection pdf-parsing

asked Jun 02 '23 at 15:03

belacile
