Questions tagged [pdf-parsing]

Deals with extracting useful information from PDF content (for example, text or images)

PDF (Portable Document Format) is a binary format for digital documents. This tag is concerned with parsing these documents, that is, extracting text, images, or other data from them, or converting them to simpler formats (such as plain text).

Because the PDF format is complex (cf. the ISO 32000-1 specification), parsers are often incomplete (they cannot extract all information from all documents) and can be exposed to security risks.

For example, pdf-parser is a command-line program that parses and analyses PDF documents. It provides features to extract raw data from PDF documents, like compressed images.

Python-related options:

  • Extract tables directly with camelot ("PDF Table Extraction for Humans").
  • Process the PDF directly with tabula.
  • Convert the PDF to text with pdftotext, then parse the text with Python.
  • Use an external tool to convert the PDF file to Excel or CSV, then open the Excel/CSV file with the appropriate Python module.
  • Convert the PDF to an image file, then use recent OCR software (which can reconstruct tables automatically from the picture) to get the data.
  • Use pdf2image together with pytesseract.
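
As an illustration of the "convert to text, then parse" option, here is a minimal sketch. It assumes the Poppler `pdftotext` command-line tool is installed and on the PATH; the helper names `pdf_to_text` and `split_columns` and the sample text are hypothetical and only serve the example. Splitting on runs of two or more spaces is a simple heuristic that works for some layouts and will fail on others.

```python
import re
import subprocess


def pdf_to_text(pdf_path):
    """Run the external `pdftotext` tool and return its output as a string.

    `-layout` asks pdftotext to keep the physical layout of the page,
    and `-` as the output file sends the text to standard output.
    """
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        check=True,
        capture_output=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def split_columns(text):
    """Split each non-empty line into cells on runs of two or more spaces."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if line:
            rows.append(re.split(r"\s{2,}", line))
    return rows


# Hypothetical sample of layout-preserved text for a small table.
sample = "Item      Qty   Price\nBlue pen    3    1.50\n"
rows = split_columns(sample)
```

With a real file, `split_columns(pdf_to_text("report.pdf"))` would follow the same path, where `report.pdf` is a placeholder name. Cells separated by only one space are not split, so single spaces inside a cell ("Blue pen") survive, but tightly packed columns would be merged.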

Related Questions:

177 questions
1
vote
1 answer

Class 'Smalot\PdfParser\Parser' not found

I am trying to use the Pdfparser library to parse a PDF file, but I have some issues with class inclusion. I read the documentation but it doesn't work. I use Windows and XAMPP. I created a directory in /xampp/htdocs/pdf_import. I installed Composer…
bit
1
vote
0 answers

Parsing pdf with itext?

I am having trouble getting consistent results using the iText parser. This is the code: public void parsePdf(String pdf) throws IOException { PdfReader reader = new PdfReader(pdf); Rectangle rect = new Rectangle(370,280, 380, 613); …
caniaskyouaquestion
1
vote
2 answers

Reading PDF Literal String parsing dilemma

I have the following contents in the same PDF page, in different ObjectX: First: [(some text)] TJ ET Q [(some other text)] TJ ET Q Very simple and basic so far... The second: [( H T M L E x a m p l e)] TJ ET Q [( S o m e s p e c i a l c h a r…
TacB0sS
1
vote
3 answers

Which is the best PDF parser?

I want to parse the tabular information from a .pdf file and display it in a DataGridView in C#. What choices do I have?
Harikrishna
1
vote
0 answers

PHP: Parsed PDF-File full of Control-Characters

I've got a problem parsing this pdf-file: http://www.transperfect.com/sites/default/files/imported/pdf/Tokyo_Client_Services_Representative.pdf After I encoded the FlateDecode decoded pdf-file the output is something like this: Usually it's easy to…
1
vote
1 answer

PDF transformation matrix has a scaling of 50 units

I'm trying to highlight some text with a glyph width of 1000 (which corresponds to 1 unit of text space) and a font size of 1; the transformation matrix is [50 0 0 50 0 0]. The result is text that is too big. But this is not the case. The text that is…
Diego A. Rincon
1
vote
2 answers

Font information of text in PDF using PDFBox

I am new to the Apache PDFBox library. I want to map font information to the PDF paragraphs. I have already gone through the question "How to extract font styles of text contents using pdfbox?", but it doesn't give information about which paragraph is written…
Gaurav Singh
1
vote
0 answers

How to resolve pdf parsing error

scala code : val file = new File(path + name) val raf = new RandomAccessFile(file, "r") val channel = raf.getChannel() val buf = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size()) val pdffile = new PDFFile(buf) …
Rishi
1
vote
1 answer

Trying to annotate a PDF with XREF streams

I have this sample PDF file: Original file, which I tried to attach a text annotation to, which resulted in this: Annotated file. However, Preview on Mac OS X still shows the document without the new annotation, whereas Adobe Reader cannot even open the…
Hasib Samad
1
vote
1 answer

Mixing XRef Tables and XRef Streams

Is it true that you cannot have both common XRef tables and XRef streams in a PDF file? I thought this is what is called a "hybrid PDF document"! Any idea?
Hasib Samad
1
vote
2 answers

parse pdf from url on java. can i use jsoup?

I have the URL http://pasca.undiksha.ac.id/e-journal/index.php/jurnal_bahasa/article/view/500 (it does not point to the PDF directly, but redirects to a PDF file). I want to parse this PDF file and get its text. I tried using jsoup: ` String url =…
rey1024
1
vote
0 answers

Exporting embedded Adobe PDF Reader text

I have an embedded Adobe PDF Reader in my Windows application. When I open a certain PDF file, what I need to do is manually select some text in that PDF and transfer it to a textbox. I haven't done much work with embedded PDF components. But I can see…
0
votes
1 answer

Document AI to perform automatic research in large amount of data from pdf files

I need to add a feature for my app to allow my clients to extract text from image texts and parse them to usable data like json format and store them to then be able to perform better data research. Those image-texts are big pdf files (~150-500…
prime
0
votes
0 answers

How to convert, or read a .doc file with PHPWord?

I've crawled this and other websites and found no solutions to this: I'm trying to read the text from a .doc file using PHPOffice/PHPWord and all the code I've tried has failed. I can read .docx files just fine, it's just 97-03 Word documents that…
0
votes
0 answers

How to make PDFMiner Six detect bullet points (including alphanumeric bullets) when parsing documents?

I am currently using PDFMiner.six to parse documents for me, but would like it to be able to detect bullet points (including alphanumeric bullets like "a.", "i.", "1."). For now, it only treats them as characters, but I was wondering if I am missing…