I have a PDF document that I need to read data from. I discovered that when I convert the PDF to an XML document, the result contains convenient tags I can read from. So I need a way to convert my files to XML in code, so that I can then use mapper files to read the data into a database.
-
My "too broad" sense is tingling. Could you try to clarify your question? What sort of PDF file do you have and what do you need to extract from it into what sort of XML? Are you stuck on some specific part of this task? – millimoose Jun 21 '12 at 22:08
-
A PDF file. I need to extract data from a bunch of PDF documents. They are not formatted in any standard way, but I know some of them were generated using Microsoft Excel, while others were not. I want to convert them to XML, since I believe XML is easier to manipulate. – Kobojunkie Jun 21 '12 at 22:22
-
Well, I am kind of stuck. I don't know which classes in iText will let me convert the PDF documents to XML on the fly. From the examples and information I have gleaned so far, there seems to be more on converting XML/HTML to PDF, which is the opposite of what I want. – Kobojunkie Jun 21 '12 at 22:23
-
Googling for "iText extraction" gives me a bunch of results, including this one which seems to be tutorial-level: http://what-when-how.com/itext-5/parsing-pdfs-part-2-itext-5/ . This part of the API docs is probably also relevant: http://api.itextpdf.com/itext/com/itextpdf/text/pdf/parser/package-summary.html . Last but not least, check the iText in Action book: http://www.manning.com/lowagie/ . (Actually, the book is what you should check *first* for iText questions.) – millimoose Jun 21 '12 at 22:30
-
Also, be aware that extracting text from PDFs is very, very fiddly. There's a significant probability it might end up not being worth the required effort. – millimoose Jun 21 '12 at 22:31
-
I am not trying to extract text from PDFs. I am trying to convert PDF to XML, which I believe is better for extraction purposes. The Manning book does not contain examples of how to do this, as far as I know. – Kobojunkie Jun 21 '12 at 22:50
-
See page 513 in my PDF, which uses the [`TaggedPdfReaderTool`](http://api.itextpdf.com/itext/com/itextpdf/text/pdf/parser/TaggedPdfReaderTool.html) from the package whose API docs I linked above. And past one link in the third thing I linked, mostly because it seems to be a pirated copy of iText in Action put online. – millimoose Jun 21 '12 at 23:12
-
And if your original PDFs aren't tagged, you'll necessarily have to do text extraction. – millimoose Jun 21 '12 at 23:13
-
possible duplicate of [pdf to xml conversion using .NET](http://stackoverflow.com/questions/6287880/pdf-to-xml-conversion-using-net) – Greg Hewgill Jun 22 '12 at 02:25
-
This is a possible duplicate of [this](http://stackoverflow.com/questions/6287880/pdf-to-xml-conversion-using-net) Stack Overflow question. Anyway, check out the answers in that post. – SuperPrograman Jun 22 '12 at 02:21
1 Answer
Use PDFMiner
PDFMiner is a Python tool for extracting information from PDF documents. It includes a PDF converter that can transform PDF files into other text formats (such as XML or HTML).
Unlike other PDF-related tools, it focuses entirely on getting and analyzing text data. PDFMiner allows one to obtain the exact location of text in a page, as well as other information such as fonts or lines.
It has an extensible PDF parser that can be used for purposes other than text analysis.
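Here is a rough sketch of how that could look in practice. It assumes the maintained `pdfminer.six` fork and its `extract_text_to_fp` helper for the conversion step, then reads the resulting XML with the standard library. The function names `pdf_to_xml` and `read_text_lines` and the sample XML are made up for illustration; the sample only approximates what the converter emits, and real output carries more attributes per element.

```python
import xml.etree.ElementTree as ET


def pdf_to_xml(pdf_path, xml_path):
    """Convert a PDF to PDFMiner's XML layout format (needs pdfminer.six)."""
    # Imported lazily so the parsing helper below works without pdfminer.
    from pdfminer.high_level import extract_text_to_fp

    with open(pdf_path, "rb") as pdf_file, open(xml_path, "wb") as xml_file:
        extract_text_to_fp(pdf_file, xml_file, output_type="xml", codec="utf-8")


def read_text_lines(xml_source):
    """Return one record per <textline>: page id, bounding box, joined text."""
    root = ET.fromstring(xml_source)
    records = []
    for page in root.iter("page"):
        for line in page.iter("textline"):
            # Each <text> child holds a single character; join them per line.
            text = "".join(ch.text or "" for ch in line.iter("text")).strip()
            if not text:
                continue
            bbox = tuple(float(v) for v in line.get("bbox").split(","))
            records.append({"page": page.get("id"), "bbox": bbox, "text": text})
    return records


# Hand-written sample (an assumption, trimmed for illustration).
SAMPLE_XML = """<pages>
<page id="1" bbox="0.000,0.000,612.000,792.000" rotate="0">
<textbox id="0" bbox="72.000,700.000,120.000,712.000">
<textline bbox="72.000,700.000,120.000,712.000">
<text font="Helvetica" bbox="72.000,700.000,80.000,712.000" size="12.000">I</text>
<text font="Helvetica" bbox="80.000,700.000,88.000,712.000" size="12.000">D</text>
<text>
</text>
</textline>
</textbox>
</page>
</pages>"""

if __name__ == "__main__":
    for record in read_text_lines(SAMPLE_XML):
        print(record)
```

Each record keeps the line's position on the page, so a mapper can pick out fields by location before writing them to the database. Note that this is still text extraction under the hood, so results will vary with how each PDF was produced.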

codingscientist