How to extract photos and text from an online PDF

Question

I know that PDFBox and iText already exist, but as far as I can tell they cannot extract visual content, and they need to work offline on the PDF. I want a way to extract text and visual content online, without downloading the PDF file first and then processing it. What kind of API or library is there for Java?

EDIT: for those who find this unclear, here is some more explanation:

Imagine that with any HTML parser you can parse a page online, build the DOM or SAX tree, walk through its elements, and then extract photos and text based on the content of the nodes in those trees. For photos, at least, you can get the corresponding HTML tags; for text, the same, plus you can get the actual text. Now, I want to know whether there is anything similar for PDFs: going through text and images without downloading.

PDFBox can extract text and images. And of course you will have to download the PDF. — Tilman Hausherr, Jan 25 '15 at 17:38
*visual content extraction* - explain what you mean, please. Furthermore there does not seem to be any sense in your online-offline explanation. — mkl, Jan 25 '15 at 21:43
@mkl Alright! if too hard for brain to handle it, I give you an example. Imagine that with any `HTML parser` you can parse a page online, build the DOM or SAX tree, walk through its elements, and then extract photos and text based on the content of the nodes in those trees. For photos, at least, you can get the corresponding HTML tags; for text, the same, plus you can get the actual text. Now, I want to know whether there is anything similar for PDFs: going through text and images without downloading the PDF? — lonesome, Jan 26 '15 at 02:12
This may surprise you, but DOM and SAX do read the HTML file. And the photos on a web page (with the exception of some exotic things like "data:") are not part of the HTML at all, these are files. — Tilman Hausherr, Jan 26 '15 at 07:44
@lonesome *if too hard for brain to handle* - well, the brain knows that an HTML parser downloads the HTML before actually parsing it whenever it is asked to parse an online HTML page, and photos (unless base64-URL encoded) are separate files. As you say you don't want to download, an HTML parser is an example of what you *don't* want. — mkl, Jan 26 '15 at 08:37
@lonesome That being said, due to the special structure of PDF files, it is indeed not necessary to download the whole file to, e.g., retrieve only the contents of a single page. For partial file retrievals to work, though, the HTTP server needs to support range requests. For static PDFs that may already be possible fairly often, but for dynamically (on request) generated PDFs it will hardly ever work. — mkl, Jan 26 '15 at 08:45
@mkl I can't remember an HTML parser ever downloading anything or creating any folders when I used one. I am quite sure it did everything without downloading anything. For example, if you have ever used `Hotmail` to view a PDF attached to a mail, it opens it online as a Word document. You can select text and pictures from it without downloading it. I want something like that: to access the PDF on the site, not on my HDD. I mean on the website where the PDF already is. — lonesome, Jan 26 '15 at 09:26
Just because you have the UI experience that "it all happens in the browser" doesn't mean that the PDF isn't downloaded somehow. Either it is downloaded locally (e.g. with JavaScript), or the server handles the file, converts it, and then offers parts of it in the browser. The scenario that @mkl describes is of course possible, but only for PDF files with a correct xref table. — Tilman Hausherr, Jan 26 '15 at 10:28
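A minimal sketch of the range-request idea from the comment above, using only the standard `java.net` classes. The class and method names (`RangeRequestSketch`, `rangeHeader`, `fetchRange`) are made up for illustration, and the code only fetches raw bytes; deciding which byte ranges of a PDF to ask for (via its xref table) is left out entirely.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical helper class, not part of PDFBox, iText, or any other library.
class RangeRequestSketch {

    /** Builds the value of an HTTP Range header for an inclusive byte span. */
    static String rangeHeader(long firstByte, long lastByte) {
        if (firstByte < 0 || lastByte < firstByte) {
            throw new IllegalArgumentException("invalid byte range");
        }
        return "bytes=" + firstByte + "-" + lastByte;
    }

    /**
     * Asks the server for only the given bytes of a remote file.
     * Returns null when the server does not answer with 206 Partial Content
     * (for example because it ignores the Range header and sends the whole
     * file), so the caller can fall back to a full download.
     * Assumes an http/https URL; other schemes would fail the cast below.
     */
    static byte[] fetchRange(URL url, long firstByte, long lastByte) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestProperty("Range", rangeHeader(firstByte, lastByte));
        try {
            if (conn.getResponseCode() != HttpURLConnection.HTTP_PARTIAL) {
                return null;
            }
            try (InputStream in = conn.getInputStream()) {
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buffer = new byte[8192];
                int n;
                while ((n = in.read(buffer)) != -1) {
                    out.write(buffer, 0, n);
                }
                return out.toByteArray();
            }
        } finally {
            conn.disconnect();
        }
    }
}
```

A caller would pass the PDF's URL and a byte span, and treat a `null` result as "this server does not support range requests for this resource".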
@lonesome *not on my HDD* - a download may be into memory, it does not need to be on disc. Working with in-memory representations is possible both in PDFBox and in iText, either directly or by means of memory-based streams. *Hotmail* - No, I have not used that service yet, so I don't know whether it actually displays the PDF or transforms the PDF to some other format on the fly which it then displays page by page. Nonetheless, to directly access a PDF, you need to download parts of it, and if the server does not support range requests, you need to download the whole PDF, to memory or to disc. — mkl, Jan 26 '15 at 10:41
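A minimal sketch of "downloading into memory", as described in the comment above, using only the Java standard library. The names `InMemoryDownloadSketch`, `readFully`, `download`, and `looksLikePdf` are made up for illustration. The resulting byte array can be wrapped in a `ByteArrayInputStream` and handed to whatever stream-based loader your PDF library offers (for PDFBox of that era, probably `PDDocument.load(InputStream)`; check the Javadoc of your version).

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of any PDF library.
class InMemoryDownloadSketch {

    /** Copies everything from the stream into a byte array held in memory. */
    static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }

    /** Downloads the resource behind the URL without writing it to disk. */
    static byte[] download(URL url) throws IOException {
        try (InputStream in = url.openStream()) {
            return readFully(in);
        }
    }

    /** Rough sanity check: does the data begin with the "%PDF-" header? */
    static boolean looksLikePdf(byte[] data) {
        byte[] magic = "%PDF-".getBytes(StandardCharsets.US_ASCII);
        if (data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }
}
```

Nothing here touches the file system; the bytes disappear when the array is garbage collected. Whether the PDF library itself then uses a temp file is a separate matter.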
@mkl Finally we have reached common ground. How do I load the PDF into memory? I mean, in PDFBox or whatever else can provide such a thing? — lonesome, Jan 26 '15 at 10:54
@TilmanHausherr So do you mean that when, for example, Firefox opens the whole PDF in its viewer, it has been downloaded to my disk? — lonesome, Jan 26 '15 at 10:58
If the whole pdf can be seen, then yes, on the disk or in your memory. However I just looked at the source code of pdf.js, it seems like they are doing the range requests. https://github.com/mozilla/pdf.js/blob/master/src/core/chunked_stream.js A google search for "pdf.js range requests" also brings results. — Tilman Hausherr, Jan 26 '15 at 11:13
PDFBox loads an inputStream as a whole into a temp file and goes on from there, because it needs random access and it doesn't do range requests. — Tilman Hausherr, Jan 26 '15 at 11:16
@TilmanHausherr So if I want to do it in memory, the file will no longer be accessible after the program finishes, right? Or should I write extra code to erase it? — lonesome, Jan 26 '15 at 11:20
PDFBox will delete the temp file when you close the document. — Tilman Hausherr, Jan 26 '15 at 11:24
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/69590/discussion-between-lonesome-and-tilman-hausherr). — lonesome, Jan 26 '15 at 11:27
@TilmanHausherr *PDFBox loads an inputStream as a whole into a temp file* - wouldn't it be worth considering a PDFParser extension which accepts a byte array and sets `raStream` to a `RandomAccessBufferedFileInputStream` facade operating in memory on the array alone? That would allow use in contexts where the program has no file system access permissions. — mkl, Jan 26 '15 at 12:15
yes I had that thought too when going through the sources earlier... I'm ill now but I'll think about that again in a few days. — Tilman Hausherr, Jan 26 '15 at 12:18

score 0 · Answer 1 · answered Jan 28 '15 at 09:29

Gnostice PDFOne (for Java) has a getPageElements() method that can parse a PDF page for text and image elements. Text in a PDF is not in a DOM like a HTML or XML document. Text just appears in various x-y coordinates and magically looks well-formatted. However, PDFOne has some PDF text extraction methods that reconstruct those text elements to user-friendly sentences. DISCLOSURE: I work for the company that makes this library.
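To illustrate what "reconstructing" positioned text can mean, here is a minimal sketch that groups text fragments into lines by their y coordinate and orders each line by x. It is not PDFOne's algorithm and uses no PDF library; the `Fragment` type, the `toLines` method, and the tolerance parameter are all hypothetical, and real extractors also have to deal with fonts, rotation, columns, and word spacing.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration only; not taken from PDFOne or any other library.
class TextLineSketch {

    /** A piece of text with the position at which it is drawn on the page. */
    static final class Fragment {
        final double x;
        final double y;
        final String text;

        Fragment(double x, double y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    /**
     * Groups fragments into lines: a fragment joins the current line when its
     * y differs from the line's first fragment by at most yTolerance.
     * Lines are returned top to bottom, assuming y grows upward as in PDF
     * user space, and the fragments of a line are joined left to right.
     */
    static List<String> toLines(List<Fragment> fragments, double yTolerance) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.y).reversed());

        List<List<Fragment>> lines = new ArrayList<>();
        for (Fragment f : sorted) {
            List<Fragment> current = lines.isEmpty() ? null : lines.get(lines.size() - 1);
            if (current == null || Math.abs(current.get(0).y - f.y) > yTolerance) {
                current = new ArrayList<>();
                lines.add(current);
            }
            current.add(f);
        }

        List<String> result = new ArrayList<>();
        for (List<Fragment> line : lines) {
            line.sort(Comparator.comparingDouble((Fragment f) -> f.x));
            StringBuilder sb = new StringBuilder();
            for (Fragment f : line) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(f.text);
            }
            result.add(sb.toString());
        }
        return result;
    }
}
```

Grouping by y before sorting by x matters: fragments on the same visual line often have slightly different y values, so a single sort on (y, x) would scramble their order.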

gn1

"Text in a PDF is not in a DOM like a HTML or XML document." I know this, but I meant whether there is any way to treat a PDF file like this, if possible without downloading the whole PDF file, and doing things like what I explained in memory or so. – lonesome Jan 29 '15 at 10:46
So, does this library provide such functionality? – lonesome Jan 29 '15 at 10:47
The getPageElements returns an array, which you can iterate through, like a DOM array. You can get all page elements in that array or just text elements or image elements or formfields or annotations. PDFOne can load a PDF from a memory stream or byte array. So, you need to load the online PDF into a memory stream or byte array. – gn1 Jan 30 '15 at 03:21
Oh, that sounds sweet. I couldn't find any specific documentation on the site. Can you show me where I can get it? And with the free version, am I allowed to do this image and text extraction and the memory stream stuff? – lonesome Jan 30 '15 at 03:25
The link to getPageElements shows you how to iterate through page elements. The Free version of this library was released long ago and I don't think it has the getPageElements function. – gn1 Jan 30 '15 at 05:07
Ooh, that is such a pity. – lonesome Jan 30 '15 at 05:25

score -1 · Answer 2 · answered Jan 25 '15 at 10:33

PDFImageStream can do that. There is a free version with only one restriction: it can only be used in single-threaded applications.

atao

I took a quick look at it. Does it do the image and text extraction online? – lonesome Jan 25 '15 at 10:42
What do you mean? Give a scenario. – atao Jan 25 '15 at 17:06
PDF doesn't work like HTML. With the latter, everything is a link, so the data of a picture is (almost! see e.g. inlined favicons) never inside an HTML document. With the former, nearly everything is "embedded". You can't get a document without also fetching the data of the pictures it displays. – atao Jan 26 '15 at 07:11
Following up on the preceding comment: HTML is designed to be viewed online. PDF is designed to be standalone (even the fonts can be embedded). – atao Jan 26 '15 at 07:19
Following up on the two preceding comments: actually, PDF allows image data to be stored in external files using external streams or Alternate Images. But it's quite unusual. – atao Jan 26 '15 at 07:31
If you have ever used Hotmail to view a PDF attached to a mail, it opens it online as a Word document. You can select text and pictures from it without downloading it. I want something like that: to access the PDF on the site, not on my HDD. I mean on the website where the PDF already is... The library you suggested extracts the image with a big logo in the middle of it, which makes it quite useless! – lonesome Jan 26 '15 at 09:29
