
I have searched a lot before asking this question. I have a Java program which crawls some web pages, looks for .doc and .pdf files, and downloads them. The problem is that a single .pdf or .doc can be 3-4 MB, and there are millions of files, so downloading everything is not practical. I would like to read a PDF or DOC file online and download only its text, without downloading the whole file, but I could not figure out how to do that. If necessary I can provide my code.

Edit: This question can be closed now, since I got the idea and the (non-)solution. Thanks for the help.

And what's up with the downvotes on this question?

Y.Kaan Yılmaz

2 Answers


That is not possible. You can only start extracting the document once you download the bytes.

(Unless you also have control over the server; in that case you could do the extraction server-side and provide a .txt download link.)
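
As a rough illustration of the first point, here is a minimal sketch of extracting the text while the bytes stream in, so the file is never written to disk. It assumes Apache Tika is on the classpath, and the URL is just a placeholder:

    import java.io.InputStream;
    import java.net.URL;
    import org.apache.tika.Tika;

    public class StreamingTextExtractor {
        public static void main(String[] args) throws Exception {
            // Placeholder URL: replace with a document link found by the crawler.
            URL docUrl = new URL("http://example.com/some-document.pdf");

            // Tika detects the format (PDF, DOC, DOCX, ...) and extracts plain text.
            Tika tika = new Tika();

            // The bytes are still downloaded, but they are parsed as they arrive
            // and never saved to a local file.
            try (InputStream in = docUrl.openStream()) {
                String text = tika.parseToString(in);
                System.out.println(text);
            }
        }
    }

The bandwidth cost is the same as downloading the file; the only saving is the intermediate write to disk.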

Rob Audenaerde
  • I do not have control over the servers. I'm crawling thousands of web sites and fetching files, so I have to download them. It's not going to be an efficient program, but thanks anyway. – Y.Kaan Yılmaz Feb 18 '16 at 08:20
  • @kaanyılmaz Yes it will be inefficient. You could extract the files while downloading to prevent having to save them. But that is the best you can get I'm afraid. – Rob Audenaerde Feb 18 '16 at 08:22
  • Once I download a file, I'll extract the text and get rid of the file. That's my only idea for now. – Y.Kaan Yılmaz Feb 18 '16 at 08:23

Reading a file from a website on the Internet without downloading it is impossible.

If you have control of the server you could write a web service that can parse the files on demand and extract the parts you are interested in, which would then be sent to the client.
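
If you did control the server, such a service could look roughly like this sketch. The port, the /var/docs directory, and the query parameter are made-up placeholders, and Apache Tika is assumed for the parsing:

    import com.sun.net.httpserver.HttpServer;
    import org.apache.tika.Tika;

    import java.io.*;
    import java.net.InetSocketAddress;
    import java.nio.charset.StandardCharsets;

    public class TextExtractionService {
        public static void main(String[] args) throws IOException {
            Tika tika = new Tika();
            HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);

            // GET /text?file=report.pdf  ->  plain text extracted from /var/docs/report.pdf
            server.createContext("/text", exchange -> {
                String query = exchange.getRequestURI().getQuery();   // e.g. "file=report.pdf"
                String fileName = query.substring(query.indexOf('=') + 1);

                int status;
                byte[] body;
                try (InputStream in = new FileInputStream(new File("/var/docs", fileName))) {
                    body = tika.parseToString(in).getBytes(StandardCharsets.UTF_8);
                    status = 200;
                } catch (Exception e) {
                    body = ("Extraction failed: " + e.getMessage()).getBytes(StandardCharsets.UTF_8);
                    status = 500;
                }

                exchange.getResponseHeaders().set("Content-Type", "text/plain; charset=utf-8");
                exchange.sendResponseHeaders(status, body.length);
                try (OutputStream out = exchange.getResponseBody()) {
                    out.write(body);
                }
            });
            server.start();
        }
    }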

If not, and if you're up for a more challenging problem, you could write an HTTP client that starts downloading the file and parses it on the fly, downloading only as much as you need to extract the part(s) you want. This might or might not be feasible (or worthwhile) depending on where in the files the "interesting" bits are located. If they're close to the beginning in most cases, then you might be able to reduce the download size significantly.
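
A very rough sketch of that idea, assuming the server honors HTTP Range requests (many do not), with an arbitrary 64 KB limit purely for illustration:

    import java.io.ByteArrayOutputStream;
    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class PartialDownload {
        public static void main(String[] args) throws Exception {
            // Placeholder URL: replace with a document link found by the crawler.
            URL docUrl = new URL("http://example.com/some-document.pdf");

            HttpURLConnection conn = (HttpURLConnection) docUrl.openConnection();
            // Ask for only the first 64 KB of the file.
            conn.setRequestProperty("Range", "bytes=0-65535");

            // 206 Partial Content means the server honored the Range header;
            // 200 means it ignored it and is sending the whole file.
            System.out.println("HTTP status: " + conn.getResponseCode());

            ByteArrayOutputStream head = new ByteArrayOutputStream();
            try (InputStream in = conn.getInputStream()) {
                byte[] buffer = new byte[8192];
                int n;
                while ((n = in.read(buffer)) != -1) {
                    head.write(buffer, 0, n);
                }
            }
            System.out.println("Downloaded " + head.size() + " bytes");
            // Whether these bytes are enough to extract useful text depends entirely
            // on the file format and where the interesting content sits in the file.
        }
    }

Note that a PDF's cross-reference table generally sits at the end of the file, so a PDF parser usually cannot work from just the first chunk; that is part of why the partial-download approach may not pay off.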

A detailed explanation of how to accomplish this is probably beyond the guidelines for StackOverflow answer length.

Jim Garrison
  • I have no idea which parts of the files I need; that's why I'm downloading the whole file. And what's up with the downvotes? I did not ask a silly or duplicate question. I just could not find any solution and asked the community, since there was no question like this. – Y.Kaan Yılmaz Feb 18 '16 at 08:34
  • As to the downvotes, the way the question is written sounds like "I want to do X but don't know how" without an explanation of what approaches you have considered. We generally expect questioners to provide evidence of having done the research, just so others don't repeat the same work you've done. – Jim Garrison Feb 18 '16 at 08:38
  • I think it was more like "I want to do X and Y; I've done X, but not Y, so help me." – Y.Kaan Yılmaz Feb 18 '16 at 08:41
  • Help you in what way? We aren't here to write an entire application for you. If you don't think on-the-fly parsing and extraction will save significant bandwidth, then just download the files and process them. What else do you need help with? – Jim Garrison Feb 18 '16 at 08:45
  • I did not ask you to write me an entire application. I just asked whether there is any way to do that. Have a good day. – Y.Kaan Yılmaz Feb 18 '16 at 08:48
  • If you explain specifically what you need help with we can try to help. – Jim Garrison Feb 18 '16 at 08:49