-1

Many tools can print a particular page or a continuous range of selected pages.

Similarly, we might need to see only a few pages of a document that is available on the internet. So, rather than downloading the complete document, it would be better to fetch only those selected pages. Is there a tool or protocol that can download a particular page of a document (a PDF, a Word document, a Linux-based document format, a PPT file or an Excel file) rather than the whole document?

It would be even more helpful if the tool could download multiple selected pages, random or sequential, from the same document!

I am really surprised that download managers do not support this feature either!

I think there are many advantages, such as a quicker download of only the content the user wants, and saved bandwidth.

Are there specific reasons why so many file transfer/sharing tools and protocols lack this feature?

Is any such tool or protocol available for either Linux or Windows? Any ideas?

Thanks in advance, Karthik Balaguru

3 Answers

2

What you're asking for is an HTTP or FTP server that is application-aware. This would require that the web server has the ability to interpret every document type desired. PDF, Word... oh wait, which version? Word XP? 2000? 2003? .doc or .docx?

You may be able to find a separate application that will perform this function dynamically on your web server, but it's going to eat up resources. It's true that this would save bandwidth - however I expect that the processing resources required on the server to accomplish this would far exceed the bandwidth cost of just sending the entire file.

Kara Marfia
  • Okay, but are there no such interpreters? I don't think the tool needs to open the file to interpret it. It could locate pages by parsing the byte layout (hex format) of the file (just a guess)! Is that not possible? If the file is located at address 0x20000000 on the server, then assume that page 10 is at location 0x20001000. To print the page, it is just a matter of going directly to location 0x20001000 (direct access) and sending the contents until the end of the page. Maybe the client can check for the end-of-page character and notify the server, if that is too much of a burden on the server. – Karthik Balaguru Jan 15 '10 at 13:03
0

Okay, I don't think the tool needs to open the file to interpret it. It could locate pages by parsing the byte layout (hex format) of the file (just a guess)! Is that not possible?

If the file is located at address 0x20000000 on the server, then assume that page 10 is at location 0x20001000. To download the page, it is just a matter of going directly to location 0x20001000 (direct access) and sending the contents. Maybe the client can check for the end-of-page character and notify the server. This might help reduce the burden on the server. Other than that, it is the usual job of sending data from the server. Does it not work this way?

  • Have you ever tried creating a short document in MS Word .doc format, saving it, editing it a bit, and then opening that in Notepad? The actual text that you wrote is all over the file, with all the formatting and meta data stored in other places. Sometimes the pages or parts of pages aren't even in order inside the actual file. – GAThrawn Jan 15 '10 at 13:41
  • Can't we determine the format of the document from the first few lines of the file (usually called the file header)? (I am just guessing!) Once the format is determined, we can easily get the corresponding offsets and other information stored in the other areas, as per the standard. Is that not possible? Can you show me a sample file format here, or a link that shows a file format and the complexity of parsing it? – Karthik Balaguru Jan 15 '10 at 20:31
0

This can't be achieved without the server being application-aware (as Kara Marfia implied). You can't simply jump to the middle of the file and assume that it will be the middle of the document, since most data formats are not structured that way.

Take, for example, an OpenOffice Writer document (I use this example because I know a little bit about the format, but you get similar problems with other formats).

The text content of the file appears in one small chunk (and is wrapped with meta data).

A different part of the file contains the meta data (such as the name of the author).

Elsewhere is the information about how the content should be styled.

And there is a bunch of other data floating around in the file too.

So the data is arranged in a non-linear fashion. It is then compressed — so even the parts of the file that are linear get split up.
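To illustrate the compression point, here is a minimal Python sketch. The 50-page "document" and the `can_decompress` helper are made up for the demonstration and are not part of any real document format. It compresses the text with zlib (the same DEFLATE family of compression that zip uses) and then tries to decode a slice taken from the middle of the compressed bytes, which is roughly what "jump to an offset and start sending" would hand to the client.

```python
import zlib

# Hypothetical 50-page document: each "page" is one line of plain text.
pages = [f"Page {n}: text for page number {n}, words {n * 7919}\n" for n in range(1, 51)]
document = "".join(pages).encode("utf-8")

compressed = zlib.compress(document)


def can_decompress(data: bytes) -> bool:
    """Return True if data is a complete zlib stream that decodes cleanly."""
    try:
        zlib.decompress(data)
        return True
    except zlib.error:
        return False


whole_ok = can_decompress(compressed)

# Take the second half of the compressed bytes, as a naive "seek to offset" would.
middle_slice = compressed[len(compressed) // 2:]
slice_ok = can_decompress(middle_slice)

print("whole file decodes:", whole_ok)
print("slice from the middle decodes:", slice_ok)
```

The slice should fail to decode: a compressed stream refers back to earlier data, so the bytes in the middle are not meaningful without everything that came before them.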

(The above is simplified. Run unzip on an odt file and you can see the structure for yourself.)
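If you do not have an odt file to hand, the following Python sketch gives a rough idea of what unzip shows. It builds a small zip archive in memory using the member names an ODT package normally contains (`mimetype`, `content.xml`, `styles.xml`, `meta.xml`); the member bodies are heavily shortened placeholders, so the result is not a valid ODT file, only an illustration of the layout.

```python
import io
import zipfile

# Shortened stand-ins for the members of an ODT package (assumption: real
# members are far larger and a real package also has a manifest).
members = {
    "mimetype": "application/vnd.oasis.opendocument.text",
    "content.xml": "<office:document-content>text of every page</office:document-content>",
    "styles.xml": "<office:document-styles>fonts, margins, page size</office:document-styles>",
    "meta.xml": "<office:document-meta>author, dates</office:document-meta>",
}

# Write the archive into memory instead of onto disk.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    for name, body in members.items():
        archive.writestr(name, body)

# Read it back and list what a client would find inside.
with zipfile.ZipFile(buffer) as archive:
    names = archive.namelist()
    for info in archive.infolist():
        print(f"{info.filename:12} offset={info.header_offset:4} "
              f"stored={info.compress_size:3} original={info.file_size:3}")
```

Note that there is no member per page: all of the text lives in `content.xml`, the styling lives in `styles.xml`, and where page 10 begins is only known once an application has laid the document out. The list of members is also stored in a directory at the end of a zip file, so even finding `content.xml` means reading the tail of the file first.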

Most document formats are not simple, linear representations of how a document should be rendered. You can't just snip them into pieces and extract the parts you care about.

Quentin
  • Can you paste the format here so that it can be visualized? Can you share a link that shows a few file formats, to illustrate the complexity? I think every file format has a header, a body and other blocks. The client tells the server that it needs XYZ.doc or ABC.ppt, so the file format is conveyed to the server in the request itself. The server's job is just to go to the particular offset and start sending the data, using its knowledge of file formats. Is going to a particular offset and sending a particular page a difficult task? – Karthik Balaguru Jan 15 '10 at 20:36
  • "Can you paste the format here so that it could be visualized" — frankly, no. After decompressing it, a minimal odt is 44k, that is way too much to paste. OOO is free. You can make a file, unzip it and look for yourself. The format is probably documented on the OOO website. – Quentin Jan 15 '10 at 21:12
  • "Can you share a link that shows few file formats" - http://www.google.co.uk/search?q=jpeg%20specification (and replace 'jpeg' with a variety of other formats that interest you) – Quentin Jan 15 '10 at 21:13
  • "The servers job is to just to go the particular offset and start sending the data with its knowledge in file formats." - as mentioned, there is no offset. The formats are NOT linear. The data needed to build any given fragment of a file is usually scattered throughout the file (unless it is a format designed explicitly for network seeking, such as most video formats used on the Internet). – Quentin Jan 15 '10 at 21:14
  • Okay, so non-linearity is the bottleneck. I think the following could be considered (this may sound a bit crazy!). Future revisions of file formats may need to take bandwidth usage into account as well. That is, bandwidth usage could be one of the requirements when designing a file format. If possible, existing file formats could release revised, linear versions to support this. I am not sure how feasible this is for the various file formats, but I think that, if not the existing formats, then at least future formats should be linear! – Karthik Balaguru Jan 16 '10 at 00:52
  • If we continue to use the example of ODT files - that would require that masses of information be duplicated. It would prevent zip compression being used. So in an effort to save bandwidth when downloading part of a file, you end up making the entire file **much**, **much** larger. This is counterproductive. – Quentin Jan 16 '10 at 12:50