Simply put, our system consists of a Server and an Agent. The Agent generates a huge binary file, which may need to be transferred to the Server.

Given:

  1. The system must cope with files up to 1 GB now, which is likely to grow to 10 GB in 2 years.
  2. The transfer must be over HTTP, because other ports may be closed.
  3. This is not a file sharing system - the Agent just needs to push the file to the Server.
  4. Both the Agent and the Server are written in Java.
  5. The binary file may contain sensitive information, so the transfer must be secure.

I am looking for techniques and libraries to help me with transferring huge files. Some of the topics I am aware of are:

  • Compression: Which one to choose? We do not limit ourselves to gzip or deflate just because they are the most popular for HTTP traffic. If there is some unusual compression scheme that yields better results for our task - so be it.
  • Splitting: Obviously, the file needs to be split and transferred in several parallel sessions.
  • Background: Transferring a huge file takes a long time. Does it affect the solution, if at all?
  • Security: Is HTTPS the way to go? Or should we take another approach, given the volume of data?
  • Off-the-shelf: I am fully prepared to code it myself (should be fun), but I cannot avoid the question of whether there are any off-the-shelf solutions satisfying my demands.

Has anyone encountered this problem in their products, and if so, how was it dealt with?

Edit 1

Some may question the choice of HTTP as the transfer protocol. The thing is that the Server and the Agent may be quite remote from each other, even if located in the same corporate network. We have already faced numerous issues related to the fact that customers keep only HTTP ports open on the nodes in their corporate networks. It does not leave us much choice but to use HTTP. Using FTP is fine, but it will have to be tunneled through HTTP - does that mean we still get all the benefits of FTP, or will tunneling cripple it to the point where other alternatives are more viable? I do not know.

Edit 2

Correction - HTTPS is always open and sometimes (but not always) HTTP is open as well. But that is it.

halfer
mark
  • HTTP's a real bad choice for this. Just compress the source with whatever compression software works best for your data, and use a file transfer protocol/tool (of which there are hundreds available, a bunch with encryption available, some with parallel transfer caps) – Mat Dec 25 '11 at 08:44
  • If you tunnel FTP over HTTP, you get all the problems of HTTP **and** all the problems of FTP. It's even worse, don't do it. From your description, you've got large volumes of sensitive information - that's supposed to be high value. If your customers don't want to have ports for secure file transfers in this scenario and prefer opening ports for a plain text, session-less, unsecured protocol that's not meant for the purpose they need, well, can't do much for you. – Mat Dec 25 '11 at 08:58
  • Oops, I have misled you. HTTPS is always open. Sometimes HTTP is open as well, but sometimes it is just HTTPS. – mark Dec 25 '11 at 09:11
  • Hi @mark, I am sure you already solved the problem, 6 years on. What was your final solution to this? – kimathie Nov 18 '18 at 01:02
  • Ouch, I have no recollection whatsoever :-) – mark Nov 18 '18 at 02:04

1 Answer

You can use any protocol on port 80. Using HTTP is a good choice, but you don't have to use it.

Compression: Which one to choose? We do not limit ourselves to gzip or deflate just because they are the most popular for HTTP traffic. If there is some unusual compression scheme that yields better results for our task - so be it.

The best compression depends on the content. I would use Deflater (java.util.zip) for simplicity; however, BZIP2 can give better results (it requires a library).

For your file type, you may find that applying some compression specific to that type first makes the data sent smaller.
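As a rough illustration of the Deflater route, here is a minimal sketch that compresses one stream into another using only java.util.zip. The class and method names (StreamCompression, compress, decompress) are made up for this example, and the BEST_SPEED level is just an assumption that CPU time matters more than ratio for multi-gigabyte files; measure on your own data before settling on anything.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

// Hypothetical helper, not part of any library.
final class StreamCompression {

    /** Compresses everything read from in and writes it to out (out is closed). */
    static void compress(InputStream in, OutputStream out) throws IOException {
        // BEST_SPEED is an assumed trade-off; try other levels on real data.
        Deflater deflater = new Deflater(Deflater.BEST_SPEED);
        try (DeflaterOutputStream zip = new DeflaterOutputStream(out, deflater)) {
            in.transferTo(zip); // streams in small buffers, never the whole file
        } finally {
            deflater.end(); // a caller-supplied Deflater must be released explicitly
        }
    }

    /** Reverses compress(): inflates everything read from in into out. */
    static void decompress(InputStream in, OutputStream out) throws IOException {
        try (InflaterInputStream unzip = new InflaterInputStream(in)) {
            unzip.transferTo(out);
        }
    }
}
```

Because both methods work on streams, the Agent could wrap the request body directly and never hold the file in memory. Whether this pays off depends entirely on how compressible the binary data is.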

Splitting: Obviously, the file needs to be split and transferred in several parallel sessions.

This is not obvious to me. Downloading data in parallel improves performance by grabbing more of the available bandwidth (i.e. squeezing out other users of the same bandwidth). This may be undesirable, or even pointless if there are no other users.

Background: Transferring a huge file takes a long time. Does it affect the solution, if at all?

You will want the ability to re-start the download at any point.
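One way to picture restartability on the sending side: if the Agent can find out how many bytes the Server already holds, it only needs to send the rest. The sketch below shows that second half. How the offset is obtained is deliberately left out, since it depends on the server you pick; the names ResumableSend and sendFrom are invented for the example.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper, not part of any library.
final class ResumableSend {

    /**
     * Writes the bytes of file from offset to the end of the file into out
     * and returns how many bytes were written. The offset is assumed to be
     * the number of bytes the receiver has already stored.
     */
    static long sendFrom(Path file, long offset, OutputStream out) throws IOException {
        try (FileChannel source = FileChannel.open(file, StandardOpenOption.READ)) {
            // The channel wrapper is not closed, so out stays open for the caller.
            WritableByteChannel sink = Channels.newChannel(out);
            long size = source.size();
            long position = offset;
            while (position < size) {
                long sent = source.transferTo(position, size - position, sink);
                if (sent <= 0) {
                    break; // nothing more could be read, e.g. the file shrank
                }
                position += sent;
            }
            return position - offset;
        }
    }
}
```

For example, after a dropped connection at byte 4 of a 10-byte file, sendFrom(file, 4, out) writes the remaining 6 bytes. The same idea applies to restarting a download, with the roles reversed.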

Security: Is HTTPS the way to go? Or should we take another approach, given the volume of data?

I am sure it's fine, regardless of the volume of data.
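To make that concrete, here is a minimal sketch of pushing a file over HTTPS with the JDK's own HttpURLConnection. The point it illustrates is the streaming mode: without it, the connection buffers the whole request body in memory, which will not work for gigabyte-sized files. The class name StreamingUpload, the use of PUT, and the content type are assumptions for the example; the endpoint is whatever your Server exposes.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of any library.
final class StreamingUpload {

    /** Streams file as the body of a PUT request and returns the HTTP status code. */
    static int put(URL target, Path file) throws IOException {
        // For an https:// URL this is an HttpsURLConnection, so TLS is handled by the JDK.
        HttpURLConnection conn = (HttpURLConnection) target.openConnection();
        conn.setRequestMethod("PUT"); // assumed verb; use whatever the Server expects
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/octet-stream");
        // Send the body as it is written instead of buffering it all in memory.
        conn.setFixedLengthStreamingMode(Files.size(file));
        try (InputStream in = Files.newInputStream(file);
             OutputStream out = conn.getOutputStream()) {
            byte[] buffer = new byte[64 * 1024];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        }
        try {
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }
}
```

If the body length is not known in advance (for instance when compressing on the fly), setChunkedStreamingMode can be used instead of setFixedLengthStreamingMode, provided the Server accepts chunked request bodies.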

Off-the-shelf: I am fully prepared to code it myself (should be fun), but I cannot avoid the question of whether there are any off-the-shelf solutions satisfying my demands.

I would try using existing web servers to see if they are up to the job. I would be surprised if there isn't a free web server which does all the above.

Here is a selection: http://www.java-sources.net/open-source/web-servers

Peter Lawrey
  • What about networks that do content inspection? In this case there would be a false alarm if you use your custom protocol over port 80. – rit Dec 25 '11 at 09:17
  • In that case it could reject HTTPS which is based on HTTP but isn't exactly the same. You may have to do more to fool the content inspection. – Peter Lawrey Dec 25 '11 at 09:25
  • Are your files really compressible? Given the volume of data, compressing will take a huge amount of time. If you know your files are text-based then fine, but if not you should not even bother. – fge Dec 25 '11 at 10:20
  • @Peter - I disagree with your statement that a custom protocol can be used on port 80. We are not supposed to use ports under 1024 for our apps, and port 80 is reserved for HTTP. I am not going to test the robustness of all the network hops by giving them unexpected protocols where HTTP is universally expected. HTTPS uses a different port, so there is no contradiction here. BTW, using HTTPS over port 80 seems like a very bad idea as well. – mark Dec 25 '11 at 11:44