
Consider this code:

package com.zip;

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpHead;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.util.Date;

import static com.diffplug.common.base.Errors.rethrow;

/**
 * @author nsheremet
 */
public class ParallelDownload2 {
  public static int THREADCOUNT = 20;
  private static final String URL = "https://server.com/myfile.zip";
  public static String OUTPUT = "C:\\!deleteme\\myfile.zip";
  public static void main(String[] args) throws Exception {
    System.setProperty("https.protocols", "TLSv1,TLSv1.1,TLSv1.2");
    System.out.println(new Date());

    CloseableHttpClient httpClient = HttpClients.createDefault();

    HttpGet request = new HttpGet(URL);
    request.addHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36");
    CloseableHttpResponse response = rethrow().wrap(() -> httpClient.execute(request)).get();
    long contentLength = Long.parseLong(response.getFirstHeader("Content-Length").getValue());
    long blocksize = contentLength / THREADCOUNT;

    RandomAccessFile randomAccessFile = new RandomAccessFile(new File(OUTPUT), "rwd");
    randomAccessFile.setLength(contentLength);
    randomAccessFile.close();
    response.close();

    for (long i = 0; i < THREADCOUNT; i++) {
      long startpos = i * blocksize;
      long endpos = (i + 1) * blocksize - 1;
      if (i == THREADCOUNT - 1) {
        endpos = contentLength - 1; // Range end is inclusive
      }
      new Thread(new DownloadTask(i, startpos, endpos)).start();
    }
    System.out.println(new Date());
  }

  public static class DownloadTask implements Runnable {

    public DownloadTask(
        long id,
        long startpos,
        long endpos
    ) {
      this.id = id;
      this.startpos = startpos;
      this.endpos = endpos;
    }

    long id;
    long startpos;
    long endpos;

    @Override
    public void run() {
      try {
        CloseableHttpClient httpClient = HttpClients.createDefault();

        HttpGet request = new HttpGet(URL);
        request.addHeader("Range", "bytes=" + startpos + "-" + endpos);
        request.addHeader("Connection", "keep-alive");
        request.addHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36");
        CloseableHttpResponse response = rethrow().wrap(() -> httpClient.execute(request)).get();

        if (response.getStatusLine().getStatusCode() == 206) {

          InputStream is = response.getEntity().getContent();
          RandomAccessFile randomAccessFile = new RandomAccessFile(new File(OUTPUT), "rwd");
          randomAccessFile.seek(startpos);
          int len = 0;
          byte[] buffer = new byte[1024*10];
          while ((len = is.read(buffer)) != -1) {
            randomAccessFile.write(buffer, 0, len);
          }
          is.close();
          randomAccessFile.close();
          System.out.println("Thread "+ Thread.currentThread().getId() +": Download");
        }
      } catch (IOException e) {
        e.printStackTrace();
      }
      System.out.println(new Date());
    }

  }

}

This is a modified copy of this one, which is written using plain URL.openConnection. Why does the URL.openConnection version with multithreading download the file at 10 Mb/sec, while the Apache HttpClient version mostly runs between 1-5 Mb/sec? Have I missed something in the Apache HttpClient settings?
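For comparison, the URL.openConnection variant mentioned above boils down to something like the following. This is only a sketch of how the ranged request is built (the URL is a placeholder); openConnection() does not touch the network until connect() or getInputStream() is called:

```java
import java.net.HttpURLConnection;
import java.net.URL;

public class RangeRequestSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder URL; no connection is opened here.
        URL url = new URL("https://server.com/myfile.zip");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        // Same inclusive byte-range header the HttpClient version sends.
        conn.setRequestProperty("Range", "bytes=0-1023"); // first 1 KiB block
        System.out.println(conn.getRequestProperty("Range"));
    }
}
```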

UPDATED

  1. I use multiple HttpClient instances because a single instance gives the same performance as one connection via URL.
  2. Apache HttpClient is used in many high-performance servers, so I believe there is definitely a configuration issue. But what exactly?
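One non-default setting worth ruling out: the default client's connection pool allows only 2 concurrent connections per route, which would serialize most of the 20 threads if a single client were shared. A minimal sketch (the pool sizes are illustrative, not a recommendation):

```java
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;

public class PooledClient {
    public static CloseableHttpClient create(int threads) {
        PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();
        cm.setMaxTotal(threads);           // total connections across all routes
        cm.setDefaultMaxPerRoute(threads); // default is 2, which throttles parallel range requests
        return HttpClients.custom()
                .setConnectionManager(cm)
                .build();
    }
}
```

With this, a single shared client can serve all download threads concurrently instead of queuing them on the per-route limit.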

About the code

This is of course not production-ready code and should be considered a prototype in which I want to make multithreaded downloading work fast.

About multithreading

I cannot explain why, because I do not own the resource being downloaded, but multithreaded downloading is much faster (10 times) than a single thread.

Cherry
  • I have no answer, but do realise that Apache HttpClient is basically to http clients what Oracle is to DBMSes. You pick it for the incredible configurability, not for how easy it is to use. Just sticking to the default implementation of HttpClient likely is at the root of your problems, that sets it up to be really basic. I wouldn't be surprised if chunked downloading is built-in and you want to let HttpClient do the work for you rather than rolling your own. The code you have is aimed at URLConnection which does nothing for you. – Gimby Feb 11 '20 at 12:19
  • Apache HttpClient of all versions has always been comfortably faster than HUC in JRE 1.8 and earlier. Some while ago I stopped comparing HttpClient performance compared to JRE HUC because it had become pointless. – ok2c Feb 11 '20 at 15:53
  • Why are you using multiple HttpClient instances instead of one? – ok2c Feb 11 '20 at 15:53
  • 1) Any `non default` configurations for apache client are welcome – Cherry Feb 12 '20 at 06:38
  • 2) About multiple HttpClient - single http client (with multiple requests) have same speed as single connection. – Cherry Feb 12 '20 at 06:39
  • I guess you should have a single client with multiple connections (saves overhead). You are also downloading everything then writing to file; why not download streaming, which reduces memory pressure and thus is easier on the GC? Your code/test is weird: you get the file, then download things partially in multiple threads. So you already have the file, discard it, and download it again?! – M. Deinum Feb 12 '20 at 06:48
  • Why are you creating the output file twice? And what makes you think multiple parallel downloads will be faster than a single one? The network isn't multithreaded. – user207421 Feb 12 '20 at 06:57
  • There are way too many moving parts in your code. For instance HttpClient may negotiate different TLS compared to JRE HUC (HttpClient ignores all system properties by default, so `https.protocols` setting in your code has no effect). Simplify your code: exclude TLS, remove multi-threading, do a single file download first, measure performance, gradually add more complexity. – ok2c Feb 12 '20 at 15:16
  • @user207421 Question updated. Please read attentively: multithreading is faster than a single connection; I do not know why, but it is. And I am not just "thinking so" - I have measured this and mention it in the question. I do not create the file twice - it is created once and then the different parts are written to it. – Cherry Feb 19 '20 at 05:59
  • @M.Deinum `You are also downloading everything then writing to file, why not download streaming, which reduces memory pressure and thus easier on GC` - the link points to a single file; look at the code, it makes a first request to get the content size. There is no streaming where each part can be processed separately - the result is a single file. – Cherry Feb 19 '20 at 06:02
  • @ok2c I have measured the performance with single-threaded downloading and it is slower than multithreading. – Cherry Feb 19 '20 at 06:03
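Regarding ok2c's point that `createDefault()` ignores system properties: the `https.protocols` line in the question's main() has no effect on the Apache client. A sketch of a client built to honour JVM-wide networking settings (uses the builder's `useSystemProperties()` method, available in HttpClient 4.x):

```java
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

public class SystemPropsClient {
    public static CloseableHttpClient create() {
        // useSystemProperties() makes the client respect https.protocols,
        // http.proxyHost, etc.; HttpClients.createDefault() does not.
        return HttpClients.custom()
                .useSystemProperties()
                .build();
    }
}
```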

1 Answer

A few possibilities.

(1) The difference might not be transfer speed but initial connection latency. I had a similar problem in the past, and the culprit turned out to be IPv6: the initial request was made over IPv6 and fell back to IPv4 silently, but only after a timeout.

Try running with -Djava.net.preferIPv4Stack=true, or specify the host as a numeric IPv4 quad, and see if it makes a difference.
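If you cannot change the launch command, the equivalent can be set in code, but it must happen before any networking classes are initialized (a minimal sketch):

```java
public class PreferIPv4 {
    public static void main(String[] args) {
        // Equivalent to -Djava.net.preferIPv4Stack=true on the command line;
        // must run before the first use of java.net classes to take effect.
        System.setProperty("java.net.preferIPv4Stack", "true");
        System.out.println(System.getProperty("java.net.preferIPv4Stack"));
    }
}
```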

(2) The difference might be due to the https implementation, which may be checking certificate paths, online revocation lists, etc. Check with an http URL whether it makes a difference. If it does, look at Apache's documentation on how to configure https behaviour to your liking.

(3) In any case, running tcpdump or Wireshark will likely give you more useful information.

jurez