
I'm trying to test the limits of the apache http client libraries and I'm running into a strange bottleneck. My test consists of the following:

  1. A wiremock server that simulates a fixed 80ms delay

  2. A unit test which uses an org.apache.http.impl.nio.client.CloseableHttpAsyncClient to make a configurable number of requests to the wiremock server as fast as possible, while gathering statistics.

  3. A unit test which uses an org.apache.http.impl.client.CloseableHttpClient to make a configurable number of requests to the wiremock server as fast as possible, from a configurable number of threads, while gathering statistics.

  4. A unit test which uses an org.springframework.web.reactive.function.client.WebClient to make a configurable number of requests to the wiremock server as fast as possible, while gathering statistics.

All of the tests show the same throughput of roughly 570 requests / second on my local machine. CPU utilization is very low while they run, about 5%, so I can assume the bottleneck is not the CPU but somewhere else...

My question is: where is this bottleneck, and how can I widen it?
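One way to frame the question: by Little's law, concurrency = throughput × latency, so the observed numbers already pin down how many requests are actually in flight. A quick sketch of the arithmetic (my own back-of-the-envelope check, not part of the original tests):

```java
// Little's law: sustaining 570 req/s against an 80ms latency floor
// implies only about 46 requests are ever in flight at once.
public class LittlesLawCheck {
    public static void main(String[] args) {
        double observedThroughput = 570.0; // requests per second, from the tests above
        double serverDelaySeconds = 0.080; // wiremock's fixed delay
        double impliedConcurrency = observedThroughput * serverDelaySeconds;
        System.out.printf("Implied in-flight requests: %.1f%n", impliedConcurrency);
    }
}
```

Whatever caps the number of simultaneously open requests at roughly that level would explain why all three clients converge on the same number.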

My System Configuration:

  • Processor: 3.1 GHz Quad-Core Intel Core i7
  • Memory: 16 GB 2133 MHz LPDDR3
  • OS: OSX 10.15.5
  • Java version: 11.0.4

My Unit Test:

package com.blakeparmeter.bottleneck_mystery;

import org.apache.http.HttpResponse;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.concurrent.FutureCallback;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClientBuilder;
import org.apache.http.impl.nio.client.CloseableHttpAsyncClient;
import org.apache.http.impl.nio.client.HttpAsyncClients;
import org.junit.jupiter.api.Assertions;
import org.junit.jupiter.api.Test;
import org.springframework.web.reactive.function.client.WebClient;
import reactor.core.publisher.Flux;

import java.io.IOException;
import java.net.URI;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ForkJoinPool;
import java.util.stream.IntStream;

/**
 * Used to illustrate a performance bottleneck with the apache HttpClients
 */
public class BottleneckTest {

    // Test variables
    final int totalTests = 5_000;
    final long messageInterval = 1000; //ms
    final URI testUri = URI.create("http://localhost:5000/wait/fixed/empty");

    @Test
    public void testSync() throws InterruptedException, ExecutionException {

        final int numThreads = 100;

        // Creates the sync client (unit under test)
        final CloseableHttpClient unitUnderTest = HttpClientBuilder
                .create()
                .setMaxConnTotal(5000)
                .setMaxConnPerRoute(5000)
                .build();

        // Run the test on an executor, send results to a stats aggregator
        final ForkJoinPool executor = new ForkJoinPool(numThreads);
        final StatsAggregator statsAggregator = new StatsAggregator(totalTests, messageInterval);
        executor.submit(() -> IntStream.range(0, numThreads)
                .parallel()
                .forEach(threadNum -> IntStream.range(0, totalTests / numThreads).forEach(testNum -> {
                    final long runStart = System.currentTimeMillis();
                    try (final CloseableHttpResponse response = unitUnderTest.execute(new HttpGet(testUri))) {
                        // we don't need to do anything with the response, just make sure it's sent.
                    } catch (final IOException e) {
                        Assertions.fail(e);
                    }
                    statsAggregator.addTestDuration(System.currentTimeMillis() - runStart);
                })))
                .get();

        // print the stats one last time (await is not needed since we wait on the executor)
        statsAggregator.printStats();
    }

    @Test
    public void testAsync() throws InterruptedException {

        // Creates the async client (unit under test)
        final CloseableHttpAsyncClient unitUnderTest = HttpAsyncClients.custom()
                .setMaxConnTotal(5000)
                .setMaxConnPerRoute(5000)
                .build();
        unitUnderTest.start();

        // Runs all of the tests, sends results to a stats aggregator
        final CountDownLatch testCountdown = new CountDownLatch(totalTests);
        final StatsAggregator statsAggregator = new StatsAggregator(totalTests, messageInterval);
        IntStream.range(0, totalTests).forEach(testNum -> {

            final long runStart = System.currentTimeMillis();
            unitUnderTest.execute(new HttpGet(testUri), new FutureCallback<>() {

                @Override
                public void completed(final HttpResponse response) {
                    statsAggregator.addTestDuration(System.currentTimeMillis() - runStart);
                    testCountdown.countDown();
                }

                @Override
                public void failed(final Exception ex) {
                    Assertions.fail(ex.getMessage());
                }

                @Override
                public void cancelled() {
                    Assertions.fail("Http Request Cancelled");
                }
            });
        });

        // await execution then print the stats one last time
        testCountdown.await();
        statsAggregator.printStats();
    }

    @Test
    public void testReactive() {

        final WebClient unitUnderTest = WebClient.builder().build();

        // Runs all of the tests, sends results to a stats aggregator
        final StatsAggregator statsAggregator = new StatsAggregator(totalTests, messageInterval);
        Flux.range(0, totalTests)
                .flatMap(testNum -> {
                    final long runStart = System.currentTimeMillis();
                    return unitUnderTest.get()
                            .uri(testUri)
                            .retrieve()
                            .bodyToMono(Object.class)
                            .doOnSuccess(obj -> statsAggregator.addTestDuration(System.currentTimeMillis() - runStart));
                })
                .then()
                .block();

        // print the stats one last time
        statsAggregator.printStats();
    }
}
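For a cross-check that needs nothing outside the JDK, the same kind of measurement can be made with Java 11's built-in java.net.http client against an in-process com.sun.net.httpserver stub. This is a self-contained sketch added for illustration (it is not part of the original tests, and the stub has no artificial delay):

```java
import com.sun.net.httpserver.HttpServer;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.concurrent.CompletableFuture;
import java.util.stream.IntStream;

public class JdkClientCrossCheck {
    public static void main(String[] args) throws Exception {
        // In-process stub standing in for wiremock, bound to an ephemeral port
        HttpServer server = HttpServer.create(new InetSocketAddress(0), 0);
        server.createContext("/wait/fixed/empty", exchange -> {
            exchange.sendResponseHeaders(200, -1); // 200 with no body
            exchange.close();
        });
        server.start();

        URI uri = URI.create("http://localhost:" + server.getAddress().getPort() + "/wait/fixed/empty");
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(uri).GET().build();

        // Fire all requests asynchronously and wait for the batch to complete
        int total = 100;
        long start = System.currentTimeMillis();
        CompletableFuture<?>[] futures = IntStream.range(0, total)
                .mapToObj(i -> client.sendAsync(request, HttpResponse.BodyHandlers.discarding()))
                .toArray(CompletableFuture[]::new);
        CompletableFuture.allOf(futures).join();
        System.out.println(total + " requests in " + (System.currentTimeMillis() - start) + "ms");

        server.stop(0);
    }
}
```

If a completely different client stack hits the same ceiling against the same wiremock server, the limit is more likely on the server or OS side than in any one client library.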

StatsAggregator.java:

package com.blakeparmeter.bottleneck_mystery;

import java.text.DecimalFormat;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Timer;
import java.util.TimerTask;

/**
 * @author Blake L. Parmeter
 */
public class StatsAggregator {

    private static final DecimalFormat testCompleteFormat = new DecimalFormat("###,###,###");
    private static final DecimalFormat avgRequestFormat = new DecimalFormat("###,###.##");

    private final long start = System.currentTimeMillis();
    private final List<Long> times;
    private final TimerTask renderStatisticsTask;

    // Creates a timer task to calculate and render runtime stats in realtime.
    public StatsAggregator(final int totalTests, final long messageIntervalMillis) {

        this.times = new ArrayList<>(totalTests);

        renderStatisticsTask = new TimerTask() {

            private Long lastLogTime = null;
            private Integer lastLogSize = null;

            @Override
            public void run() {

                // Init variables needed for calculations
                final long logTime = System.currentTimeMillis();
                final List<Long> statsCopy;
                synchronized (times) {
                    if (!times.isEmpty()) {
                        statsCopy = new ArrayList<>(times);
                    } else {
                        System.out.println("No statistics have been loaded. Statistics will not be calculated.");
                        return;
                    }
                }
                Collections.sort(statsCopy);

                // print execution completion status
                System.out.println();
                final double percentComplete = ((double) statsCopy.size() / (double) totalTests);
                final long runtime = logTime - start; //ms
                final double estimatedTimeRemaining = ((double) runtime / percentComplete) - (double) runtime; //ms
                System.out.println(testCompleteFormat.format(statsCopy.size())
                        + "\tTests completed of:"
                        + testCompleteFormat.format(totalTests)
                        + "\t"
                        + avgRequestFormat.format(percentComplete * 100)
                        + "% complete. "
                        + "Running for: "
                        + runtime / 1000d
                        + " seconds. "
                        + "Estimated Time remaining: "
                        + testCompleteFormat.format(estimatedTimeRemaining / 1000d)
                        + " seconds.");

                // print running average requests / second
                String sinceLastLogStats = "";
                if (lastLogSize != null && lastLogTime != null) {
                    double numSinceLastLog = (double) statsCopy.size() - lastLogSize;
                    double timeSinceLastLog = (double) logTime - lastLogTime;
                    double avgReqPerSecSinceLastLogSec = 1000 * (numSinceLastLog / timeSinceLastLog);
                    sinceLastLogStats = "\tavg req/sec:"
                            + avgRequestFormat.format(avgReqPerSecSinceLastLogSec)
                            + "(since last run)";
                }
                lastLogSize = statsCopy.size();
                lastLogTime = logTime;
                double avgReqPerSec = 1000 * ((double) statsCopy.size() / (double) (logTime - start));
                System.out.println("\tavg req/sec:"
                        + avgRequestFormat.format(avgReqPerSec)
                        + "(total)"
                        + sinceLastLogStats);

                // print average min and max
                double avg = (double) statsCopy.stream().reduce(Long::sum).orElseThrow() / (double) statsCopy.size();
                System.out.println("\tavg:" + avgRequestFormat.format(avg) +
                        "\tmin:" + statsCopy.get(0) +
                        "\tmax:" + statsCopy.get(statsCopy.size() - 1));

                // print percentiles
                System.out.println("\tRequest duration percentiles:\n" +
                        "\t\t1%:" + percentile(statsCopy, 1) +
                        "\t5%:" + percentile(statsCopy, 5) +
                        "\t10%:" + percentile(statsCopy, 10) +
                        "\t50%:" + percentile(statsCopy, 50) +
                        "\t90%:" + percentile(statsCopy, 90) +
                        "\t95%:" + percentile(statsCopy, 95) +
                        "\t99%:" + percentile(statsCopy, 99) +
                        "\t99.9%:" + percentile(statsCopy, 99.9) +
                        "\t99.99%:" + percentile(statsCopy, 99.99));

                System.out.println("\tCalculations took:" + (System.currentTimeMillis() - logTime) + "ms.");
            }
        };

        // Schedule printing of statistics on a timer
        final Timer timer = new Timer("test-output-timer", true);
        timer.schedule(renderStatisticsTask, messageIntervalMillis, messageIntervalMillis);
    }

    public void printStats() {
        renderStatisticsTask.run();
    }

    public void addTestDuration(final long time) {
        // times is also read and copied on the timer thread, so writes need the same lock
        synchronized (times) {
            times.add(time);
        }
    }

    private static long percentile(List<Long> times, double percentile) {
        int index = (int) Math.ceil(percentile / 100.0 * times.size());
        return times.get(index - 1);
    }
}
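The percentile helper above implements the nearest-rank method on an already-sorted list. A standalone sanity check of the same formula (the method is duplicated here because the original is private):

```java
import java.util.List;

public class PercentileCheck {

    // Same nearest-rank formula as StatsAggregator#percentile; expects a sorted list
    static long percentile(final List<Long> sortedTimes, final double percentile) {
        final int index = (int) Math.ceil(percentile / 100.0 * sortedTimes.size());
        return sortedTimes.get(index - 1);
    }

    public static void main(String[] args) {
        final List<Long> sorted = List.of(10L, 20L, 30L, 40L, 50L, 60L, 70L, 80L, 90L, 100L);
        System.out.println(percentile(sorted, 50));  // prints 50
        System.out.println(percentile(sorted, 90));  // prints 90
        System.out.println(percentile(sorted, 99));  // prints 100
    }
}
```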

Running wiremock for this test:

  1. Download the standalone jar from here: http://wiremock.org/docs/running-standalone/ into a directory which will be referred to as <wiremock_directory> in these instructions. The downloaded file will be referred to as <wiremock.jar>
  2. Create a directory <wiremock_directory>/mappings
  3. Create a file named endpoint.json and put it in <wiremock_directory>/mappings
{
    "request": {
        "method": "GET",
        "urlPathPattern": "/wait/fixed/empty"
    },
    "response": {
        "headers": {
            "Content-Type": "application/json;charset=UTF-8"
        },
        "status": 200,
        "fixedDelayMilliseconds": 80
    }
}
  4. Run wiremock in a mode that can handle high concurrency using the command: java -jar <wiremock.jar> --port 5000 --container-threads 250 --jetty-acceptor-threads 200 --no-request-journal NOTE: replace <wiremock.jar> with the name of the downloaded file.
Doctor Parameter

1 Answer


If you cannot max out your CPU's cores, that means your threads are waiting, most likely for IO, since you are making network calls. In general, as soon as IO comes into play, you are unlikely to create a CPU-bound load. To find out what your threads are doing, create a thread dump while your test is running and look at their stack traces.
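A thread dump can be taken externally with jstack <pid>, or programmatically from inside the test JVM via the standard ThreadMXBean API; a minimal sketch (class and variable names are my own):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ThreadDumpSketch {
    public static void main(String[] args) {
        // Dump every live thread with its current state; in the tests above you
        // would expect most worker threads to be WAITING or blocked on socket reads.
        ThreadMXBean threadMx = ManagementFactory.getThreadMXBean();
        for (ThreadInfo info : threadMx.dumpAllThreads(true, true)) {
            System.out.println(info.getThreadName() + " -> " + info.getThreadState());
        }
    }
}
```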

A couple more things:

  • Better not to use a ForkJoinPool if you do not create tasks geared towards work stealing, i.e. ForkJoinTasks. In your case a simple ThreadPoolExecutor is likely to perform better.
  • You might be shooting yourself in the foot with the parallel stream. You create an executor with lots of threads, each of which receives a Runnable that uses the parallel stream machinery, which in turn uses a single shared ForkJoinPool with as many threads as you have logical CPU cores. If you have 8 logical cores, that means you have 100 threads driving tasks that all share a single 8-thread pool to do their work. (Though if that were the case you should only see about 100 requests per second, if my math is right.)
  • Collecting the timing of each request can be costly, too. I would rather not call System.currentTimeMillis() twice per request. (Most likely that cost is negligible compared to the cost of the networked request the http client makes, though.)
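The first bullet's suggestion can be sketched as a plain fixed-size thread pool where each worker owns one request loop, instead of a parallel stream running inside a ForkJoinPool. The names (numThreads, totalTests) mirror the question's test; the HTTP call itself is left as a comment, and a counter stands in for it so the sketch runs standalone:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class FixedPoolSketch {
    public static void main(String[] args) throws InterruptedException {
        final int numThreads = 100;
        final int totalTests = 5_000;
        final AtomicInteger completed = new AtomicInteger();

        // One task per worker thread; no work-stealing machinery involved
        final ExecutorService executor = Executors.newFixedThreadPool(numThreads);
        for (int t = 0; t < numThreads; t++) {
            executor.submit(() -> {
                for (int i = 0; i < totalTests / numThreads; i++) {
                    // issue one blocking HTTP request here, as in testSync;
                    // this sketch only counts iterations
                    completed.incrementAndGet();
                }
            });
        }
        executor.shutdown();
        executor.awaitTermination(1, TimeUnit.MINUTES);
        System.out.println("completed=" + completed.get());
    }
}
```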
Ralf
  • I don't think there is a whole lot of room for performance gains using different classes from `java.util.concurrent`. The threads are all waiting on IO from the http calls, I suspect. What is very strange is that, given the 80ms delay, almost all of the http clients show the same throughput of 570 requests / second. When I remove the delay I get about 20k requests / second with the async clients. I think there is some limit at the OS level or similar that only allows 500-600 concurrent HTTP requests regardless of the http client library. I wanted to know if there is a way to raise this limit. – Doctor Parameter Jan 20 '21 at 19:59
  • As I said, you are unlikely to be able to create a CPU-bound load with a test like this where the "work" is purely IO, and networked IO at that. There is no OS parameter that says "max number of concurrent http connections". – Ralf Jan 21 '21 at 07:40