Occasional excruciatingly slow resources transfer with Tomcat on AWS EC2

Question

I have a running Tomcat v9 instance in a Docker container on an AWS EC2 host.
It works perfectly most of the time, but once in a while it delivers resources very slowly.

What exactly is served slowly and what is "slow"?

I'm talking about both plain static files outside any WAR and, more annoyingly, servlet responses.

I see some 300 to 400KB resources take 5 to 12s to be served, whereas they usually (95%+ of the time) arrive in ~300ms.

Here's an example of what Chrome's Network tab tells me about these dreadful resource transfers:

319KB in 12.02s

278KB in 5.37s

I have no idea what causes this. I have read many threads and tried many configurations but still can't understand what is going on.

From within my VPC

As @Tim suggested, I tried putting the client inside my AWS VPC to rule out network latency and bandwidth as the cause of the issue.

In this setup, responses are generally slower: the minimum Content Download time is ~600ms, whereas it can be as low as 200ms from the "outside world".
I still see the slow peaks that bother me, but instead of reaching tens of times the "normal" minimum download duration, they only go up to about 2.5s.

Response size       | 319 KB
------------------------------
Waiting (TTFB)      | 99.00 ms
Content Download    | 2.81 s

The "waiting" time is the same, as expected, since it represents the time my servlet spends processing the request before it starts to respond.

My environment(s)

I have replicated that phenomenon with the following environment configurations:

On AWS linux AMI t2.medium

Tomcat v9 WITHOUT Docker
Tomcat v9 w/ Docker
Tomcat v8.5 w/ Docker
Tomcat v8.0 w/ Docker

Tomcat v9 w/ Docker on:

AWS linux AMI t2.micro
AWS linux AMI t2.medium
AWS linux AMI m4.xlarge

On AWS linux AMI t2.medium via ECS

Tomcat v9 w/ Docker

At least half of these tests are probably stupid, but well... better too much information than too little.

What I think I can rule out after these is:

my instance is too small (fails with m4.xlarge)
Tomcat's newer versions somehow handle things differently
Docker container's overhead messing things up

My Tomcat configuration

Alright, it must come from there, right? So, here it is:

<?xml version="1.0" encoding="UTF-8"?>
<Server port="8005" shutdown="SHUTDOWN">
    <Listener className="org.apache.catalina.startup.VersionLoggerListener" />
    <Listener className="org.apache.catalina.core.JreMemoryLeakPreventionListener" />
    <Listener className="org.apache.catalina.mbeans.GlobalResourcesLifecycleListener" />
    <Listener className="org.apache.catalina.core.ThreadLocalLeakPreventionListener" />

    <Service name="Catalina">
        <Connector port="8009" protocol="AJP/1.3" redirectPort="8443" maxThreads="1500" />
        <Connector port="8080" protocol="HTTP/1.1" connectionTimeout="20000" redirectPort="443" maxThreads="1500" />
        <Connector port="8443" protocol="org.apache.coyote.http11.Http11Nio2Protocol" sslImplementationName="org.apache.tomcat.util.net.jsse.JSSEImplementation" maxThreads="1500" SSLEnabled="true"
            scheme="https" secure="true" keystoreFile="/root/ssl/XXXXXXXX.jks" keystorePass="XXXXXXXX" clientAuth="false" sslProtocol="TLS" compression="on" compressionMinSize="1024"
            compressableMimeType="application/json" />

        <Engine name="Catalina" defaultHost="localhost">
            <Realm className="org.apache.catalina.realm.LockOutRealm">
                <Realm />
            </Realm>
            <Host name="localhost" appBase="webapps" unpackWARs="true" autoDeploy="true">
                <Valve className="org.apache.catalina.valves.AccessLogValve" directory="logs" prefix="localhost_access_log." suffix=".txt" pattern="%h %l %u %t &quot;%r&quot; %s %b" />
            </Host>
        </Engine>
    </Service>
</Server>

The <Realm /> bit is of course to be replaced by an actual JDBC Realm configuration. Just pointing that out; it's not the issue anyways :).

As you can see, I have tried increasing the maxThreads attribute in all my connectors, as featured in that answer. No changes.
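
To check whether Tomcat itself sees the slow transfers, the access log pattern could be extended with timing fields. The following is only a sketch under assumptions, not a confirmed fix: it reuses the AccessLogValve from the configuration above and appends the standard pattern codes %D, %F and %I. The choice of these three fields is an assumption for illustration.

```xml
<!-- Sketch only: the AccessLogValve from the configuration above, with timing fields appended.
     %D = time taken to process the request (milliseconds on Tomcat 8.x and 9)
     %F = time taken to commit the response (milliseconds)
     %I = name of the thread that handled the request -->
<Valve className="org.apache.catalina.valves.AccessLogValve" directory="logs"
       prefix="localhost_access_log." suffix=".txt"
       pattern="%h %l %u %t &quot;%r&quot; %s %b %D %F %I" />
```

If %D stays small for a request that Chrome reports as taking several seconds of Content Download, that would suggest the delay sits downstream of Tomcat's request processing. If %D is large as well, the thread name from %I can be matched against a thread dump taken during the slow request.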

More information

I have a JMX thingy showing me exactly what's going on in my JVM. I use VisualVM to visualise it all but, as you can probably guess from how I talk about it, I have close to no idea what I'm looking at.

My https-jsse-nio2-8443-exec-X threads, which are the only ones that seem to be doing something when a request hits the server, look the same whether a request is "slow" or "normal". But then again, maybe I just don't see it.

Maybe YOU would see it though :), so here's a screenshot of VisualVM during a slow request:

The thread is mostly "parked" (orange) and sometimes goes "running" (green), but only momentarily, and this doesn't line up with "slow" requests or anything. Maybe there's actually nothing to see here.

I can provide you with thread dumps and everything you need!

My actual question

What can I change to have a consistent and reasonable transfer rate?

Have you checked your EC2 instance credit balance and EBS credit balance? You may need a larger instance. It's probably not that but you need to rule that out before you look at anything else. http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/t2-instances.html#t2-instances-monitoring-cpu-credits . EBS credits are monitored similarly; that feature was only recently released to CloudWatch. — Tim, Nov 21 '16 at 18:25
@Tim I actually replicated the exact same environment on an `m4.xlarge` instance and ran into the same issue... It really seems totally occasional. Even when I'm encountering these slow requests, my instances sit at the maximum CPU Credit Balance (576 on `t2.micro`). I am not using Elastic Beanstalk but maybe I should give it a try? — ccjmne, Nov 21 '16 at 18:57
Interesting. Blue indicates slow content download, but that could be slow production of the web page as it streams across. Can you parse your access logs to produce a graph of response time distribution for the servlet and for a static resource? Can you match the slow download to an access log entry and look at time to generate the page? Can you try with the client in AWS to eliminate internet network bandwidth and latency as an issue? — Tim, Nov 21 '16 at 19:18
@Tim Thank you so much for your interest in my problem! Would you come in [this chat room I've created](http://chat.stackexchange.com/rooms/48898/occasional-excruciatingly-slow-resources-transfer-with-tomcat-on-aws-ec2) so we can discuss it in more detail there? — ccjmne, Nov 21 '16 at 19:27
Packets of size 1514 are normal. See http://www.speedguide.net/faq/is-an-ethernet-framepacket-1500-or-1514-bytes-450 — Jason Martin, Nov 22 '16 at 17:49
Try running an Iperf test described at https://aws.amazon.com/premiumsupport/knowledge-center/network-throughput-benchmark-linux-ec2/ , which will help determine if it is a throughput issue or application issue. Also, enable garbage collection logging to see if the application is spending all its time in GC. — Jason Martin, Nov 22 '16 at 17:52