I'm currently working on a latency-critical application. After profiling my code, I concluded that the bottleneck might be related to my WebSocket client. I don't control the server, only the client.

Context

I'm running a c5d.xlarge Ubuntu instance with ENA (Elastic Network Adapter) in the same availability zone as the server. If I ping the server from my instance, the latency is below 1 ms (usually between 0.2 and 0.5 ms). My code is written in Haskell, and this is the WebSocket library I use: http://hackage.haskell.org/package/websockets

The difference between the timestamp the server puts on a message and the timestamp I create upon receiving it is usually below 10 ms, which is good enough for me. However, this difference sometimes increases to 400-500 ms, or even to 1-1.5 s and above. I suspect this happens when the server is under heavy load.

I found no indication that my code, my instance setup, or the combination of both causes these latency spikes, but I was still skeptical. So I opened https://websocket.org/echo.html on my local machine (a Mac, using Chrome) to check the latencies from there. Since my machine is farther from the server than my instance, I get slightly higher latencies, but usually below 50 ms. The 400-500 ms latency spikes happen there as well.
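The latency figure described above can be computed along these lines. This is a minimal sketch rather than the actual client code: the function names (`parseServerTime`, `latencyMs`, `isSpike`), the ISO 8601 timestamp format, the sample values and the 400 ms threshold are all assumptions made for illustration. It uses only the `time` package.

```haskell
import Data.Time (UTCTime, defaultTimeLocale, diffUTCTime, parseTimeM)

-- | Parse a UTC timestamp such as "2018-08-13T19:43:00.123Z".
-- Assumption: the server stamps its messages in this ISO 8601 form.
parseServerTime :: String -> Maybe UTCTime
parseServerTime = parseTimeM True defaultTimeLocale "%Y-%m-%dT%H:%M:%S%QZ"

-- | Milliseconds between the server timestamp and the local receive time.
latencyMs :: UTCTime -> UTCTime -> Double
latencyMs sent received = realToFrac (diffUTCTime received sent) * 1000

-- | Hypothetical threshold: treat anything at or above 400 ms as a spike.
isSpike :: Double -> Bool
isSpike ms = ms >= 400

main :: IO ()
main = do
  -- Hard-coded sample values; a real client would take the receive
  -- time from getCurrentTime as soon as the frame arrives.
  let sample = (,) <$> parseServerTime "2018-08-13T19:43:00.100Z"
                   <*> parseServerTime "2018-08-13T19:43:00.550Z"
  case sample of
    Nothing -> putStrLn "could not parse timestamp"
    Just (sent, received) -> do
      let ms = latencyMs sent received
      putStrLn ("latency (ms): " ++ show ms)
      putStrLn ("spike: " ++ show (isSpike ms))
```

A difference measured this way compares two different clocks, so it is only meaningful to the extent that the client and server clocks are synchronised.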

Reproduction

In Chrome, open https://websocket.org/echo.html and connect to wss://www.bitmex.com/realtime. Then send the following message to the server and inspect the WebSocket frames under the Network tab in Chrome Developer Tools:

{"op": "subscribe", "args": ["orderBook10:XBTUSD"]}

Question

Is there any way to further minimise latency in such a situation? Are there any rules of thumb for configuring my WebSocket client connection that I might be missing? I tried using the WebSocket compression extensions, but that didn't seem to help. I feel that there isn't much I can do about this, since the server is not under my control. Thanks!

Lucsanszky
  • Could the delay be caused by garbage collection? – danidiaz Aug 13 '18 at 19:43
  • Thanks for the response @danidiaz. Unfortunately, I doubt that this is caused by GC (this was my initial guess as well). If I run the program without threading, I get: `%GC time 6.4% (0.1% elapsed)`. With threading enabled, GC is still less than 15% and less than 1% of elapsed time. Also, I observe the same behaviour in the Chrome browser when examining the WS traffic at websocket.org, or at bitmex.com. I get latency spikes when the server is under heavy load (presumably) and hence my question whether I can do anything about this when the server is not under my control. Thanks! – Lucsanszky Aug 14 '18 at 10:38
  • @KylorRAM Have you tried testing the server with another WebSocket library, or perhaps a command-line tool? Just to rule out Haskell-specific causes. – danidiaz Aug 14 '18 at 18:59
  • @danidiaz, that's what I did, see the reproduction step in my question. I used the WS client provided by the https://websocket.org/echo.html site to check the connectivity from there. My reasoning was the same as yours: get rid of the Haskell-specific things to see whether it's a Haskell problem or something else. Sorry if that was not clear from my post. – Lucsanszky Aug 14 '18 at 20:26
  • Perhaps running tcpdump or profiling on the server side could narrow down the problem space, e.g. whether it is the arriving or the outgoing message that is delayed. However, it seems that c5 / ENA have (or had) some driver issues; there might be optimisations in Amazon Linux (see https://www.reddit.com/r/aws/comments/7whfhn/new_nitro_based_m5c5_instances_seem_unstable/). – Marcel Boldt Aug 14 '18 at 20:59
  • Great, thank you @MarcelBoldt! I'll look into these and post the results here. – Lucsanszky Aug 15 '18 at 13:17
  • @MarcelBoldt although the server is not under my control (it's under the exchange's control), I measured the RTT, avg throughput and segment length from both my local machine and my c5d.xlarge instance. Results: https://ibb.co/ifeTYK https://ibb.co/ghNRLz https://ibb.co/gf8Bne https://ibb.co/g9X47e Sorry for the different X-axis in one of the graphs! Let me know if you want to have a look at the .pcap file (I only have it for the c5d instance though). – Lucsanszky Aug 19 '18 at 14:40

0 Answers