
The goal of this application is to handle 800 simultaneous TCP clients, each of which sends a 3.5 kB XML message every second. Each of these messages needs to be parsed (see the code snippet below). This parsing happens on different threads.

The limitation of this project is that it has to run on a small Raspberry Pi 3 (1.2 GHz quad-core, 1 GB RAM), and I run into utilization issues when I increase the load above 150 simultaneous clients (80% CPU usage).

When I run this program on my development machine it seems to run very well (0-1% usage under 150 clients). I understand that my development machine is more powerful than the RPi and therefore runs better, but the difference seems too big.

In my current setup I use Java NIO to handle/read all the incoming connections on a single thread. Then I use multiple threads to process the data.
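For context, a minimal sketch of this kind of setup (not the original code; the port number, pool size, buffer size, and class names are all illustrative): a single selector thread accepts connections and reads from them, and whatever was read is handed off to a fixed pool of worker threads.

import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.*;
import java.util.Iterator;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class NioServerSketch {

    // One worker per core on a Pi 3; purely illustrative.
    private static final ExecutorService WORKERS = Executors.newFixedThreadPool(4);

    public static void main(String[] args) throws IOException {
        Selector selector = Selector.open();
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress(9000));
        server.configureBlocking(false);
        server.register(selector, SelectionKey.OP_ACCEPT);

        while (true) {
            selector.select();
            Iterator<SelectionKey> keys = selector.selectedKeys().iterator();
            while (keys.hasNext()) {
                SelectionKey key = keys.next();
                keys.remove();
                if (key.isAcceptable()) {
                    SocketChannel client = server.accept();
                    if (client != null) {
                        client.configureBlocking(false);
                        // Attach a per-connection read buffer (messages are ~3.5 kB).
                        client.register(selector, SelectionKey.OP_READ, ByteBuffer.allocate(4096));
                    }
                } else if (key.isReadable()) {
                    SocketChannel client = (SocketChannel) key.channel();
                    ByteBuffer buffer = (ByteBuffer) key.attachment();
                    int n = client.read(buffer);
                    if (n == -1) {
                        key.cancel();
                        client.close();
                    } else if (n > 0) {
                        // Copy what was read and hand it to a worker thread.
                        // (Message framing is omitted for brevity.)
                        buffer.flip();
                        ByteBuffer copy = ByteBuffer.allocate(buffer.remaining());
                        copy.put(buffer).flip();
                        buffer.clear();
                        WORKERS.submit(() -> process(copy));
                    }
                }
            }
        }
    }

    private static void process(ByteBuffer message) {
        // Parsing of the XML payload would happen here.
    }
}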

Current Setup

This is the simple code that currently runs on the processing thread. I also tried reading a plain byte[] array one byte at a time, and even reading through a StAX stream. In every variation of reading that I tried, the 'read'-type operation gives the worst performance.

BufferedInputStream input = new BufferedInputStream(
        new ByteArrayInputStream(buffer.array(), 0, bytecount));
int current;
/* In this snippet input.read() is the cause of the performance issues.
   Reading directly from the byte[] gives similarly poor performance. */
while ((current = input.read()) != -1) {
    continue; // drain one byte at a time
}

According to my profiler, the input.read() call uses a huge amount of processing power on the Pi and accounts for 97% of the total CPU time. The other 3% goes to the main thread that handles the connections.

On my development machine this is almost flipped: the main thread accounts for most of the CPU usage (93%), and 7% goes to the processing threads.
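To isolate the cost of the per-byte read from the networking code and the profiler itself, a standalone micro-benchmark along these lines (illustrative, not part of the original application) can be run on both machines:

import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;

public class ReadBenchmark {
    public static void main(String[] args) throws IOException {
        byte[] payload = new byte[3584]; // roughly one 3.5 kB XML message
        int iterations = 100_000;
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            BufferedInputStream in = new BufferedInputStream(
                    new ByteArrayInputStream(payload));
            // Same per-byte read pattern as the processing thread.
            while (in.read() != -1) {
                continue;
            }
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(iterations + " messages drained in " + elapsedMs + " ms");
    }
}

If the Pi is dramatically slower per message here as well, the problem lies in the JVM's execution of the read loop rather than in the connection handling.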

What could cause this big difference? Why is this read() call so expensive on the Pi compared to my other machine? Could it have something to do with memory?

Notes:

  • The Pi runs Raspbian Linux with OpenJDK 1.8.0_40-internal.
  • The dev machine runs Windows 10 with Java(TM) SE Runtime Environment (build 1.8.0_121-b13).
  • Tried running with the -Xms/-Xmx flags on both machines, same result (see the example invocation below).
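For reference, the heap flags were passed on the command line in the usual way; the 256 MB values are only illustrative, and server.jar stands in for the application jar:

java -Xms256m -Xmx256m -jar server.jar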
  • The Pi has **absolutely the worst IO**. Your PC has a massive spinning disc, or possibly even an SSD - this is **orders of magnitude** faster than a crappy SD card. And that's not even starting on CPU architecture, number of cores, hyperthreading etc. – Boris the Spider Jan 31 '17 at 11:31
  • I assume that you are running this over Ethernet? This article gives some interesting information: https://www.jeffgeerling.com/blogs/jeff-geerling/getting-gigabit-networking Note that it says "However, for many real-world use cases, the Pi's other subsystems (CPU and disk I/O especially, since I/O is on a single, shared USB 2.0 bus) will limit the available bandwidth." I have also read that you need to make sure that you get enough power, as too little reduces performance (max draw should be around 1,100 mA, but this might vary by model) – jason.kaisersmith Jan 31 '17 at 11:43
  • The application doesn't do any (explicit) IO to disk. The Pi can handle the socket IO just fine (without processing); in the main thread the sockets get read and stored in ByteBuffers. The processing thread should then read from RAM, if I understand correctly. – Georggroenendaal Jan 31 '17 at 11:53
  • Also, the RPi3's hardware is ARMv8 (with a Cortex-A53), but the Raspbian image is legacy ARM-32 armhf (it's not even AArch32). Raspbian is not using the device to its potential. You might consider using an AArch64 image provided by openSUSE, CentOS or one of the other distros. – jww Jan 31 '17 at 11:56
  • "_And eventually stored in a database._" - so this database is not on the Pi I take it? Please make that clear. – Boris the Spider Jan 31 '17 at 12:07
  • @BoristheSpider No, the database runs on a different machine, but this is not yet implemented. Currently only the simple code snippet runs, and that is where I run into issues. – Georggroenendaal Jan 31 '17 at 12:29
  • It is a bit of an odd construction with lots of overhead, if I read this correctly. I wonder what the purpose is of a BufferedInputStream when you're already taking everything into memory inside a ByteArrayInputStream... only to then take it out 1 byte at a time again. – Gimby Jan 31 '17 at 13:29
  • @Gimby This snippet was the last one I tried, hoping the buffer would improve performance. At first I tried just using a simple byte[] array and reading that one byte at a time, with a very similar result; basically every kind of read operation I tried takes lots of CPU (I also tried ByteArrayInputStream and even Java StAX, and every time the 'read' call performs badly). – Georggroenendaal Jan 31 '17 at 13:37
  • Consider adding that information to the question; it basically rules out that it is related to the specific snippet of code. – Gimby Jan 31 '17 at 13:51
  • Does it make much difference to the timings if you change the number of processing threads (to, say, 3)? – Klitos Kyriacou Jan 31 '17 at 14:29
  • @jww Just installed openSUSE for ARMv8, and this resolved my issue. The Pi can now handle the same load easily; it can even handle the full 800 requests/s (15% usage). Note that I now use Java(TM) SE Runtime build 1.8.0_121-b13. It is still a mystery what caused the poor performance, the JVM or the OS or both, but the issue seems resolved, so thanks a lot! – Georggroenendaal Jan 31 '17 at 15:39
  • If you have only a single client with a single connection, sending 200 requests/second, can your hardware handle it? If not, you hit the limit. – ZhongYu Jan 31 '17 at 16:06
  • @Georggroenendaal - *"Still a mystery what caused the poor performance...."* - The performance characteristics of the Cortex-A53 and Cortex-A{7|8|9} are different. I believe the answer lies in the various ARM Optimization Guides. – jww Jan 31 '17 at 16:45
  • @Georggroenendaal - I recently got bit by differences between the Cortex-A53 and Cortex-A57. Both are ARMv8, and the A57 is the higher-end model. But my [BLAKE2b implementation for Crypto++ ran slower on the A57](https://github.com/weidai11/cryptopp/issues/367). It was even slower than a C++ implementation. It turned out NEON/ASIMD shifts were more expensive: they had a 7 cycle latency and could only be run on one of the pipes. A non-NEON implementation using integer ops was faster. The Optimization Guide gave us the details and explained it. – jww Jan 31 '17 at 16:47

1 Answer


It turns out that the problem was a combination of both the JVM and the 32-bit OS on the Raspberry Pi 3. When running 32-bit Raspbian with OpenJDK, my application had very poor performance (especially on the 'read' calls). Switching to the Oracle JVM gave me the 'better' expected performance.

However, when switching to a 64-bit OS (openSUSE in my case), the performance was good with both OpenJDK and the Oracle JVM.
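One quick way to verify which VM and architecture a given setup actually runs on is to print the standard system properties from anywhere in the application; on 32-bit Raspbian, os.arch typically reports "arm", while an AArch64 OS reports "aarch64":

// Architecture the JVM was built for (not necessarily the CPU's full capability).
System.out.println(System.getProperty("os.arch"));
// VM name, e.g. OpenJDK vs. Oracle HotSpot.
System.out.println(System.getProperty("java.vm.name"));
// Exact runtime version.
System.out.println(System.getProperty("java.runtime.version"));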

(Credit to @jww in the comments for suggesting the switch to a 64-bit OS.)
