3

I tried the example from "http://thrift-tutorial.readthedocs.org/en/latest/usage-example.html". The example just calculates the product of two numbers. Server: Java, client: Python.

If I get the product via Thrift 3000 times, the elapsed time is ~4.8 s. If I define a simple multiply function in Python and call it directly 3000 times, the elapsed time is ~0.007 s (~686x faster).
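For reference, the direct-call half of that comparison can be reproduced in a few lines of Python (the Thrift half needs the tutorial's generated client and a running Java server, so only the local benchmark is sketched here):

```python
import time

def multiply(a, b):
    return a * b

# Time 3000 direct in-process calls, mirroring the measurement above.
start = time.perf_counter()
total = 0
for i in range(3000):
    total += multiply(i, 2)
elapsed = time.perf_counter() - start
print(f"3000 direct calls took {elapsed:.4f}s (checksum {total})")
```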

So how can I improve the performance? I want to build an application and split it into several sub-applications. They may be implemented in different languages and will communicate with each other via Thrift, but with performance this poor, should I instead combine them into a single application?

App-A (Java)                   App-B (Python)
     |                                 |
     |------------ App-C (C++) --------|

or

App-A+C (Java)                   App-B+C (Python)
(implement C in Java)            (implement C in Python)
William
  • 75
  • 1
  • 6

2 Answers

3

Two key optimizations you can set as goals:

  • Send all the data you already have before waiting.
  • Don't send a computed result across the channel if the only thing done with it is to send it straight back.

What you have described in your question is an extreme case of a "chatty protocol". The network has latency (delay). If you wait for each result before starting the next computation, most of the time is spent waiting for the network transfer, not for the actual computation. By sending another computation before receiving the first result, you can improve throughput dramatically.

So the simplest thing is to allow overlapping requests. The product of the second pair of values doesn't depend on the first result, so don't wait for the first result to arrive.
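The effect can be demonstrated without Thrift at all by simulating a fixed per-call latency (the 2 ms figure and the thread-pool approach below are illustrative, not from the tutorial):

```python
import time
from concurrent.futures import ThreadPoolExecutor

LATENCY = 0.002  # pretend each remote call costs 2 ms of round-trip delay

def remote_multiply(a, b):
    time.sleep(LATENCY)  # stand-in for the network round trip
    return a * b

pairs = [(i, 2) for i in range(50)]

# Chatty: wait for each result before issuing the next call.
start = time.perf_counter()
serial = [remote_multiply(a, b) for a, b in pairs]
serial_time = time.perf_counter() - start

# Overlapped: issue many calls before the first result arrives.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=10) as pool:
    overlapped = list(pool.map(lambda p: remote_multiply(*p), pairs))
overlapped_time = time.perf_counter() - start

print(f"serial {serial_time:.3f}s vs overlapped {overlapped_time:.3f}s")
```

The results are identical either way; only the waiting changes. With 10 in-flight requests, the total latency cost drops roughly tenfold.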

When you are dealing with local IPC, that doesn't help as much. The cost of communication isn't delay; it's message processing and thread synchronization, which depend on the number of requests but not so much on their order.

A bigger change with a larger payoff is to make each request represent a complex algorithm. For example, instead of a remote call to multiply two numbers, try a remote call for an entire filtering operation, where the argument is an entire data vector or matrix, and the server performs the FFT, multiplies, runs the inverse FFT, scales, and then passes the result back. This satisfies both of the original goals: all available data is sent together instead of singly, reducing time spent waiting, and total network traffic is reduced because intermediate results don't have to be exchanged.
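As a minimal sketch of this idea in plain Python (`filter_signal` is an invented name; a real Thrift service would declare a method like it in the IDL, taking the whole data vector as one argument):

```python
def filter_signal(signal, kernel):
    """Server side: run the entire filtering pipeline inside one request,
    so intermediate results never cross the channel."""
    n, k = len(signal), len(kernel)
    # Simple sliding-window convolution standing in for the FFT pipeline.
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(n - k + 1)]

# One round trip carries the whole vector instead of one number at a time.
print(filter_signal([1, 2, 3, 4, 5], [1, 1, 1]))  # [6, 9, 12]
```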


A final alternative is to link code from all three languages into a single process, so that data access and function calls are direct. Many languages allow building objects that export plain "C" functions and data.
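For example, Python can call such an exported "C" function in-process via `ctypes`, with no serialization or message passing involved (this sketch assumes a POSIX system where the C math library can be located):

```python
import ctypes
import ctypes.util

# Load the C math library and call one of its plain "C" functions
# directly; the call never leaves the current process.
libm = ctypes.CDLL(ctypes.util.find_library("m"))
libm.sqrt.restype = ctypes.c_double
libm.sqrt.argtypes = [ctypes.c_double]
print(libm.sqrt(9.0))  # 3.0
```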

Also, virtual machines such as .NET run intermediate languages that can be generated by compiling different source languages. With .NET you have C# (Java-like), C++/CLI (supports full C++, plus extensions for working on .NET data), and IronPython, which cover your question's diagram. Plus F#, JavaScript, a Ruby variant, and on and on. The Java virtual machine was designed specifically for Java, but people have written Clojure and other languages that compile to its bytecode.

The advantage of the virtual machine technique is that it enables some cross-language optimization (.NET JIT does cross-module inlining). The disadvantage is that your performance is dictated by JIT optimizations, which generally are the lowest common denominator. C++/CLI actually is really good for bridging this gap, because it supports fully-optimized native code (including SIMD), .NET intermediate language (MSIL), and the lowest overhead layer for communicating between them (C++ "It Just Works" interop).

But you could accomplish about the same thing on the Java VM by using JNI to interface with fully-optimized C++ code for intense number crunching using SIMD.

Ben Voigt
  • 277,958
  • 43
  • 419
  • 720
  • My example is just a dummy. OK, suppose I have a server and thousands of clients sending requests at the same time; my server serves each client's request by communicating via Thrift with "something" that does the actual work. The problem is that my server can call a function directly ~700x faster than via Thrift, no matter how complex the actual work is. I had thought Thrift helped one app call a function from another directly (not via the network), the way Python calls a C++ extension. – William Apr 03 '14 at 19:10
  • @William: You can't make a direct call across process boundaries. What you have can be compared more to a local network inside the computer. It's still important to minimize the number of messages and amount of copying. Local IPC can use tricks like shared memory to avoid copying, but there's still going to be some overhead to telling another thread to check the shared data structure. – Ben Voigt Apr 03 '14 at 19:12
  • @William: And I disagree with your "no matter how complex actual work is". If you put all the work into only a handful of messages, then even though the function call is 700x slower, it's going from .001% of your program runtime to .701% of your program runtime. In other words, too small to worry about. – Ben Voigt Apr 03 '14 at 19:14
  • @William: If calls across process boundaries really are too expensive for you, then look for a way to make a single program from multiple languages. For example, .NET can load C# (Java-like), C++, and IronPython (python) into a single process and communicate with each other at low cost. – Ben Voigt Apr 03 '14 at 19:16
  • Ok, I get the point. I have an extra question about my system. I have 2 sub-apps (A (Java), B (Python)), both of which want to connect to a database (MySQL). Should I separate database access into another sub-app (C) or define it in each sub-app (A + B)? If I separate it, C can be used by many sub-apps but is slower. If I combine, my system will have better performance but I have to define database access in each sub-app. – William Apr 03 '14 at 19:22
  • @William: An in-memory database might actually be one of the fastest ways to communicate between cooperating processes. Ultimately you have to decide whether the database access support in each language is good enough or not. – Ben Voigt Apr 03 '14 at 19:25
  • Actually I don't know how real-world applications deal with this problem: many sub-apps, each of which needs to use the database. – William Apr 03 '14 at 19:29
  • 1
    "*I had thought Thrift help one app calls a function from other directly (not via network)*" - Did I mention that transport cost is not zero? And that this one applies also to IPC (without network)? Indeed, I think I said this. – JensG Apr 03 '14 at 19:42
  • @JensG: Yes, you said so. It's true, transport cost is not zero. What I want to discuss now is that many apps need to use the same database; how should that be designed? Use Thrift, implement database access in each app, or something else? – William Apr 03 '14 at 19:48
  • @Jens: Yes you clearly did. But I had used the word "network" in my explanation. – Ben Voigt Apr 03 '14 at 19:48
  • 1
    @William: Good question. It depends. It will not make much of a difference for the better if C acts only as a relay. However, if C does some kind of processing such that the amount of data passed between A-C and B-C, summed up, is significantly smaller than the traffic between C and your DB, and if the data retrieved by C can satisfy multiple requests from A and/or B, this could indeed increase overall performance. – JensG Apr 03 '14 at 19:54
1

Your comparison is based on an incorrect assumption: that a cross-process call is (at least) as fast as an in-process call. That is simply not true.

This is one of the famous eight fallacies of distributed computing, originated by Peter Deutsch and later extended by others, and it applies not only to networks but also to IPC on a single machine: contrary to what you think, transport cost is NOT zero.

From what I can tell based on your limited information, your ~1.6 ms per IPC round trip (4.8 s / 3000 calls) doesn't sound bad to me.

JensG
  • 13,148
  • 4
  • 45
  • 55
  • I am suffering from this problem; for the past 2 days I've been trying to figure out where I'm failing. At least now I know it's a lost battle to try to make it as fast as an in-process call. It takes 68 seconds(!) to finish a random sample... – Tony Tannous Feb 05 '17 at 17:19
  • 1
    @TonyTannous: If it is 68 seconds then it is very likely not the transport layer, unless you are trying to move insanely huge amounts of data. Without knowing more, I'd suggest to start checking the server end. – JensG Feb 06 '17 at 00:36
  • Actually I am sending a huge amount of data: 10MB from client to server. Would moving the 10MB in smaller parts be better? I divide the 1GB of data into chunks of 10MB and send... – Tony Tannous Feb 06 '17 at 06:47
  • WOW! By slicing the `1GB` file into chunks of `110KB` instead of `10MB`, it finished in 12 minutes instead of 22 minutes! Amazing. Thanks :) I already did the +1. – Tony Tannous Feb 06 '17 at 07:25
  • Great! Consider using either `TFramedTransport` or `TBufferedTransport` if you don't use it already. It may help to reduce the amount of memory allocations, hence improving performance. – JensG Feb 06 '17 at 14:18
  • I moved to a new PC with 24GB RAM and 8 processors instead of my machine with 8GB RAM and a dual core. After I understand how to install Thrift on `ubuntu` I will try it! I will use threads as well. One question: in Thrift, in order not to wait for results, do I just put `oneway` before generating from the thrift file? Thanks again :) – Tony Tannous Feb 06 '17 at 15:44
  • Yes, `oneway` does what it says. No results, and you don't even get exceptions back from the server. Once the request is sent, the client can move on w/o waiting any further. – JensG Feb 06 '17 at 17:13