Which is better: sending many small messages or fewer large ones?

Question

I have an app whose messaging granularity could be written two ways - sending many small messages vs. (possibly far) fewer larger ones. Conceptually what moves around is a set of 'alive' vertex IDs that might get filtered at each superstep based on a processed list (vertex value) that vertexes manage. The ones that survive to the end are the lucky winners. compute() calculates a set of 'new-to-me' incoming IDs that are perfect for the outgoing message, but I could easily send each ID one at a time. My guess is that sending fewer messages is more important, but then each set might contain thousands of IDs. Thank you.

P.S. A side question: The few custom message type examples I've found are relatively simple objects with a few primitive instance variables, rather than collections. Is it nutty to send around a collection of IDs as a message?

score 1 · Answer 1 · edited Sep 15 '14 at 08:10

I have used lists and even maps to be sent or just stored as vertex data, so that isn’t a problem. I think it shouldn’t matter for giraph which you want to choose, and I’d rather go with many simple small messages, as you will use Giraph appropriately. Instead you will need to go in the compute function through the list of messages and for each message through the list of IDs.

Performance-wise it shouldn’t make any difference. What I’ve rather found to make a big difference is, try to compute as much as possible in on cycle, as the switching between cycles and synchronising the messages, ... takes a lot of time. As long as that doesn’t change it should be more or less the same and probably much easier to read and maintain when you keep the size of messages small.

score 0 · Answer 2 · answered Oct 29 '16 at 16:08

In order to answer your question, you need understand the MessageStore interface and its implementations.

In a nutshell, under the hood, it took the following steps:

The worker receive the byte raw input of the messages and the destination IDs
The worker sort the messages and put them into A Map of A Map. The first map's key is the partition ID, the section map's key is the vertex ID. (It is kind of like the post office. The work is like the center hub, and it sort the letters into different zip code first, then in each zip code sorted by address)
When it is the vertex's turn of compute, a Iterable of that vertex's messages are passed to the vertex's compute method, and that's where you get the messages and use it.

So less and bigger messages are better because of less sorting if the total amount of bytes is the same for both cases.

score 0 · Answer 3 · answered Nov 01 '16 at 17:27

Also, you could send many small messages, but let Giraph convert this into a long one (almost) automatically. You can use Combiners.

The documentation on this subject is terrible on Giraph site, but you maybe could extract an example from the book Practical Graph Analytics with Apache Giraph.

This depends on the type of messages that you are sending, mainly.

Which is better: sending many small messages or fewer large ones?

3 Answers3