
I am developing an IoT application that requires me to handle many small unstructured messages (meaning that their fields can change over time: some can appear and others can disappear). These messages typically have between 2 and 15 fields, whose values belong to basic data types (ints/longs, strings, booleans). These messages fit the JSON data format (or msgpack) very well.

It is critical that the messages get processed in their order of arrival (that is, they need to be processed by a single thread; there is no way to parallelize this part). I have my own logic for handling these messages in real time (the throughput is relatively small, a few hundred thousand messages per second at most), but there is an increasing need for the engine to be able to simulate/replay previous periods by replaying a history of messages. Though it wasn't initially written for that purpose, my event processing engine (written in Go) could very well handle dozens (maybe low hundreds) of millions of messages per second if I were able to feed it historical data fast enough.

This is exactly the problem. I have been storing many (hundreds of billions) of these messages over a long period of time (several years), for now in delimited msgpack format (https://github.com/msgpack/msgpack-python#streaming-unpacking). In this setting and others (see below), I was able to benchmark peak parsing speeds of ~2M messages/second (on a 2019 MacBook Pro, parsing only), which is far from saturating disk IO.

Even leaving IO aside, doing the following:

import json
message = {
    'meta1': "measurement",
    'location': "NYC",
    'time': "20200101",
    'value1': 1.0,
    'value2': 2.0,
    'value3': 3.0,
    'value4': 4.0
}
json_message = json.dumps(message)

%%timeit
json.loads(json_message)

gives me a parsing time of 3 microseconds/message, that is, slightly above 300k messages/second. Trying ujson, rapidjson and orjson instead of the standard library's json module, I was able to get peak speeds of 1 microsecond/message (with ujson), that is, about 1M messages/second.

Msgpack is slightly better:

import msgpack
message = {
    'meta1': "measurement",
    'location': "NYC",
    'time': "20200101",
    'value1': 1.0,
    'value2': 2.0,
    'value3': 3.0,
    'value4': 4.0
}
msgpack_message = msgpack.packb(message)

%%timeit
msgpack.unpackb(msgpack_message)

gives me a processing time of ~750 ns/message (about 100 ns/field), that is, about 1.3M messages/second. I initially thought that C++ could be much faster. Here's an example using nlohmann/json, though it is not directly comparable with msgpack:

#include <iostream>
#include <string>
#include "json.hpp"

using json = nlohmann::json;

const std::string message = "{\"value\": \"hello\"}";

int main() {
  auto jsonMessage = json::parse(message);
  for(size_t i=0; i<1000000; ++i) {
    jsonMessage = json::parse(message);
  }
  std::cout << jsonMessage["value"] << std::endl; // To avoid having the compiler optimize the loop away. 
};

Compiling with clang 11.0.3 (-std=c++17, -O3), this runs in ~1.4 s on the same MacBook, that is to say a parsing speed of ~700k messages/second, with even smaller messages than in the Python example. I know that nlohmann/json can be quite slow; I was able to get parsing speeds of about 2M messages/second using simdjson's DOM API.
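For reference, the kind of simdjson DOM benchmark I have in mind looks like this (a sketch, not my exact harness, using simdjson's default exception-throwing API):

#include <iostream>
#include <string>
#include <string_view>
#include "simdjson.h"

const std::string message = "{\"value\": \"hello\"}";

int main() {
  simdjson::dom::parser parser; // Reused so that its internal buffers are recycled across iterations.
  size_t totalLength = 0;
  for(size_t i=0; i<1000000; ++i) {
    simdjson::dom::element doc = parser.parse(message);
    std::string_view value = doc["value"];
    totalLength += value.size(); // To avoid having the compiler optimize the loop away.
  }
  std::cout << totalLength << std::endl;
}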

This is still far too slow for my use case. I am open to all suggestions to improve message parsing speed with potential applications in Python, C++, Java (or whatever JVM language) or Go.

Notes:

  • I do not necessarily care about the size of the messages on disk (consider it a plus if the storage method you suggest is memory-efficient).
  • All I need is a key-value model for basic data types - I do not need nested dictionaries or lists.
  • Converting the existing data is not an issue at all. I am simply looking for something read-optimized.
  • I do not necessarily need to parse the entire thing into a struct or a custom object, only to access some of the fields when I need it (I typically need a small fraction of the fields of each message) - it is fine if this comes with a penalty, as long as the penalty does not destroy the whole application's throughput.
  • I am open to custom/slightly unsafe solutions.
  • Any format I choose to use needs to be naturally delimited, in the sense that the messages will be written serially to a file (I am currently using one file per day, which is sufficient for my use case). I've had issues in the past with improperly delimited messages (see writeDelimitedTo in the Java Protobuf API: lose a single byte and the entire file is ruined). See the sketch after this list for the kind of framing I have in mind.
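To make that last point concrete, here is a rough illustration (not my actual code) of the kind of simple length-prefixed framing I have in mind; the payload encoding itself is left open:

#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Append one message as a 4-byte native-endian length followed by the payload bytes.
void writeFrame(std::ofstream& out, const std::string& payload) {
    const uint32_t len = static_cast<uint32_t>(payload.size());
    out.write(reinterpret_cast<const char*>(&len), sizeof(len));
    out.write(payload.data(), payload.size());
}

// Read the next frame; returns false on end of file or on a short read.
bool readFrame(std::ifstream& in, std::vector<char>& payload) {
    uint32_t len;
    if (!in.read(reinterpret_cast<char*>(&len), sizeof(len))) return false;
    payload.resize(len);
    return static_cast<bool>(in.read(payload.data(), payload.size()));
}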

Things I have already explored:

  • JSON: experimented with rapidjson, simdjson, nlohmann/json, etc.
  • Flat files with delimited msgpack (see this API: https://github.com/msgpack/msgpack-python#streaming-unpacking): what I am currently using to store the messages.
  • Protocol Buffers: slightly faster, but does not really fit with the unstructured nature of the data.

Thanks!!

Ben
  • Have you benchmarked doing your 2nd option (flat files with delimited msgpack) in C++ with memory mapped files (see [`boost.interprocess`](https://www.boost.org/doc/libs/1_67_0/doc/html/interprocess/sharedmemorybetweenprocesses.html#interprocess.sharedmemorybetweenprocesses.mapped_file))? – Steve Lorimer May 05 '20 at 14:18
  • @SteveLorimer Thank you for your answer! As a first step with msgpack in C++, I ran the following benchmark: ```c++ #include <msgpack.hpp> #include <string> const std::string message = "\x81\xa5hello\xa5world"; int main() { for(size_t i=0; i<10000000; ++i){ msgpack::object_handle oh = msgpack::unpack(message.data(), message.size()); msgpack::object deserialized = oh.get(); } }; ``` Compiling with -O3, this runs in 4.18 seconds, that is about 2.4M messages/second. Am I missing something? – Ben May 05 '20 at 14:49
  • Interestingly, a Python version of this benchmark runs faster: ```python msgpack_mess = b'\x81\xa5hello\xa5world' %%timeit m = msgpack.unpackb(msgpack_mess) ``` runs at about 4M messages/second. I am very likely missing something. – Ben May 05 '20 at 14:54
  • If you want very fast message management, you likely need to: 1. avoid allocations like the ones made by most json libraries; 2. avoid variant/dynamic types; 3. work on message chunks/buffers; 4. enable parallelism. You state that `there is no way to parallelize this part` but I do not think this is true: messages could be decoded/encoded in parallel although the computation and the transfer are sequential. Note that if you do not care about the portability of the stored/loaded files, you can do a *much faster* message encoding/decoding (using plain struct mapping). – Jérôme Richard May 05 '20 at 18:13
  • Good point. Raw structs give the highest level of performance (100+M messages/second, saturating disk IO on my SSD). This is more than what I need (and ideally what I'd like to have). I don't really care about the portability so that's not a problem in itself, however structs are not ideal since they use a fixed schema, which is not really suitable for my use case. – Ben May 05 '20 at 20:46

1 Answer


I assume that messages only contain a few named attributes of basic types (defined at runtime), and that these basic types are, for example, strings, integers and floating-point numbers.

For the implementation to be fast, it is better to:

  • avoid text parsing (slow because sequential and full of conditionals);
  • avoid checking if messages are ill-formed (not needed here as they should all be well-formed);
  • avoid allocations as much as possible;
  • work on message chunks.

Thus, we first need to design a simple and fast binary message protocol:

A binary message contains the number of its attributes (encoded on 1 byte) followed by the list of attributes. Each attribute contains its name as a size-prefixed string (size encoded on 1 byte), followed by the type of the attribute (the index of the type in the std::variant, encoded on 1 byte), followed by the attribute value (a size-prefixed string, a native-endian 64-bit integer or an IEEE-754 64-bit floating-point number).

Each encoded message is a stream of bytes that can fit in a large buffer (allocated once and reused for multiple incoming messages).

Here is the code to decode a message from a raw binary buffer:

#include <climits>
#include <cstdint>
#include <cstring>
#include <string_view>
#include <unordered_map>
#include <variant>

// Define the possible types here
using AttrType = std::variant<std::string_view, int64_t, double>;

// Decode the `msgData` buffer and write the decoded message into `result`.
// Assume the message is not ill-formed!
// msgData must not be freed or modified while the resulting map is being used.
void decode(const char* msgData, std::unordered_map<std::string_view, AttrType>& result)
{
    static_assert(CHAR_BIT == 8);

    const size_t attrCount = static_cast<unsigned char>(msgData[0]); // Cast: plain char may be signed.
    size_t cur = 1;

    result.clear();

    for(size_t i=0 ; i<attrCount ; ++i)
    {
        const size_t keyLen = static_cast<unsigned char>(msgData[cur]);
        std::string_view key(msgData+cur+1, keyLen);
        cur += 1 + keyLen;
        const size_t attrType = static_cast<unsigned char>(msgData[cur]);
        cur++;

        // A switch could be better if there are more types
        if(attrType == 0) // std::string_view
        {
            const size_t valueLen = static_cast<unsigned char>(msgData[cur]);
            std::string_view value(msgData+cur+1, valueLen);
            cur += 1 + valueLen;

            result[key] = std::move(AttrType(value));
        }
        else if(attrType == 1) // Native-endian 64-bit integer
        {
            int64_t value;

            // Required to not break the strict aliasing rule
            std::memcpy(&value, msgData+cur, sizeof(int64_t));
            cur += sizeof(int64_t);

            result[key] = std::move(AttrType(value));
        }
        else // IEEE-754 double
        {
            double value;

            // Required to not break the strict aliasing rule
            std::memcpy(&value, msgData+cur, sizeof(double));
            cur += sizeof(double);

            result[key] = std::move(AttrType(value));
        }
    }
}

You probably need to write the encoding function too (based on the same idea).
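For example, an untested sketch of such an encoding function could look like this (same byte layout as decode, assuming at most 255 attributes per message and at most 255 bytes per name/string value):

#include <cstdint>
#include <string_view>
#include <utility>
#include <variant>
#include <vector>

using AttrType = std::variant<std::string_view, int64_t, double>;

// Append one encoded message to `out` (a buffer that can be reused across messages).
void encode(const std::vector<std::pair<std::string_view, AttrType>>& attrs, std::vector<char>& out)
{
    out.push_back(static_cast<char>(attrs.size()));

    for(const auto& [key, value] : attrs)
    {
        out.push_back(static_cast<char>(key.size()));
        out.insert(out.end(), key.begin(), key.end());
        out.push_back(static_cast<char>(value.index())); // 0: string, 1: int64_t, 2: double

        if(const auto* str = std::get_if<std::string_view>(&value))
        {
            out.push_back(static_cast<char>(str->size()));
            out.insert(out.end(), str->begin(), str->end());
        }
        else if(const auto* intValue = std::get_if<int64_t>(&value))
        {
            const char* bytes = reinterpret_cast<const char*>(intValue);
            out.insert(out.end(), bytes, bytes + sizeof(int64_t)); // Native-endian, as in decode.
        }
        else
        {
            const double dblValue = std::get<double>(value);
            const char* bytes = reinterpret_cast<const char*>(&dblValue);
            out.insert(out.end(), bytes, bytes + sizeof(double)); // IEEE-754, native-endian.
        }
    }
}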

Here is an example of usage (based on your json-related code):

#include <iostream>

const char* message = "\x01\x05value\x00\x05hello";

void bench()
{
    std::unordered_map<std::string_view, AttrType> decodedMsg;
    decodedMsg.reserve(16);

    decode(message, decodedMsg);

    for(size_t i=0; i<1000*1000; ++i)
    {
        decode(message, decodedMsg);
    }

    std::visit([](const auto& v) { std::cout << "Result: " << v << std::endl; }, decodedMsg["value"]);
}

On my machine (with an Intel i7-9700KF processor) and based on your benchmark, I get 2.7M messages/s with the code using the nlohmann json library and 35.4M messages/s with the new code.

Note that this code could be made much faster. Indeed, most of the time is currently spent in hashing and allocations. You can mitigate the problem by using a faster hash-map implementation (e.g. boost::container::flat_map or ska::bytell_hash_map) and/or a custom allocator. An alternative is to build your own carefully tuned hash-map implementation. Another alternative is to use a vector of key-value pairs and perform lookups with a linear search (this should be fast because your messages do not have many attributes and because you said you only need a small fraction of the fields of each message). However, the larger the messages, the slower the decoding, so you may need to leverage parallelism to decode message chunks faster. With all of that, it should be possible to reach more than 100M messages/s.
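For example, a (not benchmarked) sketch of the vector-based alternative could be:

#include <cstdint>
#include <string_view>
#include <utility>
#include <variant>
#include <vector>

using AttrType = std::variant<std::string_view, int64_t, double>;

// Flat message representation: no hashing and no per-node allocation.
struct FlatMessage
{
    // Reused across messages by calling attrs.clear() before decoding the next one.
    std::vector<std::pair<std::string_view, AttrType>> attrs;

    // Linear search: fast in practice since messages only have 2-15 attributes.
    const AttrType* find(std::string_view key) const
    {
        for(const auto& [k, v] : attrs)
            if(k == key)
                return &v;
        return nullptr;
    }
};

The decode function above would then simply do attrs.emplace_back(key, value) instead of inserting into the std::unordered_map.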

Jérôme Richard