
So I am writing a server in C#. It has a LiteDB database to store messages in. Every time the server receives a message, it stores the message in the DB, and sends a copy to the proper destination.

Since LiteDB is supposed to be thread-safe, I don't use locks to synchronize reads/writes to the DB.

I've made a test (on a WPF app) that connects 100 clients, each in a separate task. Each client sends a message to all other clients and expects to receive a message from all other clients. I then make sure that all messages were sent/received as expected. For each client and message, I add some random delay. (More on that in a bit.)
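For reference, the test loop is roughly shaped like this (an illustrative sketch only; `TestClient` and its methods are hypothetical stand-ins for the actual WPF test code):

// 100 clients, each in its own task; every client messages every other client
// with a small random delay, then waits for 99 incoming messages.
var clientTasks = Enumerable.Range(0, 100).Select(async id =>
{
    TestClient client = await TestClient.ConnectAsync(id);   // hypothetical helper
    foreach (int otherId in Enumerable.Range(0, 100).Where(o => o != id))
    {
        await Task.Delay(Random.Shared.Next(0, 50));         // random per-message delay
        await client.SendAsync(new DataMessage(id, otherId));
    }
    await client.ExpectMessagesAsync(count: 99);             // one from each other client
}).ToList();

await Task.WhenAll(clientTasks);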

I observe some weird behavior in the test that I would love some help with. The first 400-2,000 messages are properly received by the server, inserted into the DB, and forwarded to their destination. After a while, though, everything slows down to roughly one message per second. That is, I see console output saying that a message was received and is being inserted into the DB, but no output saying that the insert succeeded. I assume that insert is waiting on some internal LiteDB lock.

Eventually, after 1 minute of this, I get the following exception:

Exception thrown: 'LiteDB.LiteException' in LiteDB.dll
Database lock timeout when entering in transaction mode after 00:01:00

I guess this exception is thrown because there IS some deadlock; I just can't figure out what is causing it. (I don't really care about the exception per se; I only mention it because it may help pinpoint the cause of the actual problem.)

With 0 delay between messages, the issue appears much sooner. With a delay of around 50 ms, I manage to pass the test. So I assume either there is some deadlock in LiteDB, or (in the low-delay case) I have too many tasks running simultaneously. (I'm not sure the latter is actually possible.)

I've added a simplified version of my server's code.

public class Worker : BackgroundService
{
    private static readonly TimeSpan _collectionInterval = TimeSpan.FromSeconds(1);
    private static readonly TimeSpan _expirationThreshold = TimeSpan.FromMinutes(2);
    
    private readonly ConcurrentDictionary<string, Client> Clients = new();

    private readonly DBHandler _dbHandler = DBHandler.GetInstance(AppDomain.CurrentDomain.BaseDirectory);

    private readonly TcpListener listener; // simplified: initialized elsewhere

    protected override async Task ExecuteAsync(CancellationToken cancellationToken)
    {
        CancellationTokenSource cancellationTokenSource = CancellationTokenSource
                .CreateLinkedTokenSource(cancellationToken);
        CancellationToken token = cancellationTokenSource.Token;
    
        List<Task> tasks = new();

        try
        {
            listener.Start();

            while (!token.IsCancellationRequested)
            {
                TcpClient tcpClient;
                try
                {
                    // Accept an incoming client connection
                    tcpClient = await listener.AcceptTcpClientAsync(token);
                }
                catch (Exception)
                {
                    // Listener stopped or shutdown requested; stop accepting
                    cancellationTokenSource.Cancel();
                    break;
                }

                // Handle the client connection in a separate task
                tasks.Add(
                    HandleClientConnectionAsync(tcpClient, token)
                );
            }

            cancellationTokenSource.Cancel();
            await Task.WhenAll(tasks);
        }
        finally
        {
            _dbHandler.Dispose();
            listener.Stop();
            cancellationTokenSource.Dispose();
            token.ThrowIfCancellationRequested();
        }
    }

    private async Task HandleClientConnectionAsync(TcpClient tcpClient, CancellationToken token)
    {
        Client? client = null;

        try
        {
            client = new(tcpClient);
            client.EnableKeepAlive();

            await SendMessageAsync(new WelcomeMessage(), client, token);

            Message? message = await ReceiveMessageAsync(client, token);

            if (message is not null && Clients.TryAdd(message.FromStationID, client))
            {
                await ReceiveMessagesAsync(client, token);
            }
        }
        finally
        {
            // Clean up closed connection
            if (client is not null)
            {
                if (client.StationID is not null)
                {
                    Clients.TryRemove(client.StationID, out _);
                }
                client.CloseTcpAndStream();
            }
        }
    }
    
    private static async Task SendMessageAsync(Message message, Client client, CancellationToken token)
    {
        // Serialize the message, then prepend a 4-byte length header
        // (assumes Serialize() returns the string form that Deserialize() consumes)
        byte[] payload = Encoding.UTF8.GetBytes(message.Serialize());
        byte[] buffer = new byte[4 + payload.Length];
        BitConverter.GetBytes(payload.Length).CopyTo(buffer, 0);
        payload.CopyTo(buffer, 4);

        await client.WriteAsync(buffer, token);
    }
    
    private static async Task<Message?> ReceiveMessageAsync(Client client, CancellationToken token)
    {
        // Read the 4-byte length header, then the message body.
        // (Assumes client.ReadAsync(count, token) returns null once the connection closes.)
        byte[]? header = await client.ReadAsync(4, token);
        byte[]? buffer = header is null
            ? null
            : await client.ReadAsync(BitConverter.ToInt32(header, 0), token);

        return
            buffer is null ?
            null :
            Message.Deserialize(Encoding.UTF8.GetString(buffer, 0, buffer.Length));
    }

    private async Task ReceiveMessagesAsync(Client client, CancellationToken token)
    {
        while (client.IsConnected && !token.IsCancellationRequested)
        {
            Message? message = await ReceiveMessageAsync(client, token);
            
            if (token.IsCancellationRequested || message is null)
            {
                break;
            }

            await ProcessMessageByTypeAsync(message, token);
        }
    }

    private async Task ProcessMessageByTypeAsync(Message message, CancellationToken token)
    {
        if (message is null)
        {
            return;
        }
        else if (message is AckMessage ackMessage)
        {
            await ProcessAckMessageAsync(ackMessage, token);
        }
        else if (message is DataMessage || message is FeedbackMessage)
        {
            await ProcessDataMessageAsync(message, token);
        }
        // Ignore other messages
    }

    private async Task ProcessDataMessageAsync(Message message, CancellationToken token)
    {
        if (message is null)
        {
            return;
        }

        if (message.ToStationID != null)
        {
            _dbHandler.Insert(message);

            if (Clients.TryGetValue(message.ToStationID, out Client? client))
            {
                await SendMessageAsync(message, client, token);
            }
        }
    }

    private Task ProcessAckMessageAsync(AckMessage ackMessage, CancellationToken token)
    {
        _dbHandler.DeleteMessageByID(ackMessage.AckedMessageID);
        return Task.CompletedTask;
    }
}

And the relevant DB code:

private ILiteCollection<Message> GetMessages()
{
    var msgCollection = _dataBase
        .GetCollection<Message>(DB.MESSAGES);

    msgCollection.EnsureIndex((Message m) => m.MessageID);

    return msgCollection;
}

public BsonValue Insert(Message message)
{
    message.MessageID = 0;
    return GetMessages()
        .Insert(message);
}

public bool DeleteMessageByID(BsonValue messageID)
{
    return GetMessages()
        .Delete(messageID);
}
Ungoliant
  • According to the LiteDB docs, writes are protected by a lock internally. Simplistic databases that use locks always end up with deadlock victims when under load. You'll just have to catch and retry (e.g., use Polly). – Stephen Cleary Aug 03 '23 at 11:39
  • Thank you @StephenCleary! I did not understand what you meant. Can you please elaborate? (Also, what is Polly?) – Ungoliant Aug 03 '23 at 11:41
  • Also, ChatGPT recommended this (which I'm not sure is even true): Connection Pooling: LiteDB allows you to create multiple instances of the database and use them in a thread-safe manner. Create a pool of LiteDB instances and have each client connection acquire a separate instance from the pool when they need to interact with the database. This will reduce contention for the database lock. – Ungoliant Aug 03 '23 at 11:43
  • I'm saying that you can't avoid this exception; it's inherent due to the way LiteDB is implemented. You'll need to catch it and retry. – Stephen Cleary Aug 03 '23 at 11:45
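A catch-and-retry wrapper along these lines might look as follows (a sketch; assumes the Polly NuGet package and the `_dbHandler`/`message` names from the question):

// Retry LiteDB writes that fail with the lock-timeout exception,
// backing off a little longer on each attempt.
var retryPolicy = Policy
    .Handle<LiteDB.LiteException>()
    .WaitAndRetry(5, attempt => TimeSpan.FromMilliseconds(100 * attempt));

retryPolicy.Execute(() => _dbHandler.Insert(message));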
  • Ok, I understand now. However, that exception is not actually the problem. It seems to be merely a symptom of the actual problem, where after a few Inserts, only one Insert per second is executed, instead of many. I'm not sure whether there's some LiteDB deadlock or too many simultaneous tasks. – Ungoliant Aug 03 '23 at 11:52
  • Have you considered using a database that is designed to do this sort of thing, like PostgreSQL? – Robert Harvey Aug 03 '23 at 11:58
  • I have not, since I was under the impression that LiteDB should be able to handle this (thread-safe, etc.). I guess I was wrong? Could you elaborate on why, and on why PostgreSQL is better in this regard? – Ungoliant Aug 03 '23 at 12:00
  • Postgres and MySQL have fewer deadlocks, but they still happen. MSSQL has even fewer deadlocks, but still not zero. Deadlocks are due to the internal locking that most databases do in order to guarantee thread safety. The only deadlock-free architecture is single-writer. – Stephen Cleary Aug 03 '23 at 12:07
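A single-writer design could be sketched with System.Threading.Channels: every task enqueues its mutation and exactly one consumer task touches the database (only `_dbHandler` comes from the question; the rest is illustrative):

// All DB mutations funnel through one channel...
private readonly Channel<Message> _writeQueue = Channel.CreateUnbounded<Message>();

// ...and exactly one task drains it, so LiteDB never sees concurrent writers.
private async Task WriterLoopAsync(CancellationToken token)
{
    await foreach (Message msg in _writeQueue.Reader.ReadAllAsync(token))
    {
        _dbHandler.Insert(msg);
    }
}

// Producers enqueue instead of calling Insert directly:
// await _writeQueue.Writer.WriteAsync(message, token);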
  • _"The only deadlock-free architecture is single-writer"_ - which I believe you could do. On the other hand, I am starting to believe you could be better off with something like Redis, even. Then you wouldn't have to bother with manual cleanup of outdated entries. You just set a TTL and that's it. – Fildor Aug 03 '23 at 12:27
  • Would an AsyncReaderWriterLock help here in any way? – Ungoliant Aug 03 '23 at 12:32
  • That would make writing to the DB your bottleneck, and you would need to protect all write ops with the writer lock (create, update and delete). Not like in your first attempt, where you only protected the "GC" as writer. – Fildor Aug 03 '23 at 12:35
  • I personally would either find a better-suited DB if there is one _or_ try to minimize collisions. That means: as-is, all clients access _one_ DB and collection. Maybe you could minimize colliding locks by choosing a different key? Like use a MessageID _and_ a ClientID so you can (if LiteDB is capable of it) make it lock and access only a specific part of the collection? If that's not possible, then maybe one DB per client? – Fildor Aug 03 '23 at 12:41
  • ^^ _"Per collection writer locks"_ - it seems it's not possible to have a partial write lock on a collection. So, either make different collections (I haven't tried that), which may mean getting fancy with the built-in ORM ... or, as I said: one DB per client. – Fildor Aug 03 '23 at 12:45
  • `var col = db.GetCollection("customer");` - or not _that_ fancy after all. All collections must have a unique name. So you could have collections like `var col = db.GetCollection($"messages-{clientId}");`, and LiteDB has writer locks per collection. I would imagine that could already improve the situation immensely. – Fildor Aug 03 '23 at 12:47
  • On "ProcessDataMessageAsync", if I comment out "await SendMessageAsync(message, client, token);" or "_dbHandler.Insert(message);", the problem is fixed. I wonder what this means. – Ungoliant Aug 03 '23 at 13:47
  • Additionally, if I comment out "_dbHandler.DeleteMessageByID(ackMessage.AckedMessageID)", the issue is also solved. I wonder what this all means. – Ungoliant Aug 03 '23 at 13:54
  • I don't know about SendMessageAsync, but all of the other methods write to the same collection. That means they all pile up waiting for the same lock. – Fildor Aug 03 '23 at 14:18
  • I will test that for sure. Just trying other things before that. Is it possible that I am simply running too many tasks and the issue is not related to LiteDB at all? – Ungoliant Aug 03 '23 at 16:17
  • And how would I go about debugging this? – Ungoliant Aug 03 '23 at 16:17
  • BTW, it seems like the creator of LiteDB recommends a per-client collection, so you may be on the right track. I will try next week and update: https://github.com/mbdavid/LiteDB/issues/397 – Ungoliant Aug 03 '23 at 17:00
  • So, this intrigued me and I made a little dummy app, but I haven't had the time to benchmark it yet. I went for a "one collection per client" approach, which I will post as soon as I can confirm it actually solves your problem. I also need to clean it up from less-than-successful attempts' code residue ;) – Fildor Aug 09 '23 at 07:23
  • I forgot to update here. Apparently, there's a bug in LiteDB: https://github.com/mbdavid/LiteDB/issues/2307 Indeed, using per-client collections will probably make the issue a lot less likely to occur; however, it will not solve it. For now, I am using a reader/writer async lock. I still seem to benefit from LiteDB's batch writing. In fact, the lock approach is surprisingly faster than using no locks on version 4.1.4 (where the issue does not occur). Did you manage to reproduce this issue? It would be great to share it with the LiteDB team. Would love an update, and I appreciate the help! – Ungoliant Aug 09 '23 at 09:09
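The async-lock workaround described above could look roughly like this (a sketch using SemaphoreSlim as a plain async mutex rather than a full reader/writer lock):

// One writer at a time; Insert is the existing synchronous method from the question.
private static readonly SemaphoreSlim _dbLock = new(1, 1);

public async Task<BsonValue> InsertAsync(Message message)
{
    await _dbLock.WaitAsync();
    try
    {
        return Insert(message);
    }
    finally
    {
        _dbLock.Release();
    }
}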

0 Answers