
I have a standard socketserver that looks a bit like:

import time
import socketserver
import threading
import io


class Handler(socketserver.StreamRequestHandler):
    def handle(self):
        return cache.handle(self.rfile, self.wfile)


class Cache:
    def __init__(self):
        self._runner = Runner(self.reload)
        self._runner.start()
        self.cache = {}

    def reload(self):
        # very long process that takes up to 90 minutes and 100 GB of RAM
        # involves calls to pyodbc and some python processing
        # here is a dummy process
        cache = {}
        for i in range(90):
            cache[str(i).encode()] = bytes(10**9)
            time.sleep(60)
        self.cache = cache

    @staticmethod
    def _send_bytes(wfile: io.BufferedIOBase, msg: bytes) -> None:
        wfile.write(len(msg).to_bytes(4, "big"))
        wfile.write(msg)

    def handle(self, rfile, wfile):
        request = rfile.readline()
        response = self.cache.get(request, b'')
        self._send_bytes(wfile, response)


class Runner(threading.Thread):
    """class that runs a process every timer seconds (1 day by default)"""
    def __init__(self, proc, timer=24*60*60):
        super().__init__(target=self._target)
        self.proc = proc
        self.event = threading.Event()
        self.timer = timer

    def _target(self):
        while not self.event.wait(self.timer):
            self.proc()


if __name__ == '__main__':
    cache = Cache()
    with socketserver.TCPServer(("0.0.0.0", 48888), Handler) as server:
        server.socket.settimeout(30)
        server.serve_forever()

Here's the problem: every time the reload runs in its own thread, the server becomes exceptionally slow at responding to requests (of which it gets a few a minute). In fact it becomes near unresponsive (the clients time out), and the whole system gets into a state where every request takes so long to answer that a backlog of clients builds up and never gets served. The server effectively stops handling requests (but the loader still runs every 24 hours).

My understanding of Python is that even though the loader thread takes a long time, it shouldn't hold the GIL like this, but that seems to be what is happening.
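For reference, a CPU-bound pure-Python thread really can delay other threads: the GIL is only handed over at switch intervals (5 ms by default, see `sys.getswitchinterval()`), so a thread that wakes up may wait roughly that long before it can run. A minimal sketch of the effect — the numbers are machine-dependent and this is a toy loop, not the real `reload`:

```python
import threading
import time


def busy() -> None:
    # CPU-bound pure-Python loop; holds the GIL except at switch intervals
    total = 0
    for i in range(5_000_000):
        total += i


# latency of a short sleep on an otherwise idle interpreter
start = time.perf_counter()
time.sleep(0.01)
idle_latency = time.perf_counter() - start

# same measurement while a CPU-bound thread competes for the GIL:
# when the sleep expires, the main thread must wait to reacquire the GIL
t = threading.Thread(target=busy)
t.start()
start = time.perf_counter()
time.sleep(0.01)
contended_latency = time.perf_counter() - start
t.join()

print(f"idle: {idle_latency * 1000:.2f} ms, "
      f"contended: {contended_latency * 1000:.2f} ms")
```

On my understanding, the contended wake-up should usually be a few milliseconds slower than the idle one, but a few milliseconds per request would not by itself explain clients timing out.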


Edit:

This is (roughly) what the reload method above calls:

import collections
from typing import List, Any, Dict

import pyodbc

CONN_STR = "foo"
QUERY_STR = "bar"

# ValueObject and grouper_func are defined elsewhere in the real code;
# minimal stubs shown here so the snippet is self-contained
class ValueObject:
    key: bytes

    @classmethod
    def build(cls, row: Any) -> "ValueObject": ...

def grouper_func(key: bytes) -> bytes: ...

def load_all() -> Dict[bytes, List[ValueObject]]:
    # this step takes ~10 minutes
    # it connects to a sql server db using pyodbc
    conn: pyodbc.Connection
    with pyodbc.connect(CONN_STR) as conn:
        cursor: pyodbc.Cursor = conn.execute(QUERY_STR)
        rows: List[Any] = cursor.fetchall()

    # this step occurs entirely in pure python (a small amount gets delegated to pandas)
    # it is almost entirely building objects/dicts/lists using information from the rows
    # it makes no external calls or performs any IO (other than logging)
    # it takes ~ 40 minutes normally
    d = {}
    for row in rows:
        value = ValueObject.build(row)
        d[value.key] = value

    results = collections.defaultdict(list)
    for k, v in d.items():
        results[grouper_func(k)].append(v)

    return results
  • You didn't implement `class Runner(threading.Thread):` correctly. Read up on [Thread Objects](https://docs.python.org/3/library/threading.html?highlight=threading%20thread#thread-objects) for which method to override and how to start a `Thread`. – stovfl Jun 15 '20 at 08:07
  • @stovfl apologies, I missed out the init of `Runner.event`. However, I can absolutely guarantee that the implementation works. Try it yourself: `t = Runner(lambda: print('hi'), 10); t.start()` – FHTMitchell Jun 15 '20 at 08:18
  • So this isn't the real code? – user207421 Jun 15 '20 at 08:26
  • @MarquisofLorne No, the real code is probably over a few thousand lines. I've tried to make the example code a minimal representation of what goes on. – FHTMitchell Jun 15 '20 at 08:27
  • But you didn't succeed. The incorrect code that starts the thread above isn't the real code. So fix that. And then ensure that the code you posted exhibits the problem you posted about. – user207421 Jun 15 '20 at 08:36
  • You'll notice I've edited the question, it is no longer incorrect. The actual implementation takes a few more params such as arguments to `self.proc` and has a slightly more complicated way of calculating the wait time, but that's irrelevant. – FHTMitchell Jun 15 '20 at 08:37
  • In that case `self._runner.run()` is still incorrect, as noted by @stovfl, and your comment above claiming to use `.start()` ditto. – user207421 Jun 15 '20 at 08:38
  • I see, fixed! Apologies for the confusion. We actually have a run method overloaded in the real case but anyway, this is now working for me. I don't think running the code will actually help you though. This is more a question of how python behaves. – FHTMitchell Jun 15 '20 at 08:45
  • So, does the code shown above actually reproduce the problem? If it doesn't, we're reduced to speculating about what your real `reload` method is doing. The only thing I can really suggest is that you learn to use strace and/or dtrace, iostat, vmstat, sar etc. to figure out what is happening in your real system. – Useless Jun 15 '20 at 08:46
  • @Useless probably not. Not very helpful I admit, but unless you want me to start mocking DB calls and building of objects in the question (even assuming your machines have a solid 100 GB of RAM), I don't think that would be useful either. – FHTMitchell Jun 15 '20 at 08:53
  • That's reasonable, but providing code that doesn't reproduce the problem _and_ has its own mistakes is just a distraction. At least you could describe what sort of Python operations and what sort of syscalls the reload operation makes. Also, if it temporarily uses 100 GB of RAM, it would be good to know how much physical memory you have. The slowdown could just be paging for all I can tell. – Useless Jun 15 '20 at 09:00
  • @Useless fair points. I'll try and edit my question to make it more clear what the `reload` method does. I suppose I was hoping I'd done something stupid in terms of the socketserver/threading (not including forgetting how to start threads in python). I have around 315 GB locally and some even more stupid amount in kubernetes and this occurs in both. – FHTMitchell Jun 15 '20 at 09:06
  • OK, so I can't think of anything obviously wrong. It could easily be that `reload` thrashes the GIL (strace and look for futex syscalls, if Linux), but it could equally be starving the sockets (with DB I/O, maybe look at queue length in netstat) or causing a lot of paging activity. – Useless Jun 15 '20 at 09:11
  • So I did the strace and I'm getting a few of these a second: `poll([{fd=9, events=POLLIN}], 1, 500) = 0 (Timeout) futex(0x55555589f938, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1592214139, tv_nsec=132885000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)`. Is that normal? – FHTMitchell Jun 15 '20 at 09:45
  • Here's a thought. Monitor the increase in TIME+ of the thread executing `reload` using `top -H -p ` from the moment it starts running a cycle till the cycle completes and also note the elapsed time. The percentage of the increase in TIME+ compared with the elapsed time gives you the amount of time GIL was held by the reload thread. If this is really high, then you know the main thread is just starved due to GIL. – singhatulks Jun 19 '20 at 06:23

0 Answers