
Problem Statement

I'm currently building an exchange scraper with three tasks, each running in its own process:

  • #1: Receive the live web feed: data comes in very fast; each message is immediately put on a multiprocessing Queue and the loop moves on.
  • #2: Consume the queue and optimize: pull data off the queue and reduce it using some logic I wrote. It's slow, but not too slow; it eventually catches up and clears the queue whenever incoming data slows down.
  • #3: Compress the feed with bz2 and upload to my S3 bucket: every hour, I compress the optimized data (to reduce file size even more) and upload it to my S3 bucket. This takes about 10-20 seconds on my machine.
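
Roughly, the layout looks like this (a heavily simplified sketch; the real feed connection, the delta-encoding logic, and the boto3 upload are stubbed out with placeholders):

import bz2
import multiprocessing as mp
import time

def feed_task(q):
    # Task #1: read the live feed as fast as possible and hand off immediately
    while True:
        msg = b"raw tick"              # stand-in for the real websocket read
        q.put(msg)                     # never block here, or the exchange drops the socket
        time.sleep(0.001)              # pacing for this sketch only; the real feed blocks on the socket

def optimize_task(q, path):
    # Task #2: consume the queue and store only the change between ticks
    while True:
        msg = q.get()
        with open(path, "ab") as f:    # currently reopened on every tick
            f.write(msg)               # stand-in for my actual reduction logic

def upload_task(path):
    # Task #3: once an hour, bz2-compress the optimized file and upload it
    while True:
        time.sleep(3600)
        with open(path, "rb") as f:
            compressed = bz2.compress(f.read())
        # upload `compressed` to my S3 bucket here (boto3 call omitted)

if __name__ == "__main__":
    q = mp.Queue()
    path = "ticks.dat"                 # hypothetical file name
    procs = [mp.Process(target=feed_task, args=(q,)),
             mp.Process(target=optimize_task, args=(q, path)),
             mp.Process(target=upload_task, args=(path,))]
    for p in procs:
        p.start()
    for p in procs:
        p.join()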

The problem I'm having is that each of these tasks needs its own parallel process. The producer (#1) can't also do the optimization (#2), otherwise it stalls the feed connection and the website kills my socket because thread #1 stops responding. The uploader (#3) can't run in the same process as task #2, otherwise the queue fills up too much and I can never catch up. I've tried that combination and it doesn't work.

This scraper works just fine on my local machine with each task in its own process, but I really don't want to spend a lot of money on a 3-core machine when this is deployed to a server. The cheapest option I found is DigitalOcean's 4 vCPU plan at $40/mo, and I'm wondering whether there's a better way than paying for four cores.

Just some stuff to note: on my base 16" MBP, Task #1 uses 99% CPU, Task #2 uses 20-30% CPU, and Task #3 sleeps until the top of the hour, so it mostly sits at 0.5-1% CPU.

Questions:

  • If I run three processes on a 2-core machine, is that effectively the same as running two processes? I know it depends on system scheduling, but does that mean the whole thing will stall during compression, or keep moving along until compression is over? It seems really wasteful to spin up (and pay for) an entire extra core that is only used once an hour, but that hourly task stalls the entire queue too much and I'm not sure how to get around that.

  • Is there any way I can keep Task #2 running while I compress my files in the same process/core?

  • If I run a bash script to do the compression instead, would that still stall the software? My computer has 6 cores, so I can't really test the server's constraints locally.

  • Are there any cheaper alternatives to DigitalOcean? I'm honestly terrified of AWS because I've heard horror stories of people getting $1,000 bills for unexpected usage. I'd rather use something more predictable, like DigitalOcean.

What I've Tried

As mentioned above, I tried combining Task #2 and Task #3 in the same process, and it ends up stalling once the compression begins. The compression is synchronous and based on the code from this thread. I couldn't find an asynchronous bz2 compressor, and I'm not sure that would even keep Task #2 from stalling.
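
Essentially, the hourly step boils down to something like this (simplified; the real file names and the S3 upload are omitted), and nothing else in that process makes progress while compress() runs:

import bz2

def compress_hourly(path):
    # one synchronous call; takes 10-20 seconds on my data and blocks
    # the queue-consuming loop for that whole time
    with open(path, "rb") as src:
        compressed = bz2.compress(src.read())
    with open(path + ".bz2", "wb") as dst:
        dst.write(compressed)
    # ...then upload the .bz2 file to S3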


PS: I really tried to avoid coming to StackOverflow with an open-ended question like this because I know these get bad feedback, but the alternative is experimenting and putting a lot of time and money on the line when, to be honest, I don't know much about cloud computing. I'd prefer some expert opinions.

JoeVictor
  • If you could list what sort of "live feed" it is you're working with, as well as generally what type of data reduction you're doing, it would be very helpful. For example: "incoming video stream with audio and chat messages reduced to keywords from chat" – Aaron Feb 04 '21 at 05:35
  • So I'm building a scraper to basically aggregate a cryptocurrency exchange's feed and store it. (I know I said social media scraper, but I thought I wasn't allowed to share that info; I just got approval from my partner to share.) It's raw order book data. I process/optimize it by only storing the change in price between ticks. I've made this part as efficient as possible. One thing I can think of is that maybe I shouldn't reopen the file on each tick, and instead keep it open until I stop writing? – JoeVictor Feb 04 '21 at 15:36
  • 100% don't keep closing and re-opening the file... it's an OS call which isn't very fast. – Aaron Feb 04 '21 at 19:46

1 Answer


bullet point #1:

Every operating system you'll run into uses preemptive scheduling to switch between processes. This should guarantee each process gets resumed at least several times a second on any remotely modern hardware (as long as the process is actually using the CPU and not waiting on an interrupt like file or socket I/O). Basically, it's not a problem at all to run even hundreds of processes on a 2-core CPU. If the total load is too much, everything will run slower, but nothing should completely stall.

bullet point #2:

Multithreading? You may find compressing/storing to be more I/O-limited, so a thread would probably be fine here. You may even see a benefit from the reduced overhead of transferring data between processes (depending on how you currently do it), since a child thread has full access to the memory space of the parent.
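
For instance, a minimal sketch of kicking the hourly compress/upload off on a background thread inside the Task #2 process (the file name is a placeholder and the S3 upload is omitted; this assumes, as above, that the step is dominated by I/O and C-level compression rather than Python bytecode):

import bz2
import threading

def compress_and_upload(path):
    # Runs on a background thread of the consumer process. The heavy lifting
    # is compression inside the bz2 C library plus network I/O for the upload,
    # so the consumer's main thread should keep draining the queue meanwhile.
    with open(path, "rb") as src, open(path + ".bz2", "wb") as dst:
        dst.write(bz2.compress(src.read()))
    # ...upload the .bz2 file to S3 here (boto3 call omitted)

# inside the consumer's loop, at the top of each hour:
threading.Thread(target=compress_and_upload,
                 args=("optimized_feed.dat",),   # hypothetical file name
                 daemon=True).start()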

bullet point #3:

A shell script is just another process, so the answer isn't too different from #1. Do test this, however, as Python's bz2 may very well be much slower than shell bzip2 (depending on how you feed it data and where it's trying to put the output)...
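
For example, something along these lines hands the compression to the system's bzip2 binary in a separate OS process (assuming bzip2 is installed on the box; the file name is a placeholder):

import subprocess

# --keep leaves the original file in place, --force overwrites any stale .bz2
proc = subprocess.Popen(["bzip2", "--keep", "--force", "optimized_feed.dat"])

# ...keep consuming the queue; later, before uploading:
proc.wait()   # make sure optimized_feed.dat.bz2 is complete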

bullet point #4:

Definitely not an appropriate question for S.O.

My recommendation:

Profile your code... Make the ingest process as efficient as possible, and send as little data between processes as possible. A process that merely reads data from a socket and hands it off for processing should use minimal CPU. The default multiprocessing.Queue isn't terribly efficient because it pickles the data, sends it through a pipe, then unpickles it at the other end. If your data can be split into fixed-size chunks, consider using a couple of multiprocessing.shared_memory.SharedMemory buffers to swap between. Chunking the data stream should also make it easier to parallelize the data consumption stage to better utilize whatever CPU resources you have.

Edit: a pseudocode-ish example of sending chunks of data via shared memory

import multiprocessing as mp
import multiprocessing.shared_memory  # submodule must be imported for mp.shared_memory to resolve
from collections import namedtuple
from ctypes import c_int8
import socket
import time

STALE_DATA = 0 #data waiting to be overwritten
NEW_DATA = 1 #data waiting to be processed

def producer_func(buffers):
    shm_objects = {}
    for buffer in buffers:
        shm_objects[buffer.shm_name] = mp.shared_memory.SharedMemory(name=buffer.shm_name, create=False)
        #buffer.state.value = 0 #value was initialized as stale at creation (data waiting to be overwritten)
    with socket.create_connection(...) as s: #however you're reading data
        while True: #for each chunk of data
            while True: #until we get an open buffer
                for buffer in buffers: #check each buffer object
                    #if buffer isn't being processed right now, and data has already been processed
                    if buffer.lock.acquire(False):
                        if buffer.state.value==STALE_DATA: 
                            shm = shm_objects[buffer.shm_name]
                            break #break out of two loops
                        else:
                            buffer.lock.release()
                else:
                    continue
                break
            
            s.recv_into(shm.buf) #put the data in the buffer
            buffer.state.value = NEW_DATA #flag the data as new
            buffer.lock.release() #release the buffer to be processed
    #when you receive some sort of shutdown signal:
    for shm in shm_objects.values(): #shm_objects is a dict, so iterate its values
        shm.close()

def consumer_func(buffers):
    shm_objects = {}
    for buffer in buffers:
        shm_objects[buffer.shm_name] = mp.shared_memory.SharedMemory(name=buffer.shm_name, create=False)
        #buffer.state.value = 0 #value was initialized as stale at creation (data waiting to be overwritten)
    while True: #for each chunk of data
        while True: #until we get a buffer of data waiting to be processed
            for buffer in buffers:
                #if buffer isn't being processed right now, and data hasn't already been processed
                if buffer.lock.acquire(False):
                    if buffer.state.value==NEW_DATA:
                        shm = shm_objects[buffer.shm_name]
                        break #break out of two loops
                    else:
                        buffer.lock.release()
            else:
                continue
            break
        process_the_data(shm.buf) #do your data reduction here
        buffer.state.value = STALE_DATA
        buffer.lock.release()
    #when you receive some sort of shutdown signal:
    for shm in shm_objects.values(): #shm_objects is a dict, so iterate its values
        shm.close()
        
Buffer = namedtuple("Buffer", ['shm_name', 'lock', 'state'])

if __name__ == "__main__":
    n_buffers = 4 # 4 buffers to swap between
    #each buffer should be bigger than you will ever expect a message to be.
    #using larger chunks is better for overhead (don't process chunks of less than a couple KiB at a time)
    shm_objects = [mp.shared_memory.SharedMemory(create=True, size=2**20) for _ in range(n_buffers)] # 1MiB buffers
    buffers = [Buffer(shm.name, mp.Lock(), mp.Value(c_int8, 0)) for shm in shm_objects] #building args for our processes
    producer = mp.Process(target=producer_func, args=(buffers, ))
    consumer = mp.Process(target=consumer_func, args=(buffers, ))
    consumer.start()
    producer.start()
    while True:
        try:
            time.sleep(1)
        except KeyboardInterrupt:
            break
    #signal child processes to close somehow
    #cleanup
    producer.join()
    consumer.join()
    for shm in shm_objects:
        shm.close()
        shm.unlink()
Aaron
  • #1: The problem is that if it's too slow, I fall behind on processing the messages. If I'm behind by more than 30 seconds, the exchange cuts my connection. I've had that happen during high-volume periods. – JoeVictor Feb 04 '21 at 15:37
  • #2: I tried using `asyncio` to do Task #1 and Task #2 in the same process. When the volume is high, the connection stalls, sadly... The problem, I think, is that there are two forms of "compression" here: one is my own "optimizing", where I convert price ticks to "difference in prices", and the other is the bz2 compression to reduce file size even further. These are Task #2 and Task #3, respectively. Both Task #1 and Task #2 are time sensitive, so I can't stall them by adding Task #3 on the same core. – JoeVictor Feb 04 '21 at 15:42
  • #3 This is actually a great idea and probably the right approach... I guess I was just being lazy by wanting everything to be in Python. – JoeVictor Feb 04 '21 at 15:45
  • *Re: Your recommendation–* This is actually what I am doing. Exchange gives me a compressed byte array that I immediately just put in a tuple `(exchange_name, data)`. For some reason this process takes ~70% of CPU1 resources. There's just so much data coming in. – JoeVictor Feb 04 '21 at 15:48
  • Will look into using `shared_memory`. Have never used it in Python, sounds similar to CUDA shared memory which I do have experience with. – JoeVictor Feb 04 '21 at 15:48
  • Shared memory is unfortunately still quite new, and there are some pitfalls in working with it, but at its base it's backed by [posix shared memory](https://man7.org/linux/man-pages/man7/shm_overview.7.html) (or a win32 equivalent). The general idea is minimal-copy processing, which reduces overhead. I'll work on a short pseudocode-ish example. – Aaron Feb 04 '21 at 19:37
  • @QuantumHoneybees I have added a rough example (not exactly working, but a framework at least) of how to create several shared memory buffers and swap between them with minimal data copying. A few key factors here: each buffer must be bigger than your data chunks will ever be. They are fixed size and may be bigger or smaller than requested at creation. This also means you'll have to pass the size of each message to the `process_the_data` function, which I did not cover (probably use another `mp.Value` for each `Buffer` namedtuple?). – Aaron Feb 04 '21 at 21:06
  • Each `shm` object has a `.buf` attribute, which is a memoryview of the underlying chunk of memory. I'm just guessing this is compatible with `socket.recv_into`, but it doesn't matter that much because you'll have to implement however you're getting your data anyway. You should endeavor, however, to do something similar, so that as little copying as possible happens (read directly into the buffer rather than copying in data you've already read). – Aaron Feb 04 '21 at 21:11
  • I am incredibly thankful for this idea and pseudocode, truly. One question (correct me if I'm wrong): isn't this still somewhat linear execution? In a way, this is creating a "queue" of `n_buffers` buffers to pass data through. If the consumer doesn't run `process_the_data` fast enough, the producer has to wait for a `STALE_DATA` buffer? I guess I'd have to add the received bytes to a local queue on the consumer end and then immediately free up the buffers, correct? PS: this implementation is great because it means new exchanges can scale up in a lot of ways without affecting performance! – JoeVictor Feb 04 '21 at 21:50
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/228283/discussion-between-aaron-and-quantumhoneybees). – Aaron Feb 05 '21 at 00:18