Problem Statement
I'm currently building an exchange scraper with three tasks, each running on its own process (a rough sketch follows the list):
- #1: Receive a live webfeed: very fast data coming in, immediately put into a `multiprocessing.Queue`, then continue.
- #2: Consume queue data and optimize: consume the queued data and optimize it using some logic I wrote. It's slow, but not too slow; it eventually catches up and clears the queue when incoming data slows down.
- #3: Compress the feed using `bz2` and upload to my S3 bucket: every hour, I compress the optimized data (to reduce file size even more) and then upload it to my S3 bucket. This takes about 10-20 seconds on my machine.
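For reference, the layout looks roughly like this; the stub functions stand in for my real feed/optimization/upload code:

```python
import multiprocessing as mp
import time

def read_feed():
    # Stub: the real code reads the exchange's live websocket.
    while True:
        yield time.time()

def optimize(msg):
    # Stub: my optimization logic.
    pass

def compress_and_upload():
    # Stub: bz2-compress the hour's optimized data and upload to S3.
    time.sleep(15)  # roughly what the real step takes on my machine

def producer(q):
    # Task #1: enqueue each message immediately and keep reading.
    for msg in read_feed():
        q.put(msg)

def consumer(q):
    # Task #2: drain the queue and run the optimization.
    while True:
        optimize(q.get())

def uploader():
    # Task #3: sleep until the top of the hour, then compress + upload.
    while True:
        time.sleep(3600 - time.time() % 3600)
        compress_and_upload()

if __name__ == "__main__":
    q = mp.Queue()
    procs = [mp.Process(target=producer, args=(q,)),
             mp.Process(target=consumer, args=(q,)),
             mp.Process(target=uploader)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```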
The problem I'm having is that each of these tasks needs its own parallel process. The producer (#1) can't also do the optimization (#2); otherwise it stalls the feed connection and the website kills my socket because task #1 stops responding. The uploader (#3) can't run on the same process as task #2; otherwise the queue fills up too much and I can never catch up. I've tried it: doesn't work (see "What I've Tried" below).
This scraper works just fine on my local machine with each task on its own process, but I really don't want to spend a lot of money on a 3-core machine when this is deployed to a server. The cheapest option I found is DigitalOcean's 4 vCPU droplet at $40/mo, and I was wondering whether there's a better way than paying for 4 cores.
Just some stuff to note: on my base 16" MBP, Task #1 uses 99% CPU, Task #2 uses 20-30% CPU, and Task #3 sleeps until the top of the hour, so it mostly uses 0.5-1% CPU.
Questions:
1. If I run three processes on a 2-core machine, is that effectively the same as running two processes? I know it depends on the system scheduler, but does that mean the other tasks will stall during compression, or will everything keep moving along until the compression is over? It seems really wasteful to spin up (and pay for) an entire extra core that's only used once an hour, but that hourly task stalls the queue too much and I'm not sure how to get around that.
2. Is there any way I can keep Task #2 running while I compress my files on the same process/core?
3. If I run a bash script to do the compression instead, would that still stall the software? (A non-blocking way to launch it is sketched after this list.) My computer has 6 cores, so I can't really test the server's constraint locally.
4. Are there any cheaper alternatives to DigitalOcean? Honestly, AWS terrifies me because I've heard horror stories of people getting $1,000 bills from unexpected usage. I'd rather have something more predictable, like DigitalOcean.
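Regarding question 3: my understanding is that launching the script with `subprocess.Popen` doesn't block the Python process at all; the OS schedules the script as a separate process. On a 2-core box the compression would still compete with Tasks #1 and #2 for CPU time, but it shouldn't freeze them. A minimal sketch, where `compress_upload.sh` is a hypothetical script wrapping the bz2 + S3 steps:

```python
import subprocess

def kick_off_compression():
    # Start the compression/upload script as a separate OS process and
    # return immediately; the queue-draining loop keeps running while
    # the script works. On a 2-core machine they still compete for CPU
    # time, but the scheduler interleaves them rather than stalling.
    # "compress_upload.sh" is a hypothetical wrapper around bz2 + S3.
    return subprocess.Popen(["bash", "compress_upload.sh"])

# Later (e.g. on the next hourly tick) check whether it finished:
# proc.poll() returns None while the script is still running.
```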
What I've Tried
As mentioned above, I tried combining Task #2 and Task #3 in the same process, and it ends up stalling once the compression begins. The compression is synchronous, using the code from this thread. I couldn't find asynchronous bz2 compression, and I'm not sure that would even keep Task #2 from stalling.
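For concreteness, here's what a threaded version might look like if I went that route. My understanding (unverified) is that CPython's `bz2` releases the GIL while compressing, so Task #2 might keep draining the queue while a worker thread grinds away; they'd still share the core's CPU time. `upload_to_s3` and the path handling are placeholders:

```python
import bz2
import threading

def compress_and_upload(path):
    # The same synchronous bz2 compression I use now, just moved onto a
    # worker thread. If bz2 releases the GIL as I believe it does, the
    # main queue loop keeps running while this works.
    with open(path, "rb") as src, open(path + ".bz2", "wb") as dst:
        dst.write(bz2.compress(src.read()))
    # upload_to_s3(path + ".bz2")  # placeholder for the boto3 upload

def hourly_tick(path):
    # Fire and forget: start the compression thread, then go straight
    # back to consuming the queue.
    threading.Thread(target=compress_and_upload,
                     args=(path,), daemon=True).start()
```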
PS: I really tried to avoid coming to Stack Overflow with an open-ended question like this, because I know they tend to get bad feedback, but the alternative is putting a lot of time and money on the line through trial and error when, to be honest, I don't know much about cloud computing. I'd prefer some expert opinions.