
I've got a challenge that, as it stands right now, is hitting the limits of cloud computing in terms of IOPS and CPU. The long-term plan is to bring these systems in-house, but in the meantime I think the architecture can be changed to make better use of the resources available.

App A writes anywhere from 100 to 200+ files per second to the file system. This file system used to be remotely mounted, but the files are now written locally to get the most IOPS we can. We are presently writing to block storage at about 200-300 MB/s.

App B remotely mounts this file system, parses the files, and pushes the data into a MySQL DB; once a file has been processed, it is deleted. This app is extremely CPU intensive, and we are working on a rewrite in a more efficient, multithreaded language.

We are working on making the parsers more efficient, but in the interim we need to find a way to improve the whole write/read process.

If I have more than 10 parse servers working on the files, they generate enough I/O wait on App A's server to tip it over. If we use a central file server instead, it can't handle the IOPS either, causing extremely high load averages.

Are there better options than writing to and reading from a file system?

I'm limited to cloud-based product offerings right now, and scaling out our present solution to where we need to be will cost us over $1M/yr.

brenden
  • ServerFault is not really the best Q&A platform to ask about system design and software architecture; I think [Software Engineering](https://softwareengineering.stackexchange.com/tour) might be a better place to get an answer to your question *"What are better options than writing/reading from a file system?"*, as there are probably many alternatives to using a file system as a queue. – HBruijn Oct 07 '17 at 18:36
  • If you're on AWS you could look into Amazon SQS – Aayush Agrawal Oct 07 '17 at 18:54
  • @HBruijn I am new to ServerFault, so please pardon my post. However, is an infrastructure-based question appropriate for Software Engineering? – brenden Oct 10 '17 at 16:39
  • @BrendenMcEwan I didn't close or downvote your question, because there might be an infra solution other than deciding that your application currently isn't a *"cloud workload"*: you could either throw more/better/bigger servers and higher-speed interconnects at the problem, or rewrite the code so it scales well in the cloud, for instance by using a distributed queue and software-defined storage. But I think (and the answers below seem to support this) that a rewrite of your code is indeed needed, and that may get you better advice from Software Engineering SE. – HBruijn Oct 11 '17 at 07:12

2 Answers


This sounds like an AWS Architect Pro exam question. It seems fairly straightforward to solve the scale and price concerns. There are many options; here's the first one that came to mind.

If you'd said what cloud you were using you'd probably get better advice. Most clouds offer similar features, so you're probably ok whichever one you use. You can use AWS S3 and SQS no matter what cloud you're in, but you should use the features native to your cloud to keep costs down. Bandwidth can be expensive and latency could make a difference.

  1. Have the writing application store files in a private S3 bucket. S3 will scale as high as you need, but be careful with your file naming: get it wrong and you will bottleneck yourself (see AWS's guidance on S3 request-rate performance and key naming).
  2. Put a message onto an SQS queue with the location of the file on S3, plus any other commands (a minimal sketch of steps 1, 2, and 4 follows this list).
  3. Set up an RDS database if you need a database.
  4. Have an auto-scaling group of spot instances that reads from the queue and processes the files. Have it scale on queue size, which is a built-in metric. If your application isn't threaded and you can only run one instance per server, use many small instances.
  5. You could have a second auto-scaling group of on-demand instances that scales up at higher thresholds than the spot instance group. This is probably a bit fiddly, and I'm not 100% sure how to do it.
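As promised above, here's a minimal sketch of steps 1, 2, and 4 in Python with boto3. The bucket name, queue URL, and `parse_file` function are placeholders for your own, not anything prescribed by AWS:

```python
import json
import boto3

# Placeholder names -- substitute your own bucket and queue URL.
BUCKET = "my-ingest-bucket"
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/parse-jobs"

s3 = boto3.client("s3")
sqs = boto3.client("sqs")

def parse_file(data: bytes) -> None:
    """Stub standing in for your parser / MySQL insert."""

def publish(local_path: str, key: str) -> None:
    """App A side: upload a file to S3 and enqueue its location."""
    s3.upload_file(local_path, BUCKET, key)
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps({"bucket": BUCKET, "key": key}),
    )

def consume() -> None:
    """App B side: long-poll the queue, fetch, parse, then clean up."""
    while True:
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL,
            MaxNumberOfMessages=10,
            WaitTimeSeconds=20,  # long polling reduces empty receives
        )
        for msg in resp.get("Messages", []):
            loc = json.loads(msg["Body"])
            obj = s3.get_object(Bucket=loc["bucket"], Key=loc["key"])
            parse_file(obj["Body"].read())
            s3.delete_object(Bucket=loc["bucket"], Key=loc["key"])
            # Delete the message only after a successful parse, so a
            # crashed worker's message becomes visible again later.
            sqs.delete_message(
                QueueUrl=QUEUE_URL,
                ReceiptHandle=msg["ReceiptHandle"],
            )
```

For the scaling in step 4, the queue depth is exposed to CloudWatch as the ApproximateNumberOfMessagesVisible metric, which the auto-scaling group can use as its scaling signal.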

Using spot instances and S3 rather than on-demand instances and file systems, I expect your bill to drop significantly. It will take a little development work to use SQS and S3, but not that much; the APIs are good and there are many examples around.

Tim
  • Thanks @Tim. We are a bit fragmented right now in who we use for our cloud services. We are looking to consolidate but aren't quite there yet. Thank you for the suggestions; I will be looking into SQS a bit deeper. – brenden Oct 10 '17 at 16:40

Instead of writing many individual files, perhaps you could send those chunks of data to one process (or a cluster) that writes them in series to some sort of archive file; tar might be suitable. Writing 300 MB/s to a single file is not much load, even on an HDD.
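A rough sketch of that single-writer idea in Python, assuming the chunks all arrive in one process; `tarfile` supports appending to an uncompressed archive:

```python
import io
import tarfile
import time

# A hypothetical single-writer process: it receives chunks of data
# (over a socket, pipe, queue, etc.) and appends each one to a single
# archive, turning many small random writes into sequential I/O.
def append_chunk(archive_path: str, name: str, data: bytes) -> None:
    info = tarfile.TarInfo(name=name)
    info.size = len(data)
    info.mtime = int(time.time())
    # Mode "a" appends to an uncompressed tar, creating it if needed.
    # A real writer would keep the archive open rather than reopening
    # it for every chunk.
    with tarfile.open(archive_path, "a") as tar:
        tar.addfile(info, io.BytesIO(data))

# Example usage: append two records to the day's archive.
append_chunk("ingest.tar", "record-0001", b"payload one")
append_chunk("ingest.tar", "record-0002", b"payload two")
```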

Also, look at using something other than a remote file mount. A large number of read/write users on a network file system suggests locking issues, particularly on directory nodes. You would probably be better off with a job runner on the source machine picking up files and sending them to some sort of server process, e.g. an HTTP PUT straight to the processes that write to the DB.
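As a sketch of that PUT-based approach (Python again; `parse_and_store` and the port are made up), the receiving side could be as simple as:

```python
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

def parse_and_store(name: str, data: bytes) -> None:
    """Stub standing in for the real parser and MySQL insert."""

# A hypothetical receiver: the parse process accepts files over
# HTTP PUT instead of polling a shared network mount.
class ParseHandler(BaseHTTPRequestHandler):
    def do_PUT(self):
        length = int(self.headers["Content-Length"])
        body = self.rfile.read(length)
        parse_and_store(self.path, body)
        self.send_response(204)  # no content; upload accepted
        self.end_headers()

if __name__ == "__main__":
    ThreadingHTTPServer(("0.0.0.0", 8080), ParseHandler).serve_forever()
```

App A's side is then just an HTTP client PUTting each chunk to a parse host instead of writing it to disk, e.g. `requests.put("http://parse-host:8080/record-0001", data=payload)` with a host name of your choosing.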

Take a look at the job queue offerings, e.g. RabbitMQ. It sounds like you might be doing something that would suit that sort of architecture.
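For example, a bare-bones sketch with the pika client (the broker address, queue name, and `parse_file` stub are all assumptions):

```python
import pika

def parse_file(body: bytes) -> None:
    """Stub standing in for your parser / MySQL insert."""

# Assumes a RabbitMQ broker on localhost and a made-up queue name.
connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="parse-jobs", durable=True)

# Producer (App A side): publish the file contents as a persistent message.
channel.basic_publish(
    exchange="",
    routing_key="parse-jobs",
    body=b"file contents here",
    properties=pika.BasicProperties(delivery_mode=2),
)

# Consumer (App B side): ack only after the parse succeeds, so a
# crashed worker's message is redelivered to another consumer.
def on_message(ch, method, properties, body):
    parse_file(body)
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_qos(prefetch_count=10)  # bound unacked work per consumer
channel.basic_consume(queue="parse-jobs", on_message_callback=on_message)
channel.start_consuming()
```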

mc0e
  • Thanks for the input @mc0e. I am looking into some kind of message queue to see if that can help alleviate some resource contention. – brenden Oct 10 '17 at 16:44