
I recently started working on a server application built around the familiar master-worker pattern, where one privileged thread manages several worker threads. I have since realized how troublesome threads truly are.

I am now considering moving to processes instead of threads, because they would solve a lot of the issues I'm experiencing.

However, performance is a major concern: I fear it will decline as memory usage rises, with duplicated data (lookup tables, context data, etc.) contending for space in the L2/L3 caches. This data needs to be modified occasionally and may grow quite large. A simplified example:

hash_table files;

function serve_file(connection, path)
    file = files[path]                      // look up the entry for the requested path
    sendfile(connection.fd, file.fd, 0, file.size)

function on_file_added_to_server_root(which)
    files.add(which, ...)                   // table is occasionally written to

Given N worker processes, it would be a shame to end up with N copies of this table. On the other hand, some tables would be fine to duplicate in every process. And then there is plenty of malloc(3)-allocated memory that could in principle be shared, but is scattered all over the heap, so copy-on-write ends up duplicating random pages after the first write.
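
To make the shared case concrete, the best I can come up with so far is an anonymous MAP_SHARED region created before fork(2), so every worker maps the same physical pages. A minimal sketch of the idea (the file_entry layout and TABLE_CAP capacity are made up for illustration; a real table would need a shared-memory-aware hash table and allocator):

#define _DEFAULT_SOURCE         /* for MAP_ANONYMOUS on older glibc */
#include <sys/mman.h>
#include <sys/types.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical entry layout, for illustration only. */
struct file_entry {
    char  path[256];
    int   fd;
    off_t size;
};

#define TABLE_CAP 4096          /* made-up fixed capacity */

int main(void)
{
    /* One region shared across fork(2): workers map the same
     * physical pages, so there is one copy, not N. */
    struct file_entry *files = mmap(NULL, TABLE_CAP * sizeof *files,
                                    PROT_READ | PROT_WRITE,
                                    MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (files == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* Master fills the table before spawning workers. */
    snprintf(files[0].path, sizeof files[0].path, "/index.html");

    for (int i = 0; i < 4; i++) {
        if (fork() == 0) {
            /* Worker: post-fork writes by the master remain visible
             * here (unlike copy-on-write private memory), so updates
             * need synchronization, e.g. a PTHREAD_PROCESS_SHARED
             * rwlock stored in the same region. */
            printf("worker %d sees %s\n", (int)getpid(), files[0].path);
            _exit(0);
        }
    }
    return 0;
}

This covers a fixed-size, read-mostly table, but not the scattered malloc(3) data, and growing the region would take something like shm_open(3)/ftruncate(2) plus remapping in every worker.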

Are there any tricks or general strategies to keep memory usage tight in multi-process designs?

Thanks!

haste
  • There is nothing magical about threads vs. processes for problems of race conditions. Are you using a disciplined approach for thread-to-thread communication or just a shared memory free-for-all? – srking Oct 01 '12 at 22:07
  • @srking I overlooked this in my notes, sorry. These races would go away with a set of per-process lookup tables. However, there are numerous tables and other resources that are identical but written to after `fork(2)`, which ought not to be duplicated. I will clarify my question. – haste Oct 01 '12 at 23:21
  • Cache behavior varies from one CPU model to the next. For example, some Intel CPUs avoid the shared state for data in per-core caches. You may think you're doing well to share a table of data, only to have the cache lines moving from core-to-core like crazy. Are you sure your app is memory performance limited? General threading bottlenecks like waiting on locks often matter much more. Remember Knuth's quote, "premature optimization is the root of all evil" – srking Oct 02 '12 at 05:22
  • @srking Are there CPUs that cause cache line bouncing just for reading shared data? I isolate as much as possible, and these tables are the only thing they all use to conserve memory. I use lock-free code, but this table in particular is proving difficult to implement using atomics only, due to its access pattern. When comparing threads to processes, duplicated tables, generally higher memory usage, and their impact on the CPU caches are my only concern. – haste Oct 02 '12 at 10:17
  • Yes, depending on the workload, allowing 'S' state for a data cache line may perform worse than requesting the line Exclusively. I believe S state is mostly used in the icache. However, I would not worry overmuch about this. Cross snoops between cores above a shared L3 cache are fast and your workload probably won't feel a pinch. If the table is large (greater than 32KB = L1 size), then I would go with shared data to get best use from your L3 cache. – srking Oct 02 '12 at 16:36
  • @srking This is very interesting information. May I ask how you have obtained this knowledge? I'd love to delve into the nitty-gritty details of this. Thanks! – haste Oct 02 '12 at 18:46

0 Answers