
Let's say I have 100 text files

file_0.txt
file_1.txt
.
.
.
file_99.txt

and I want to read from them as fast as possible. I'm a software developer and don't have a great background in hardware. So I'm wondering if the "max degree of parallelism" is my # of CPUs? If I have 4 CPUs then should I try to read 4 files in parallel or will they read at ~1/4th the speed and not help with performance?

How about if I need to make 100 web requests and get their responses? How many whatever-hardware-port-thingys can be waiting for responses?

How can I predict the degree of parallelism to use?

  • You ask a few different questions that have different answers here. Reading files is a task that is bound by your disk drive. It can pretty much only read one file at a time, so trying to read them in parallel won't help much. Your second question about web requests is very different. Making your requests asynchronously could save a lot of time, since a lot of the time is spent waiting for a response. – burnttoast11 Dec 22 '17 at 06:57
  • When you do things in parallel, there are two kinds of bounds: either you have something that is primarily CPU-bound or something that is I/O-bound. Reading multiple files from disk is definitively I/O-bound, so it is a perfect task for `async await` (while CPU-bound work would be something for the Task Parallel Library). The main bottleneck in your case will be your HDD, so to ensure good performance here, use an SSD on a good RAID controller card with a RAID level that improves reading (e.g. level 10). – Oliver Dec 22 '17 at 07:03
  • @burnttoast11 I get that about web requests, but how many can reasonably be "in flight" at once? – user7127000 Dec 22 '17 at 07:04
  • If you handle the responses, then you will have some CPU work after each response arrives, which will put a limit on the reasonable number of requests in flight (depending on how long it takes to process a response compared to the time spent waiting for it). – Evk Dec 22 '17 at 07:13
  • this might be relevant: http://sgdev-blog.blogspot.dk/2014/01/maximum-concurrent-connection-to-same.html – Morten Bork Dec 22 '17 at 07:35
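
To make the comments concrete, here is a minimal C# sketch of the approach they describe, assuming the 100 responses are fetched over HTTP with `async await` while a `SemaphoreSlim` caps how many requests are in flight at once (the URL pattern and the cap of 8 are illustrative placeholders, not recommendations):

```csharp
// A minimal sketch of what the comments describe: start all requests
// with async/await, but cap how many are "in flight" at once with a
// SemaphoreSlim. The URL pattern and the cap of 8 are hypothetical
// placeholders, not recommendations.
using System;
using System.Linq;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

var client = new HttpClient();
using var gate = new SemaphoreSlim(8);         // at most 8 requests in flight

var tasks = Enumerable.Range(0, 100).Select(async i =>
{
    await gate.WaitAsync();                    // wait for a free "lane"
    try
    {
        // hypothetical URL pattern standing in for the real 100 requests
        return await client.GetStringAsync($"https://example.com/item/{i}");
    }
    finally
    {
        gate.Release();                        // free the lane for the next one
    }
});

string[] bodies = await Task.WhenAll(tasks);
Console.WriteLine($"fetched {bodies.Length} responses");
```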

1 Answer


Well, this is by far not a case of a true-[PARALLEL] process ( scheduling ), even if your professor or the wannabe "nerds" like to call it that.

There is no way to move 100 cars, side by side, in [PARALLEL]
across a bridge
that has just a one pure-[SERIAL] lane over a river.

As declared above, fileIO is a "just"-[CONCURRENT] process: there is no such device ( be it a spinning disk or any form of NAND/FLASH-SSD disk-emulation device ) that could read and smoothly deliver data from 100 different file locations at the very same time.

The maximum one can expect is to hide some part of the non-CPU portion of the process flow. Buffer- & controller-cache re-ordered fileIO may mask some part of the principal ~ 10 [ms] seek time ( so not more than ~ 125 seeks per second, even on RAID ), and the data flow will never go above ~ 250 [MB/s/disk] on a classical spinning disk. For a web request, the network-transport latency plus the remote process handling will always accrue from units up to small hundreds of [ms] just for the L3-TCP/IP-RTT latency, plus whatever the remote processing takes.
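
A back-of-envelope sketch, using the figures above purely as assumed constants ( not measurements ), shows why overlapping the waiting matters for web requests but cannot multiply the throughput of a single disk:

```csharp
// Back-of-envelope arithmetic from the figures above. All constants are
// assumptions taken from the text, not measurements of any real system.
using System;

const double seekMs = 10;    // ~ 10 [ms] per seek on a classical spinning disk
const double rttMs  = 100;   // assumed TCP/IP-RTT + remote processing per request
const int    items  = 100;   // 100 files / 100 web requests

// the disk head is a pure-[SERIAL] lane: seeks cannot overlap
Console.WriteLine($"disk, {items} seeks, serialised : >= {items * seekMs} ms");
// web latency is waiting, not work: it can overlap almost entirely
Console.WriteLine($"web, {items} requests, serial    :  ~ {items * rttMs} ms");
Console.WriteLine($"web, {items} requests, overlapped:  ~ {rttMs} ms");
```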

If going into the domain of high performance, one will definitely have to gain a proper understanding of the hardware, because all the high-level software constructors expect their users to understand the cons and pros ( and, in most cases, do not leave the hardware-related decisions to the user ). So, in most cases, one ought to benchmark the settings against the respective hardware platform, to identify / validate whether the respective software-constructor indeed delivers any beneficial effect on the process performance, or not -- losing way more than one receives is a very common surprise in this domain, once a blind belief or a naive implementation gets indeed benchmarked.


Q: How can I predict the degree of parallelism to use?

A:
An analytical approach -- IDENTIFY the narrowest bridge in the game:
Go as deep into the real-system hardware infrastructure the code will be deployed on as you can, so as to identify the weakest processing-chain element in the computing graph -- the very bridge with the least number of true-parallel lanes: fileIO having ~ 1 lane; a 4-core CPU having ~ 4 lanes ( possibly more than 8 lanes, if having 2 ALUs per CPU-core and doing only well-done, locality-preserving heavy number-crunching ); a 2-channel DRAM having ~ 2 lanes; etc.
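
As a toy rendering of this rule ( all lane counts except the queried CPU-core count are illustrative placeholders for one hypothetical box ):

```csharp
// A toy rendering of the "narrowest bridge" rule: the element with the
// fewest true-parallel lanes dictates the useful degree of parallelism.
// All lane counts except the queried CPU-core count are placeholders.
using System;
using System.Collections.Generic;
using System.Linq;

var lanes = new Dictionary<string, int>
{
    ["fileIO ( single spinning disk )"] = 1,
    ["CPU cores"]                       = Environment.ProcessorCount,
    ["DRAM channels"]                   = 2,
};

var bottleneck = lanes.OrderBy(kv => kv.Value).First();
Console.WriteLine($"weakest element : {bottleneck.Key} ( {bottleneck.Value} lane(s) )");
Console.WriteLine($"useful degree of parallelism <= {bottleneck.Value}");
```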

An experimental approach -- MEASURE the performance of all the possible combinations:
If not willing to spend such efforts, or if such information is not available in a sufficient level of detail for the analytical approach, one may prepare and run a set of blind, brute-force, black-box benchmarking experiments, measuring the in-vivo performance effects of controlled levels of concurrency / locally deployed fine-grain parallelism tricks. The experimental data may indicate directions that yield beneficial or adverse effects on the resulting end-to-end process performance.
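
A minimal sketch of such a sweep ( one unit of I/O-bound work is simulated here by a ~ 100 [ms] wait, standing in for a real file read or web request; the candidate concurrency levels are arbitrary choices ):

```csharp
// A brute-force, black-box sweep over concurrency levels. One unit of
// I/O-bound work is simulated by a ~ 100 [ms] wait -- a stand-in for a
// real file read or web request; the candidate levels are arbitrary.
using System;
using System.Diagnostics;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

async Task RunAtLevelAsync(int items, int maxInFlight)
{
    using var gate = new SemaphoreSlim(maxInFlight);
    var tasks = Enumerable.Range(0, items).Select(async _ =>
    {
        await gate.WaitAsync();                 // wait for a free "lane"
        try     { await Task.Delay(100); }      // simulated I/O-bound work
        finally { gate.Release(); }
    });
    await Task.WhenAll(tasks);
}

foreach (int level in new[] { 1, 2, 4, 8, 16, 32, 64, 100 })
{
    var sw = Stopwatch.StartNew();
    await RunAtLevelAsync(items: 100, maxInFlight: level);
    Console.WriteLine($"in-flight = {level,3}   elapsed = {sw.ElapsedMilliseconds,6} ms");
}
```

On this purely latency-bound simulation the elapsed time keeps shrinking as the cap grows; against a real disk or a real remote server the curve flattens ( or reverses ) at the bottleneck's true number of lanes, which is exactly what such an experiment is meant to reveal.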

Known Limitations:
There is no such thing as a repeatable, controlled experiment once you go outside of a localhost: local-area / wide-area network background-traffic envelopes, remote firewalls, remote processing node(s) and spurious intermittent workloads on any of the mediating devices all prevent an experiment from being repeatable per se. Each run is thus just one sample in what has to be a remarkably large empirical performance-testing DataSET, if the results are to have any relevance for the final decision ( 10x, 100x, 1000x repeats are not a sufficient measure, if one seriously needs to cover how the various background workloads affect the performance of each of the experimental setup combinations ). One may also need to check the remote website's Terms & Conditions, as many API providers implement daily-use limiting / rate-trimming policies, so as not to get onto their blacklist / permanent ban for violating those Terms & Conditions.


Epilogue, for a complete view & for the technology purists:
Yes, there are indeed strategies for advanced, HPC-grade processing performance that allow for circumventing this principal bottleneck, but it is not probable to have such an HPC-grade parallel filesystem implemented in a common mortal's user land, as supercomputing resources belong rather to well-financed federal / EU / government-sponsored R&D or mil/gov institutions that operate such HPC-friendly environments.

  • So what is the answer to the question the OP asks? – Evk Dec 22 '17 at 09:18
  • @Evk would you mind kindly specifying which particular one of the four questions, laid out one after another above, you would like to have addressed without going heavily into the hardware-specific details? – user3666197 Dec 22 '17 at 09:35
  • I think the last one. If the OP wants to make 100 web requests, how would he figure out the optimal number of concurrent web requests to achieve maximum throughput? Doing them one by one is not optimal because of latency and the time spent by the server to prepare a response, but issuing all 100 at once is probably also not the best way. – Evk Dec 22 '17 at 09:39
  • *there are indeed strategies for advanced, HPC grade, processing performance, that allow for circumventing this principal bottleneck* You don't really need to go that far, you just need a high-end filesystem that supports parallel IO, hardware set up to support parallel IO, both properly configured, and software written to properly take advantage of the number of parallel IO operations you've built your system to support. – Andrew Henle Dec 22 '17 at 14:48
  • *125 seeks per second* and *250 [MB/s/disk]* In my experience, both of those are extremely optimistic for hardware most people use. Most disks out there won't support half that. A consumer-grade 5,400-RPM SATA drive might be down at 40 seeks/sec. And it might reach at most 70-80 MB/s - and that's only if data is streamed in large chunks with no intervening seeks. That's the problem with parallel IO to low-end hardware - much more time gets spent in seeking. It's like your bridge analogy, but you have to run cars in both directions. – Andrew Henle Dec 22 '17 at 14:53
  • ( Right, I had forgotten the days of 5k4 RPM spindles... ) Both figures come from devices with top-ranking ambitions ( sure, O/S moves from the FAT cylinders to the partition-data cylinders and back are the first "halving" factor; the SATA-external ( but sub-SATA-internal, on-drive -- hopefully at least behind a buffer -- micro-controller design ) performance surprises are the next of the nasty adverse surprises we meet in real-world devices **:o)** ) – user3666197 Dec 22 '17 at 15:22
  • @AndrewHenle going into a true-[PARALLEL] design is indeed an expensive journey ( ref. the overhead-strict, resources-realistic, process-atomicity-respecting re-formulated Amdahl's Law ). Given that, one rather avoids any sort of waiting for fileIO at all: with systems supporting a few hundred [TB] of contiguous RAM [SPACE]-domain, the [TIME]-domain losses are many orders of magnitude smaller and may thus support going true-[PARALLEL] at an acceptable cost, yet delivering speedups >> 1.00. Using any other implementation strategy is just a waste of resources on a principally lost game in the [TIME]-domain. – user3666197 Dec 22 '17 at 15:29