Questions tagged [hpc]

High Performance Computing (HPC) refers to the use of supercomputers and computer clusters to solve a wide range of computationally intensive problems.

Systems with benchmark performance in the hundreds of teraflops are usually considered supercomputers. A typical feature of these machines is a large number of computing nodes, usually in the range of O(10^3) to O(10^6). This distinguishes them from small-to-midsize computing clusters, which usually have O(10) to O(10^2) nodes.

When writing software that aims to make effective use of these resources, a number of challenges arise that are usually not present when working on single-core systems or even small clusters:


Higher degree of parallelization required

Amdahl's Law, the "classical" law of diminishing returns first formulated by Gene Amdahl in 1967, states that the maximum speedup achievable on a parallel computer is limited by the fraction of the code that must run serially (i.e. the parts that cannot be parallelized). The more processors you have, the better your parallelization concept has to be. Contemporary, overhead-strict reformulations of the law also account for the add-on costs of parallel execution: process-spawning overheads, serialization/deserialization of parameters and results, inter-process communication, and resource contention and atomicity-of-work effects. Comparisons adjusted for these add-on costs reflect the actual net speedup of parallel code execution much more closely than the classical formula alone.
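For reference, here is a minimal statement of the classical form, writing p for the parallelizable fraction of the runtime and N for the number of processors (the symbols are just the ones chosen here):

```latex
% Classical Amdahl's Law: achievable speedup with N processors
% when a fraction p of the runtime can be parallelized.
S(N) = \frac{1}{(1 - p) + \frac{p}{N}}

% The serial fraction caps the speedup even as N grows without
% bound: p = 0.95 limits the speedup to 1/(1 - 0.95) = 20x,
% no matter how many thousands of nodes the machine has.
\lim_{N \to \infty} S(N) = \frac{1}{1 - p}
```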


Specialized hardware and software

Most supercomputers are custom-built and use specialized hardware and/or software components, so you have to learn a lot about new types of architectures if you want to get maximum performance. Typical examples are the network hardware, the file system, and the available compilers (including their optimization options).
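As a small, hypothetical illustration of the compiler side (nothing here is specific to any particular machine): GCC, Clang, and the Intel compilers predefine feature macros for the instruction set they target, so a tiny probe can confirm that the flags you compiled with actually match the compute nodes you run on:

```c
/* isa_probe.c -- report the vector ISA the compiler targeted.
 * Compile it with the same flags as your application, e.g.
 * -march=<target>, then run it on a compute node to compare. */
#include <stdio.h>

int main(void) {
#if defined(__AVX512F__)
    puts("compiled for AVX-512");
#elif defined(__AVX2__)
    puts("compiled for AVX2");
#elif defined(__AVX__)
    puts("compiled for AVX");
#else
    puts("no AVX target enabled at compile time");
#endif
    return 0;
}
```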


Parallel file I/O becomes a serious bottleneck

Good parallel file systems handle multiple requests in parallel rather well. However, there is a limit, and most file systems do not support simultaneous access by thousands of processes. Reading from or writing to a single file then becomes internally serialized again, even if you are using parallel I/O concepts such as MPI I/O.
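A minimal sketch of the MPI I/O idea (the file name and buffer size are made up for the example, and error handling is omitted): each rank writes its own block of one shared file through a single collective call, yet at sufficient scale the file system behind that call can still end up serializing the requests:

```c
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    enum { N = 1024 };                     /* doubles per rank */
    double *buf = malloc(N * sizeof *buf);
    for (int i = 0; i < N; i++) buf[i] = (double)rank;

    /* All ranks open the same file collectively. */
    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "output.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY,
                  MPI_INFO_NULL, &fh);

    /* Collective write: the MPI library can aggregate requests
       before they reach the file system -- exactly the layer where
       the scaling limit described above bites. */
    MPI_Offset offset = (MPI_Offset)rank * N * sizeof(double);
    MPI_File_write_at_all(fh, offset, buf, N, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(buf);
    MPI_Finalize();
    return 0;
}
```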


Debugging massively parallel applications is a pain

If you have a problem in your code that only appears when you run it with a certain number of processes, debugging can become very cumbersome, especially if you are not sure where exactly the problem arises. Typical examples of process-count-dependent problems are domain decomposition and the establishment of communication patterns; a sketch of such a bug follows below.
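As a hypothetical illustration (not taken from any of the questions below): a block decomposition based on plain integer division is correct whenever the process count divides the problem size evenly, and silently drops cells otherwise, so the bug only surfaces for particular values of -np:

```c
/* With n = 1000, "mpirun -np 8" loses nothing (1000 = 8 * 125),
 * but "mpirun -np 16" silently drops 1000 % 16 = 8 cells.       */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n = 1000;          /* global number of grid cells   */

    int buggy = n / size;        /* BUG: drops n % size cells     */

    /* Correct variant: spread the remainder over the first ranks. */
    int base  = n / size, rem = n % size;
    int count = base + (rank < rem ? 1 : 0);
    int start = rank * base + (rank < rem ? rank : rem);

    printf("rank %3d: buggy count %d, correct range [%d, %d)\n",
           rank, buggy, start, start + count);

    MPI_Finalize();
    return 0;
}
```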


Load balancing and communication patterns matter (even more)

This is similar to the first point. Assume that one of your computing nodes takes a little longer (e.g. one millisecond) to reach a point where all processes have to synchronize. With 101 nodes, you waste only 100 × 1 ms = 0.1 s of aggregate compute time; with 100,001 nodes, you already waste 100 s. If this happens repeatedly (e.g. in every iteration of a big loop) and you have many iterations, using more processors quickly becomes uneconomical. The arithmetic generalizes as shown below.
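Writing N for the node count, δ for the straggler's delay, and k for the number of synchronization points (the symbols are just the ones chosen here):

```latex
% Aggregate compute time lost when, at each of k synchronization
% points, one node arrives \delta late and N - 1 nodes wait for it:
T_{\text{wasted}} = k \,(N - 1)\, \delta

% Example from the text: N = 100001, \delta = 1\,\text{ms}, k = 1
% gives 10^5 \times 1\,\text{ms} = 100\,\text{s} of node time.
```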


Last but not least: power and thermal constraints

Thermal ceilings and power-capping strategies add another dimension to the fine-tuning arena, and end-to-end performance is what ultimately rules. Thermal and electric-power constraints pose another set of parameters that decide how efficiently HPC workloads can be computed within a time-constrained and power-capped physical computing infrastructure. Because the trade-offs differ in many ways between machines and workloads, there is no easily comprehensible choice of the optimum thermal and power-capping configuration for distributing a workload over the infrastructure; it is mostly the contrary, i.e. counter-intuitive. Repeated workloads (weather modelling is a typical example) therefore adapt these settings as experience is gathered, since no prior testing extensive enough to be decisive was possible.

1502 questions
6 votes · 1 answer

How to use multiple nodes/cores on a cluster with parallelized Python code

I have a piece of Python code where I use joblib and multiprocessing to make parts of the code run in parallel. I have no trouble running this on my desktop where I can use Task Manager to see that it uses all four cores and runs the code in…
derNincompoop
6 votes · 3 answers

Efficiently computing floating-point arithmetic hundreds of thousands of times in Bash

Background: I work for a research institute that studies storm surges computationally, and I am attempting to automate some of the HPC commands using Bash. Currently, the process is: we download the data from NOAA and create the command file manually,…
Jonathan E. Landrum
6 votes · 2 answers

Benefits of contiguous memory allocation

In terms of performance, what are the benefits of allocating a contiguous memory block versus separate memory blocks for a matrix? I.e., instead of writing code like this: char **matrix = malloc(sizeof(char *) * 50); for(i = 0; i < 50; i++) …
wolfPack88
6 votes · 1 answer

SunGridEngine, Condor, Torque as Resource Managers for PVM

Anyone have any idea which Resource manager is good for PVM? Or should I not have used PVM and instead relied on MPI (or any version of it, such as MPICH-2 [are there any other ones that are better?]). Main reason for using PVM was because the…
Tyug
6 votes · 1 answer

Fastest computation of sum x^5 + x^4 + x^3 + … + x^0 (Bitwise possible?) with x=16

For a tree layout that takes benefit of cache line prefetching (reading _next_ cacheline is cheap), I need to solve the address calculation in a really fast way. I was able to boil down the problem to: newIndex = nowIndex + 1 +…
user1610743
6 votes · 1 answer

Weak vs Strong Scaling Speedup and Efficiency

I have a theoretical question. As you know, for the analysis of scaling, the speedup is defined as S(N) = T(1) / T(N) where T(i) is the runtime with i processors. The efficiency is then defined as E(N) = S / N. These definitions make perfect sense…
Chris
5 votes · 1 answer

Does anyone have an example where _mm256_stream_load_si256 (non-temporal load to bypass cache) actually improves performance?

Consider massively SIMD-vectorized loops on very large amounts of floating point data (hundreds of GB) that, in theory, should benefit from non-temporal ("streaming", i.e. cache-bypassing) loads/stores. Using non-temporal stores (_mm256_stream_ps)…
zx-81
5 votes · 2 answers

How to successfully compile mpi4py using MS HPC Server 2008 R2's MPI stack?

So the story goes: I need an MPI wrapper for Python. I know there's mpi4py. For the current work I (mostly) use Python and Windows; I'd like to use the Microsoft HPC Cluster Pack, having access to a few pretty "strong" machines running Win 2008…
5 votes · 2 answers

Which AVX and march should be specified on a cluster with different architectures?

I'm currently trying to compile software for use on an HPC cluster using Intel compilers. The login node, which is where I compile and prepare the computations, uses Intel Xeon Gold 6148 processors, while the compute nodes use either Haswell-…
Wulle
5 votes · 2 answers

Callback functions in Chapel

I have the following Chapel code. proc update(x: int(32)) { return 2*x; } proc dynamics(x: int(32)) { return update(x); } writeln(dynamics(7)); I would like to send some kind of callback to dynamics, like proc update(x: int(32)) { return…
Brian Dolan
5 votes · 2 answers

File ownership and permissions in Singularity containers

When I run singularity exec foo.simg whoami I get my own username from the host, unlike in Docker where I would get root or the user specified by the container. If I look at /etc/passwd inside this Singularity container, an entry has been added to…
rgov
5 votes · 2 answers

snakemake cluster script ImportError snakemake.utils

I have a strange issue that comes and goes randomly and I really can't figure out when and why. I am running a snakemake pipeline like this: conda activate $myEnv snakemake -s $snakefile --configfile test.conf.yml --cluster "python $qsub_script"…
soungalo
5 votes · 1 answer

How to clear a PBS job dependency using qalter?

Say I sent a job with a dependency using qsub -W depend=afterok:JOBID to the cluster, how do I clear it with the qalter command (using the PBSpro scheduler)? I found some info in the qalter man page, but couldn't find how to clear it, just how to create a…
IsoBar
5 votes · 1 answer

How do I get meaningful results from gprof on an MPI code?

I am optimising an MPI code and I am working with gprof. The problem is that the results I obtained are completely unreasonable. My workflow is the following: compiling the code with -pg added as a compilation flag, running the code mpirun -np Nproc…
John Snow
5 votes · 1 answer

PBS jobs inter-dependency: one job starts, cancel others

I would like to submit a simulation to several queues on my cluster. As soon as one queue would start it, it would be cancelled on the others. I understand it is potentially ill-defined as several jobs could start at the same time on several…
user1824346