Questions tagged [hpc]

High Performance Computing (HPC) refers to the use of supercomputers and computer clusters to solve a wide range of computationally intensive problems.

Systems with a benchmark performance of hundreds of teraflops are usually considered supercomputers. A typical feature of these machines is their large number of compute nodes, usually in the range of O(10^3) to O(10^6). This distinguishes them from small-to-midsize computing clusters, which usually have O(10) to O(10^2) nodes.

When writing software that aims to make effective use of these resources, a number of challenges arise that are usually not present when working on single-core systems or even small clusters:


Higher degree of parallelization required

According to the classical 1967 formulation of Amdahl's Law, the maximum speedup achievable on a parallel computer is limited by the fraction of the code that remains serial (i.e. the parts that cannot be parallelized). The more processors you have, the better your parallelization concept has to be. Modern, overhead-aware reformulations of the law additionally account for the costs of spawning processes, serializing and deserializing parameters and results, inter-process communication, and the minimum granularity (atomicity) of the work units. Including these add-on costs gives a more realistic estimate of the net speedup that truly parallel execution can deliver.
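
As a rough, purely illustrative sketch (the function name and the 1% serial fraction are assumptions, not taken from this tag wiki), the classical formula S(N) = 1 / (s + (1 - s)/N) can be evaluated directly to see the diminishing returns:

```python
def amdahl_speedup(serial_fraction: float, n_procs: int) -> float:
    """Classical Amdahl speedup: S(N) = 1 / (s + (1 - s) / N)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_procs)

# Even a 1% serial fraction caps the speedup at 100x,
# no matter how many processors are added.
for n in (10, 100, 1_000, 100_000):
    print(f"{n:>7} procs -> speedup {amdahl_speedup(0.01, n):6.1f}")
```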


Specialized hardware and software

Most supercomputers are custom-built and use specialized hardware and/or software components, which means you have to learn a lot about unfamiliar architectures if you want to get maximum performance. Typical examples are the network hardware, the file system, and the available compilers (including their optimization options).


Parallel file I/O becomes a serious bottleneck

Good parallel file systems handle multiple simultaneous requests rather well. However, there is a limit, and most file systems do not support simultaneous access by thousands of processes. Reading from or writing to a single file then becomes internally serialized again, even if you are using parallel I/O concepts such as MPI I/O.
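
A minimal sketch of collective parallel I/O, assuming mpi4py and NumPy are available (the file name and array size are illustrative):

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Each rank owns a contiguous slice of the global data set.
local = np.full(1024, rank, dtype=np.float64)

fh = MPI.File.Open(comm, "output.dat",
                   MPI.MODE_CREATE | MPI.MODE_WRONLY)
# Collective write: every rank writes its slice at its own byte offset.
offset = rank * local.nbytes
fh.Write_at_all(offset, local)
fh.Close()
```

Even with collective calls like this, the underlying file system may still serialize the requests once thousands of ranks target the same file, which is why aggregating I/O through a subset of ranks is a common workaround.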


Debugging massively parallel applications is a pain

If you have a problem in your code that only appears when you run it with a certain number of processes, debugging can become very cumbersome, especially if you are not sure where exactly the problem arises. Examples of problems that depend on the number of processes are domain decomposition and the establishment of communication patterns.
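
One common low-tech tactic, sketched here with mpi4py (the DEBUG_RANK environment variable is a hypothetical convention, not a standard one), is to tag all diagnostic output with the rank and optionally stall a single rank so a debugger can be attached to it:

```python
from mpi4py import MPI
import os
import socket
import time

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Tag every diagnostic line with rank, host and PID so output from
# thousands of processes can still be attributed.
print(f"[rank {rank}] host={socket.gethostname()} pid={os.getpid()}",
      flush=True)

# Optionally stall one rank so gdb/lldb can attach to it by PID.
if rank == int(os.environ.get("DEBUG_RANK", "-1")):
    print(f"[rank {rank}] waiting for debugger attach ...", flush=True)
    time.sleep(60)

comm.Barrier()
```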


Load balancing and communication patterns matter (even more)

This is related to the first point. Suppose one of your compute nodes takes slightly longer (say, one millisecond) to reach a point where all processes have to synchronize. With 101 nodes you waste only 100 * 1 millisecond = 0.1 s of computational time; with 100,001 nodes you already waste 100 s. If this happens repeatedly (e.g. in every iteration of a big loop) and you have many iterations, using more processors quickly becomes uneconomical.
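
The back-of-the-envelope numbers above can be reproduced with a one-line formula; this is only an illustration of the aggregate idle time at a single synchronization point:

```python
def wasted_time(n_nodes: int, delay_s: float, iterations: int = 1) -> float:
    """Aggregate idle time when one node reaches a barrier `delay_s` late:
    all other nodes sit idle for that long, every iteration."""
    return (n_nodes - 1) * delay_s * iterations

print(wasted_time(101, 1e-3))       # 0.1 s of aggregate idle time
print(wasted_time(100_001, 1e-3))   # 100 s of aggregate idle time
```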


Last but not least: power and thermal constraints

Thermal ceilings and power-capping strategies add another dimension to performance tuning, where end-to-end performance is what ultimately matters. Thermal and electrical-power constraints introduce a further set of parameters that determine how efficiently HPC workloads can be run on time- and power-limited infrastructure. Because the scenarios differ in many ways, the optimal thermal and power-capping configuration for distributing a workload across the machine is rarely obvious and is often counter-intuitive; in practice, these settings are adapted iteratively for recurring workloads (as in weather modelling) as experience accumulates, because sufficiently extensive prior testing is usually not feasible.

1502 questions
4
votes
3 answers

What is the difference between an NVIDIA Quadro 6000 and a Tesla C2075 graphics card?

I am looking into GPU computing and I can't figure out what the technical / performance differences are between an NVIDIA Quadro 6000 and an NVIDIA Tesla C2075 graphics card. They both have 6GB of RAM and the same number of computing cores. So what's…
memyself
  • 11,907
  • 14
  • 61
  • 102
4
votes
3 answers

When not to use MPI

This is not a question on a specific technical coding aspect of MPI. I am NEW to MPI, and not wanting to make a fool of myself by using the library in the wrong way, thus posting the question here. As far as I understand, MPI is an environment for…
4
votes
0 answers

Slurm: automatically shutdown job with inactive GPUs

Is it possible to configure Slurm in such a way that it automatically releases/shuts down a job with low-usage / inactive GPU(s)? So far the only option I find is to have a global job time limit but nothing smarter than that.
Leo Gallucci
  • 16,355
  • 12
  • 77
  • 110
4
votes
2 answers

Running non-python scripts remotely using PyCharm

I am using PyCharm to do remote deployment and execution of python on an SSH server. However, I would also like to be able to run other files directly in the same way. For example, I would like to "run" a "job.run" script through sbatch to submit…
Tristan Maxson
  • 229
  • 4
  • 15
4
votes
4 answers

What's the difference between code written for a desktop machine and a supercomputer?

Hypothetically speaking, if my scientific work was leading toward the development of functions/modules/subroutines (on a desktop), what would I need to know to incorporate it into a large-scale simulation to be run on a supercomputer (which might…
hatmatrix
  • 42,883
  • 45
  • 137
  • 231
4
votes
0 answers

How do you run as many tasks as will fit in memory, each running on all cores, in Windows HPC?

I'm using Microsoft HPC Pack 2012 to run video processing jobs on a Windows cluster. A run is organized as a single job with hundreds of independent tasks. If a single task is scheduled on a node, it uses all cores, but not at nearly 100%. One…
4
votes
1 answer

When exactly does SLURM export environmental variables?

With the option --export=ALL, the current environment variables should be visible to the job script when submitted as sbatch --export=ALL jobscript.sh. My question is, when exactly does SLURM do the export? Does the export happen when the job is…
Botond
  • 2,640
  • 6
  • 28
  • 44
4
votes
1 answer

Boost::Intrusive for HPC

How good is boost::intrusive library for high performance computing? I want to use a container for a non-copyable non-assignable class. I was planning to use normal STL with shared_ptr. I found out that boost::intrusive can also be used for the same…
user796530
4
votes
3 answers

Edit runscript of singularity .sif container after building

I have built a singularity container and uploaded it to my HPC service. Is there a way to change the runscript of the .sif file without rebuilding the whole container? I have a shell on the service. From my understanding of singularity this should…
Unlikus
  • 1,419
  • 10
  • 24
4
votes
1 answer

Transferring arrays/classes/records between locales

In a typical N-Body simulation, at the end of each epoch, each locale would need to share its own portion of the world (i.e. all bodies) to the rest of the locales. I am working on this with a local-view approach (i.e. using on Loc statements). I…
kianenigma
  • 1,365
  • 12
  • 20
4
votes
1 answer

Intel OpenMP library slows down memory bandwidth significantly on AMD platforms by setting KMP_AFFINITY=scatter

For memory-bound programs it is not always faster to use many threads, say the same number as the cores, since threads may compete for memory channels. Usually on a two-socket machine, fewer threads are better but we need to set an affinity policy that…
AeroD
  • 53
  • 7
4
votes
4 answers

C++ Classes for High Performance Computing

According to this Quora forum, One of the simplest rules of thumb is to remember that hardware loves arrays, and is highly optimized for iteration over arrays. A simple optimization for many problems is just to stop using fancy data structures and…
4
votes
1 answer

Efficient collection and transfer of scattered sub-arrays in Chapel

Recently, I came across Chapel. I liked the examples given in the tutorials but many of them were embarrassingly parallel in my eyes. I'm working on Scattering Problems in Many-Body Quantum Physics and a common problem can be reduced to the…
4
votes
4 answers

Efficiently print every x iterations in for loop

I am writing a program in which a certain for-loop gets iterated over many, many times. One single iteration doesn't take too long but since the program iterates the loop so often it takes quite some time to compute. In an effort to get more…
4
votes
1 answer

dask.distributed SLURM cluster Nanny Timeout

I am trying to use the dask.distributed.SLURMCluster to submit batch jobs to a SLURM job scheduler on a supercomputing cluster. The jobs all submit as expected, but throw an error after 1 minute of running: asyncio.exceptions.TimeoutError: Nanny…
Ovec8hkin
  • 65
  • 1
  • 6