Questions tagged [hpc]

High Performance Computing (HPC) refers to the use of supercomputers and computer clusters to solve a wide range of computationally intensive problems.

Systems with benchmark performance of hundreds of teraflops are usually considered supercomputers. A typical feature of these machines is a large number of compute nodes, typically on the order of O(10^3) to O(10^6). This distinguishes them from small-to-midsize computing clusters, which usually have O(10) to O(10^2) nodes.

When writing software that aims to make effective use of these resources, a number of challenges arise that are usually not present when working on single-core systems or even small clusters:


Higher degree of parallelization required

According to the "classical" 1967 formulation of Amdahl's Law, the maximum speedup achievable on a parallel computer is limited by the fraction of your code that must run serially (i.e. the parts that cannot be parallelized). The more processors you have, the smaller that serial fraction must be for the extra processors to pay off. Revised, overhead-strict formulations of the law additionally account for add-on costs that grow with the number of processes: process spawning, serialization/deserialization of parameters and results, inter-process communication, and resource contention due to the atomicity of work items. Once these costs are included, the predicted net speedup of truly parallel execution reflects real-world behaviour much more closely, and is often far below the classical bound.
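
The classical bound and a simple overhead-aware extension can be sketched in a few lines. This is a minimal model, not the full revised formulation: the linearly growing overhead term is an illustrative assumption, since real spawning and communication costs have their own scaling behaviour.

```python
def amdahl_speedup(serial_fraction, n_procs, overhead_per_proc=0.0):
    """Predicted speedup on n_procs processors.

    serial_fraction   -- fraction of the runtime that cannot be parallelized
    overhead_per_proc -- illustrative add-on cost (spawning, SER/DES,
                         communication) modelled as growing linearly with
                         the process count; 0 gives the classical 1967 law
    """
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction
                  + parallel_fraction / n_procs
                  + overhead_per_proc * n_procs)

# With a 5% serial fraction the classical bound is 1/0.05 = 20x no matter
# how many processors you add; any nonzero overhead pushes the achievable
# speedup below that bound.
```

With `overhead_per_proc > 0` the speedup curve has a maximum: beyond some process count the overhead term dominates and the speedup declines again, which is the practical reason the overhead-strict formulations matter on machines with O(10^5) nodes.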


Specialized hardware and software

Most supercomputers are custom-built and use specialized components for hardware and/or software, which means you have to learn a lot about new architectures if you want to get maximum performance. Typical examples are the network hardware, the file system, and the available compilers (including their optimization options).


Parallel file I/O becomes a serious bottleneck

Good parallel file systems handle multiple requests in parallel rather well. However, there is a limit, and most file systems do not support simultaneous access by thousands of processes. Reading from or writing to a single file then becomes internally serialized again, even if you are using parallel I/O concepts such as MPI I/O.
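
One widespread way around this bottleneck is the file-per-process pattern: every rank writes to its own file, so no single file becomes a serialization point (at the price of having to merge the outputs later). A minimal sketch, using Python's multiprocessing pool as a stand-in for MPI ranks; the function and file names are illustrative:

```python
import multiprocessing as mp
import os
import tempfile

def write_rank_output(rank, outdir):
    # Each "rank" writes to its own file, so writes never contend for a
    # single shared file (the file-per-process pattern).
    path = os.path.join(outdir, f"out.{rank:05d}.dat")
    with open(path, "w") as f:
        f.write(f"data from rank {rank}\n")
    return path

if __name__ == "__main__":
    outdir = tempfile.mkdtemp()
    with mp.Pool(processes=4) as pool:
        paths = pool.starmap(write_rank_output,
                             [(rank, outdir) for rank in range(4)])
    # Four independent files, no serialized access to a shared one.
```

At O(10^5) processes this pattern creates its own problem (metadata pressure from enormous numbers of files), which is why aggregation layers and collective MPI I/O exist; the sketch only illustrates the trade-off.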


Debugging massively parallel applications is a pain

If you have a problem in your code that only appears when you run it with a certain number of processes, debugging can become very cumbersome, especially if you are not sure where exactly the problem arises. Examples of process-count-dependent problems are errors in the domain decomposition or in the establishment of communication patterns.


Load balancing and communication patterns matter (even more)

This is similar to the first point. Assume that one of your compute nodes takes a little bit longer (e.g. one millisecond) to reach a certain point where all processes have to be synchronized. If you have 101 nodes, you only waste 100 × 1 millisecond = 0.1 s of computational time. However, if you have 100,001 nodes, you already waste 100 s. If this happens repeatedly (e.g. in every iteration of a big loop) and you have a lot of iterations, using more processors soon becomes uneconomical.
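
The arithmetic above generalizes to a one-line cost model. This is a deliberate simplification: it assumes a single straggler and a perfectly synchronized barrier for everyone else.

```python
def wasted_core_seconds(n_nodes, straggler_delay_s):
    # All other nodes sit idle at the barrier while the one straggler
    # catches up, so the wasted time scales linearly with the node count.
    return (n_nodes - 1) * straggler_delay_s
```

Plugging in the numbers from the text: 101 nodes with a 1 ms straggler waste 0.1 s per barrier, while 100,001 nodes waste 100 s — per iteration.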


Last but not least, the power

Thermal ceilings and power-capping strategies add another dimension to performance tuning. Temperature and power limits of the physical infrastructure constrain how efficiently HPC workloads can be scheduled within given time and electric-power budgets. The optimum thermal and power-capping configuration for distributing a workload over the infrastructure is often counter-intuitive and depends on many interacting factors, so it can rarely be determined by prior testing alone. Instead, repeatedly run workloads (e.g. weather modelling) typically adapt these settings over time as operational experience accumulates.

1502 questions
7
votes
1 answer

Locating segmentation fault for multithread program running on cluster

It's quite straightforward to use gdb to locate a segmentation fault while running a simple program in interactive mode. But consider we have a multithreaded program - written with pthreads - submitted to a cluster node (by the qsub command). So we…
Ali
  • 9,440
  • 12
  • 62
  • 92
6
votes
3 answers

How to send a message without a specific destination in MPI?

I want to send a message to one of the ranks receiving messages with a specific tag. If any rank receives the message, the message is consumed. In MPI_Recv() we can receive a message using MPI_ANY_SOURCE/MPI_ANY_TAG, but MPI_Send()…
maxwong
  • 63
  • 2
  • 7
6
votes
1 answer

4000% Performance Decrease in SYCL when using Unified Shared Memory instead of Device Memory

In SYCL, there are three types of memory: host memory, device memory, and Unified Shared Memory (USM). For host and device memory, data exchange requires explicit copying. Meanwhile, data movement from and to USM is automatically managed by the SYCL…
比尔盖子
  • 2,693
  • 5
  • 37
  • 53
6
votes
1 answer

Detect Nvidia's NVC++ (not NVCC) compiler and compiler version

I am using Nvidia's HPC compiler nvc++. Is there a way to detect that the program is being compiled with this specific compiler, and the version? I couldn't find anything in the manual https://docs.nvidia.com/hpc-sdk/index.html. Another Nvidia-related…
alfC
  • 14,261
  • 4
  • 67
  • 118
6
votes
2 answers

How to configure batchscript to parallelize R script with future.batchtools (SLURM)

I seek to parallelize an R file on a SLURM HPC using the future.batchtools package. While the script is executed on multiple nodes, it only uses 1 CPU instead of the 12 that are available. So far, I tried different configurations (c.f. code attached)…
Michael K
  • 133
  • 5
6
votes
1 answer

Tuple Concatenation in Chapel

Let's say I'm generating tuples and I want to concatenate them as they come. How do I do this? The following does element-wise addition: if ts = ("foo", "cat"), t = ("bar", "dog") ts += t gives ts = ("foobar", "catdog"), but what I really want is ts…
Tshimanga
  • 845
  • 6
  • 16
6
votes
1 answer

Do Bluegene systems support ltdl or any other kind of dlopen() support?

so I have some code that uses dlopen for loading libraries, and I want it to work on a bluegene system, but I don't have a bluegene to test things on, and I've never directly worked with one. Does bluegene support ltdl.h, or does it use something…
Sam
  • 63
  • 2
6
votes
2 answers

SLURM Submit multiple tasks per node?

I found some very similar questions which helped me arrive at a script which seems to work however I'm still unsure if I fully understand why, hence this question.. My problem (example): On 3 nodes, I want to run 12 tasks on each node (so 36 tasks…
Shiwayari
  • 315
  • 3
  • 12
6
votes
1 answer

All jobs failing in C COMPSs execution

I have downloaded COMPSs 1.4 and some test programs from http://www.bsc.es/computer-sciences/grid-computing/comp-superscalar/downloads-and-documentation and I am trying to test them. Java executions went fine; however, I am having problems with…
Adri A.P.
  • 63
  • 4
6
votes
1 answer

COMPSs Monitor doesn't show any application

I am running with COMPSs the Increment application shown in the COMPSs Sample Application Manual. I have added the -m flag to enable the monitoring feature: $ runcompss -m --debug increment.Increment 5 1 2 3 The application runs and finishes…
Cristian Ramon-Cortes
  • 1,838
  • 1
  • 19
  • 32
6
votes
1 answer

Shared disks with COMPSs

I have a cluster which has a shared disk between the different nodes. How can I configure COMP superscalar to take into account this shared disk in order to avoid file transfers?
Jorge Ejarque
  • 269
  • 1
  • 8
6
votes
0 answers

PBS : Fill up all cores in a node before going to the next node

By default PBS submits my serial jobs to all the nodes in a queue before using up more resources(cpus) from the nodes. Can I force PBS to submit my jobs to one node till it exhausts all the CPUS of that node (say 12 cpus; also given that the memory…
PyariBilli
  • 501
  • 1
  • 7
  • 17
6
votes
2 answers

Multiple levels of parallelism using OpenMP - Possible? Smart? Practical?

I am currently working on a C++ sparse matrix/math/iterative solver library, for a simulation tool I manage. I would have preferred to use an existing package, however, after extensive investigation, none were found that were appropriate for our…
MarkD
  • 4,864
  • 5
  • 36
  • 67
6
votes
8 answers

Please recommend an alternative to Microsoft HPC

We aim to implement a distributed system on a cluster, which will perform resource-consuming image-based computing with heavy storage I/O, having following characteristics: There is a dedicated manager computer node and up to 100 compute nodes. The…
Pavel Radzivilovsky
  • 18,794
  • 5
  • 57
  • 67
6
votes
2 answers

.net 4.0 Task Parallel Library vs. MPI.NET

Does the .NET 4.0 Task Parallel Library replace MPI.NET for high-performance computing? MPI.NET, found here http://www.osl.iu.edu/research/mpi.net/svn/, is a high-performance, easy-to-use implementation of the Message Passing Interface (MPI) for…
Jalal El-Shaer
  • 14,502
  • 8
  • 45
  • 51