Questions tagged [hpc]

High Performance Computing (HPC) refers to the use of supercomputers and computer clusters to solve a wide range of computationally intensive problems.

Systems with benchmark performance in the hundreds of teraflops are usually considered supercomputers. A typical feature of these supercomputers is their large number of compute nodes, typically in the range of O(10^3) to O(10^6). This distinguishes them from small-to-midsize computing clusters, which usually have O(10) to O(10^2) nodes.

When writing software that aims to make effective use of these resources, a number of challenges arise that are usually not present when working on single-core systems or even small clusters:


Higher degree of parallelization required

According to the classical 1967 formulation of Amdahl's Law, the maximum speedup achievable on a parallel computer is limited by the fraction of the code that must run serially (i.e. the parts that cannot be parallelized). That means the more processors you have, the better your parallelization concept has to be. Contemporary reformulations (the so-called overhead-strict Amdahl's Law) additionally account for the add-on costs of process spawning, of serializing/deserializing parameters and results, and of inter-process communication, as well as for resource limits and the atomicity of work units. These cost-adjusted comparisons reflect the actual net speedup of truly parallel code execution much more closely, since they include the costs of preparing and executing the parallel sections themselves.
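The classical formulation is easy to evaluate directly. A minimal sketch in plain Python (the function name is ours, not from any library):

```python
def amdahl_speedup(serial_fraction: float, n_processors: int) -> float:
    """Classical Amdahl's Law: speedup with n processors when a fraction
    of the work (serial_fraction) cannot be parallelized."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_processors)

# Even with only 1% serial code, the speedup saturates below 100,
# no matter how many processors are added:
for n in (10, 100, 1_000, 10_000):
    print(n, amdahl_speedup(0.01, n))
```

This is why the serial fraction, not the processor count, dominates at supercomputer scale; the overhead-strict reformulation would add further terms for spawning and communication costs.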


Specialized hardware and software

Most supercomputers are custom-built and use specialized components for hardware and/or software, i.e. you have to learn a lot about new types of architectures if you want to get maximum performance. Typical examples are the network hardware, the file system, or the available compilers (including compiler optimization options).


Parallel file I/O becomes a serious bottleneck

Good parallel file systems handle multiple requests in parallel rather well. However, there is a limit to this, and most file systems do not support simultaneous access by thousands of processes. Thus reading from or writing to a single file internally becomes serialized again, even if you are using parallel I/O concepts such as MPI I/O.


Debugging massively parallel applications is a pain

If you have a problem in your code that only appears when you run it with a certain number of processes, debugging can become very cumbersome, especially if you are not sure where exactly the problem arises. Examples of process-count-dependent problems include domain decomposition and the establishment of communication patterns.


Load balancing and communication patterns matter (even more)

This is similar to the first point. Assume that one of your compute nodes takes a little bit longer (e.g. one millisecond) to reach a certain point where all processes have to be synchronized. If you have 101 nodes, you only waste 100 * 1 millisecond = 0.1 s of computational time. However, if you have 100,001 nodes, you already waste 100 s. If this happens repeatedly (e.g. in every iteration of a big loop) and you have a lot of iterations, using more processors soon becomes uneconomical.


Last but not least, the power

Thermal ceilings and power-capping strategies are another dimension of fine-tuning, since end-to-end performance is what ultimately matters. Thermal and power constraints introduce yet another set of parameters that determine how efficiently HPC workloads can be computed within a time- and power-constrained physical infrastructure. Because of the many differences between scenarios, there is no easily comprehensible choice; the optimal thermal- and power-capping configuration for distributing a workload over the infrastructure is often counter-intuitive, and sufficiently extensive prior testing is rarely possible. Repeated workloads (as in weather modelling) therefore typically adapt these settings as experience is gathered.

1502 questions
4
votes
1 answer

Windows HPC and Azure

My company is testing/comparing various grid+cloud alternatives to perform scientific calculation. I have read the interesting white paper on the HPC/Azure topic by David Chappell, which is excellent from a "concepts" perspective, but as this is…
Mehdi LAMRANI
  • 11,289
  • 14
  • 88
  • 130
4
votes
1 answer

Change CPU count for RUNNING Slurm Jobs

I have a SLURM cluster and a RUNNING job where I have requested 60 threads by #SBATCH --cpus-per-task=60 (I am sharing threads on a node using cgroups) I now want to reduce the amount of threads to 30. $ scontrol update jobid=274332 NumCPUs=30 Job…
Mike Nathas
  • 1,247
  • 2
  • 11
  • 29
4
votes
1 answer

Make use of all CPUs on SLURM

I would like to run a job on the cluster. There are a different number of CPUs on different nodes and I have no idea which nodes will be assigned to me. What are the proper options so that the job can create as many tasks as CPUs on all…
4
votes
6 answers

is MPI widely used today in HPC?

is MPI widely used today in HPC?
Alka
  • 49
  • 2
4
votes
1 answer

Set environment variables all over processes in Julia

I'm currently working with Julia (1.0) to run some parallel code on clusters of an HPC. The HPC is managed with PBS. I'm trying to find a way for broadcasting environment variables over all processes, i.e. a way to broadcast a specific list of…
moudbis
  • 43
  • 3
4
votes
1 answer

Adding value to generic collection in class not allowed because of scope

I'm having trouble adding elements to an object that keeps a collection of generic-typed values. I tried a Minimal Working Example that causes the error: class OneElementQueue { type eltType; var elements : [0..0] eltType; …
Kyle
  • 554
  • 3
  • 10
4
votes
1 answer

Loop optimisation

I am trying to understand what cache or other optimizations could be done in the source code to get this loop faster. I think it is quite cache friendly but, are there any experts out there that could squeeze a bit more performance tuning this code?…
Manolete
  • 3,431
  • 7
  • 54
  • 92
4
votes
2 answers

Can I emulate MS Compute Cluster Server on my dev machine?

I have a project for a client that will consist of managing jobs on a MS Compute Cluster. I will be developing the application outside of their network, and would like a way to develop/debug my app without the need to be on their network. I am…
Todd
  • 620
  • 4
  • 13
4
votes
1 answer

Sparse Slicing of Sparse Arrays in Chapel

Given some A: [sps] over a sparse subdomain of a dom: domain(2), a slice A[A.domain.dim(1), k] yields the kth column as a dense 1D array. How do I retrieve the kth (n-1)-dimensional slice of a sparse nD array as a sparse (n-1)D array? var nv:…
Tshimanga
  • 845
  • 6
  • 16
4
votes
2 answers

Figuring out an (Azure) OAuth2 authorization flow for HPC command line utilities

I'm pretty new to Azure/OAuth2 so apologies if this is a simple problem. My head's spinning though and I'd appreciate some pointers. I'm developing a command line utility for use in a high performance compute cluster. This utility needs to access a…
4
votes
1 answer

Submit job with python code (mpi4py) on HPC cluster

I am working a python code with MPI (mpi4py) and I want to implement my code across many nodes (each node has 16 processors) in a queue in a HPC cluster. My code is structured as below: from mpi4py import MPI comm = MPI.COMM_WORLD size =…
Commoner
  • 1,678
  • 3
  • 19
  • 34
4
votes
0 answers

C/C++ Guaranteeing Two Memory Allocations Arrive on Two Different Sticks of RAM

I'm finding myself bandwidth-constrained in a parallel computing application, and I've profiled the program during execution. The critical data is expectedly in a contiguous line, but the memory dumps show it is always ending up entirely or mostly…
patrickjp93
  • 399
  • 4
  • 20
4
votes
5 answers

C++ std::vector for HPC?

I am translating a program that perform numeric simulations from FORTRAN to C++. I have to deal with big matrices of double of the size of 800MB. This double M[100][100][100][100]; gives a segmentation error because the stack is not so big. Using…
Nisba
  • 3,210
  • 2
  • 27
  • 46
4
votes
1 answer

File Not found exception in the master of a COMPSs application

I am running an application implemented with COMPSs and I am getting the following error in the application standard output. ... [(2016-07-27 11:47:34,255) API] - No more tasks for app 1 [ERRMGR] - WARNING: Error master local copying…
Jorge Ejarque
  • 269
  • 1
  • 8
4
votes
1 answer

Strange error in a PyCOMPSs application: Script without last "y" not found

I am trying to run one of the example pyCOMPSs application with version 1.4 and I am getting the following error, which says that the python script without the final "y" can not be found. Do you have any idea what could be the…
user6634308
  • 129
  • 4