Questions tagged [hpc]

High Performance Computing (HPC) refers to the use of supercomputers and computer clusters to solve a wide range of computationally intensive problems.

Systems with a benchmark performance of hundreds of teraflops are usually considered supercomputers. A typical feature of these supercomputers is that they have a large number of compute nodes, typically in the range of O(10^3) to O(10^6). This distinguishes them from small-to-midsize computing clusters, which usually have O(10) to O(10^2) nodes.

When writing software that aims to make effective use of these resources, a number of challenges arise that are usually not present when working on single-core systems or even small clusters:


Higher degree of parallelization required

According to the original 1967 formulation of the "classical" law of diminishing returns, better known as Amdahl's Law, the maximum speedup one can achieve on a parallel computer is limited by the fraction of serial work in your code (i.e. the parts that cannot be parallelized). That means the more processors you have, the better your parallelization concept has to be. Contemporary, overhead-aware reformulations of Amdahl's Law additionally account for the add-on costs of spawning processes, serializing/deserializing parameters and results, communication, and the atomicity (granularity) of the work items; speedup estimates that include these costs reflect the actual net benefit of parallel code execution much more closely than the classical formula.
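
As a rough, illustrative sketch (the serial fraction and the per-process overhead below are assumed values, not measurements), the classical and an overhead-aware speedup estimate can be compared in a few lines of Python:

    # Classical Amdahl speedup vs. a variant with a per-process overhead term
    def amdahl_speedup(serial_fraction, n_procs):
        # Classical Amdahl's Law: ignores all parallelization overheads
        return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_procs)

    def overhead_aware_speedup(serial_fraction, n_procs, overhead_per_proc):
        # Adds a hypothetical per-process cost (spawning, SER/DES of parameters
        # and results, communication), expressed as a fraction of the serial runtime
        return 1.0 / (serial_fraction
                      + (1.0 - serial_fraction) / n_procs
                      + overhead_per_proc * n_procs)

    for n in (10, 100, 1_000, 10_000):
        print(n, amdahl_speedup(0.05, n), overhead_aware_speedup(0.05, n, 1e-6))

With a 5 % serial fraction the classical estimate keeps improving with more processes, while the overhead-aware estimate peaks and then degrades as the per-process costs start to dominate.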


Specialized hardware and software

Most supercomputers are custom-built and use specialized hardware and/or software components, which means you have to learn a lot about new types of architectures if you want to get maximum performance. Typical examples are the network hardware, the file system, and the available compilers (including their optimization options).


Parallel file I/O becomes a serious bottleneck

Good parallel file systems handle multiple requests in parallel rather well. However, there is a limit, and most file systems do not support simultaneous access by thousands of processes. Reading from or writing to a single file then effectively becomes serialized again, even if you are using parallel I/O concepts such as MPI-IO.
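
As a hedged sketch of what such parallel I/O looks like in practice (this assumes mpi4py and NumPy are installed; the file name, block size, and data layout are made up for illustration), each rank can write its own contiguous block of a shared file with collective MPI-IO calls:

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    # Each rank writes one contiguous block of 1000 doubles into a shared file
    local = np.full(1000, rank, dtype=np.float64)
    offset = rank * local.nbytes  # byte offset of this rank's block

    fh = MPI.File.Open(comm, "output.dat", MPI.MODE_WRONLY | MPI.MODE_CREATE)
    fh.Write_at_all(offset, local)  # collective write: every rank participates
    fh.Close()

Even with collective calls like this, at very large process counts the file system itself can become the bottleneck, which is why large codes often funnel I/O through a subset of aggregator ranks.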


Debugging massively parallel applications is a pain

If you have a problem in your code that only appears when you run it with a certain number of processes, debugging can become very cumbersome, especially if you are not sure where exactly the problem arises. Typical examples of process-count-dependent problems are the domain decomposition and the establishment of communication patterns.
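
A common workaround (a sketch only; it assumes mpi4py, that you can log in to the compute nodes, and that the DEBUG_RANK variable and the marker file path are made-up names) is to pause the suspect rank so that a debugger can be attached to exactly that process:

    import os, socket, time
    from mpi4py import MPI

    rank = MPI.COMM_WORLD.Get_rank()

    # Pause one rank (chosen via an environment variable) and report where to
    # attach a debugger, e.g. `gdb -p <pid>` on the printed host
    if rank == int(os.environ.get("DEBUG_RANK", "-1")):
        print(f"rank {rank} waiting on {socket.gethostname()}, pid {os.getpid()}",
              flush=True)
        while not os.path.exists("/tmp/continue"):  # create this file to resume
            time.sleep(5)

    MPI.COMM_WORLD.Barrier()  # the other ranks wait here until debugging resumes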


Load balancing and communication patterns matter (even more)

This is similar to the first point. Assume that one of your compute nodes takes a little bit longer (e.g. one millisecond) to reach a point where all processes have to be synchronized. With 101 nodes you only waste 100 × 1 millisecond = 0.1 s of aggregate compute time while the other nodes idle; with 100,001 nodes you already waste 100 s. If this happens repeatedly (e.g. in every iteration of a big loop) and you have a lot of iterations, using more processors soon becomes uneconomical.
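
The scaling of this effect is easy to reproduce with a back-of-the-envelope calculation (the skew, node counts, and iteration count below are the illustrative numbers from the paragraph above, not measurements):

    # Aggregate compute time wasted when one straggler delays a synchronization point
    def wasted_seconds(n_nodes, skew_ms, iterations):
        # (n_nodes - 1) nodes each idle for skew_ms at every synchronization
        return (n_nodes - 1) * skew_ms * 1e-3 * iterations

    for n in (101, 100_001):
        print(n, wasted_seconds(n, skew_ms=1.0, iterations=1), "s per iteration")
    print(wasted_seconds(100_001, skew_ms=1.0, iterations=1000), "s over 1000 iterations")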


Last but not least: power and thermal constraints

Thermal ceilings and power-capping strategies are another dimension of fine-tuning, and end-to-end performance is what ultimately counts. Thermal and power-cap limits impose an additional set of parameters that decide how efficiently HPC workloads can be computed within a time-constrained and power-capped physical infrastructure. Because the trade-offs differ greatly between systems and workloads, the optimal thermal and power-capping configuration for distributing a workload over the infrastructure is rarely obvious and is often counter-intuitive. For repeated workloads (as in weather modelling) these settings are therefore typically adapted as experience is gathered, since sufficiently extensive prior testing is usually not possible.

1502 questions
3 votes, 0 answers

Calling MPI subprocess within python script run from SLURM job

I am having trouble launching a SLURM job calling a mpirun subprocess from a python script. Inside the python script (let's call it script.py) I have this subprocess.run: import subprocess def run_mpi(config_name, np, working_dir): data_path…
Betelgeuse
3 votes, 1 answer

Analysing performance of transpose function

I've written a naive and an "optimized" transpose functions for order-3 tensors containing double-precision complex numbers and I would like to analyze their performance. Approximate code for naive transpose function: #pragma omp for…
3 votes, 2 answers

Slurm Job is Running out of Memory [RAM?] but memory limit not reached

I run simulations on a hpc-cluster which are quite memory demanding. I'm fitting cmdstan models with 3000 iterations for different conditions(200 unique combinations). To do this, I'm using the simDesign package in R. The simulations run perfectly…
3 votes, 0 answers

`doParallel` vs `future` while using `Seurat` package

Here is the story. From Seurat vignette, FindMarkers() can be accelerated by utilizing future package, future::plan("multiprocess", workers = 4) However, I am running a simulation that I need to use FindAllMarkers() inside a doParallel::foreach()…
yuw444
3 votes, 1 answer

Use singularity container as python interpreter in Visual Studio Code

I am connecting to an HPC environment through VScode remote ssh and would like to run python code directly in VScode for testing purposes. I would like to set the python interpreter to a singularity container which runs python upon execution. This…
3 votes, 1 answer

Is it possible to use tensor cores and CUDA cores in a mixed way?

I have RTX2060 Nvidia graphic card which has tensor cores on it. I want to run my codel utilizing tensor cores and cuda cores in a mixed way.The idea is to have a part of the code executed by tensor cores and another part by the cuda cores, in order…
user17271389
3 votes, 2 answers

How can I load and merge several .txt files in a memory efficient way in python?

I am trying to read several (>1000) .txt files (on average approx. 700 MB, delimited, header-less CSV, without commas or other separator) and merge them into one pandas dataframe (to next run an analysis on the entire dataset). I am running this…
3 votes, 0 answers

How to run initialization commands after SSH in VS Code Remote?

Problem I am trying to connect to my school's computing cluster (aka a linux server with "login node" and "computing node") using VS Code's Remote SSH, but I cannot figure out how to run a command after SSH-ing. Goal I simply want to view Python…
3 votes, 1 answer

How to run singularity container on HPC cluster? - ERROR : Failed to create user namespace: user namespace disabled

I'm trying to launch a singularity container on a hpc cluster. I have been running the projectNetv2.sif and sandbox on my local with no issue. After exporting them to a hpc I get the following error. (singularity) [me@hpc Project]$ ls examples …
Zizi96
3 votes, 1 answer

The openmp matrix multiplication

I try to write a Openmp based matrix multiplication code. The multiplication of matrix mm and matrix mmt is diagonal matrix and equal to one. I try normal calculation and Openmp. The normal result is correct, however the Openmp result is wrong. I…
3 votes, 0 answers

Loss of precision while using OpenMP reduction

I was writing a simple subroutine in Fortran for dot product of vectors using OpenMP reduction parallelization. However its results are significantly different then dot_product and non-openMP do loop summation. Please see the code below: program…
ipcamit
3 votes, 1 answer

How to retrieve the content of slurm script?

I submitted a job several days ago, and it is still running now. But I forget the content of that script.sh that day. And the script.sh has been deleted. Do you know how to recover the content of that script?
Jingnan Jia
3 votes, 1 answer

Hostfile with Mpirun on multinode with slurm

I have two executables I would like to run in the following way: For each node I want to launch N-1 processes to exe1 and 1 exe2 On previous slurm system that worked by doing such: #!/bin/bash -l #SBATCH --job-name=XXX #SBATCH --nodes=2 #SBATCH…
ATK
3 votes, 1 answer

Is there a way to enable the avx2 instruction set without auto-vectorization by LLVM

Recently I met a problem that my avx2 optimized program may crash on old machines like 2010 mac, which does not support avx2 intruction set. At the same time, I can ensure that all my avx2 code is surrounded by dynamically detection of instruction,…
3 votes, 1 answer

slurm does not provide valid information via email?

I use #SBATCH --mail-type=end #SBATCH --mail-user=myemail@gmail.com in my script.sh to send me valid information about the job. But what I received is empty without any valid information: JOB NAME: EXIT STATUS: COMPLETED SUMBITTED ON: STARTED…
Jingnan Jia