Questions tagged [hpc]

High Performance Computing (HPC) refers to the use of supercomputers and computer clusters to solve a wide range of computationally intensive problems.

Systems with benchmark performance of hundreds of teraflops are usually considered to be supercomputers. A typical feature of these supercomputers is their large number of computing nodes, typically in the range of O(10^3) to O(10^6). This distinguishes them from small-to-midsize computing clusters, which usually have O(10) to O(10^2) nodes.

When writing software that aims to make effective use of these resources, a number of challenges arise that are usually not present when working on single-core systems or even small clusters:


Higher degree of parallelization required

According to Amdahl's Law, the maximum speedup you can achieve on a parallel computer is limited by the fraction of your code that remains serial (i.e. the parts that cannot be parallelized). That means the more processors you have, the better your parallelization strategy has to be. Overhead-aware re-formulations of the law additionally account for the add-on costs of parallel execution: process-spawning overhead, serialization/deserialization of parameters and results, communication, and atomicity-of-work effects on shared resources. Comparisons that include these add-on costs reflect much more closely the net speedup an actual parallel run will deliver.
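
To make the effect concrete, here is a minimal Python sketch of the classical speedup formula S(N) = 1 / (s + (1 - s)/N) together with an overhead-adjusted variant; the particular overhead model (a fixed setup cost plus a per-process cost, both expressed as fractions of the serial runtime) is an illustrative assumption, not a standard formulation.

```python
# Classical Amdahl's Law vs. an overhead-adjusted estimate.
# The overhead model below is illustrative, not a standard formula.

def amdahl_speedup(serial_fraction, n_procs):
    """Classical Amdahl's Law: S(N) = 1 / (s + (1 - s) / N)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_procs)

def overhead_adjusted_speedup(serial_fraction, n_procs,
                              setup_cost=0.01, per_proc_cost=0.001):
    """Same law, with add-on costs expressed as fractions of the
    original single-process runtime."""
    parallel_time = (serial_fraction
                     + (1.0 - serial_fraction) / n_procs
                     + setup_cost
                     + per_proc_cost * n_procs)
    return 1.0 / parallel_time

if __name__ == "__main__":
    for n in (10, 100, 1000, 10000):
        print(n,
              round(amdahl_speedup(0.05, n), 1),
              round(overhead_adjusted_speedup(0.05, n), 1))
```

Even with only 5% serial code, the ideal speedup saturates around 20x, and once per-process overheads are included, adding more processes eventually makes the run slower again.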


Specialized hardware and software

Most supercomputers are custom-built and use specialized components for hardware and/or software, i.e. you have to learn a lot about new types of architectures if you want to get maximum performance. Typical examples are the network hardware, the file system, or the available compilers (including compiler optimization options).


Parallel file I/O becomes a serious bottleneck

Good parallel file systems handle multiple requests in parallel rather well. However, there is a limit to it, and most file systems do not support the simultaneous access of thousands of processes. Thus reading/writing to a single file internally becomes serialized again, even if you are using parallel I/O concepts such as MPI I/O.
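
As an illustration, here is a minimal mpi4py sketch of collective MPI I/O in which every rank writes its own contiguous block of one shared file; the file name and chunk size are arbitrary, and it assumes mpi4py plus an MPI library are installed. Even with collective calls like this, the file system on the other end may still serialize the requests once the process count grows large.

```python
# Minimal collective MPI I/O sketch with mpi4py: each rank writes its own
# contiguous block of a shared file, letting the MPI-IO layer aggregate
# the requests. File name and chunk size are illustrative.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Each rank produces a same-sized chunk of data.
chunk = np.full(1024, rank, dtype=np.float64)

fh = MPI.File.Open(comm, "output.dat",
                   MPI.MODE_WRONLY | MPI.MODE_CREATE)
offset = rank * chunk.nbytes      # byte offset of this rank's block
fh.Write_at_all(offset, chunk)    # collective write, all ranks participate
fh.Close()
```

Run with, for example, `mpirun -n 4 python write_shared.py` (the script name is illustrative).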


Debugging massively parallel applications is a pain

If you have a problem in your code that only appears when you run it with a certain number of processes, debugging can become very cumbersome, especially if you are not sure where exactly the problem arises. Examples of process-count-dependent problems are domain decomposition or the establishment of communication patterns.
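
One low-tech but widely used approach is to have every rank report its host and PID, then hold a chosen rank until you can attach a debugger or inspect it. The sketch below uses mpi4py; the DEBUG_RANK environment variable and the sentinel file path are illustrative names, not a standard convention.

```python
# Rank-tagged diagnostics for debugging process-count-dependent problems.
# Assumes mpi4py; DEBUG_RANK and the sentinel path are illustrative.
import os
import socket
import time
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Every rank reports where it is running, so you know what to attach to.
print(f"rank {rank}: host={socket.gethostname()} pid={os.getpid()}", flush=True)

if os.environ.get("DEBUG_RANK") == str(rank):
    # Block this rank until the sentinel file is created by hand,
    # giving you time to attach a debugger or inspect the process.
    while not os.path.exists("/tmp/continue_debug"):
        time.sleep(1)

comm.Barrier()
# ... rest of the application ...
```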


Load balancing and communication patterns matter (even more)

This is similar to the first point. Assume that one of your computing nodes takes a little bit longer (e.g. one millisecond) to reach a certain point where all processes have to be synchronized. If you have 101 nodes, you only waste 100 * 1 millisecond = 0.1 s of computational time. However, if you have 100,001 nodes, you already waste 100 s. If this happens repeatedly (e.g. every iteration of a big loop) and if you have a lot of iterations, using more processors soon becomes non-economical.
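
A quick back-of-the-envelope helper (plain Python, illustrative only) reproduces the numbers above: one straggler arriving 1 ms late at a barrier costs roughly one millisecond of idle time on every other rank, every time it happens.

```python
# Idle core-seconds caused by one straggler at a synchronization point.
def wasted_core_seconds(n_ranks, straggler_delay=1e-3, n_iterations=1):
    return (n_ranks - 1) * straggler_delay * n_iterations

print(wasted_core_seconds(101))        # ~0.1 s per iteration
print(wasted_core_seconds(100_001))    # ~100 s per iteration
print(wasted_core_seconds(100_001, n_iterations=10_000))  # ~1e6 core-seconds
```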


Last but not least, the power budget

Thermal ceilings and power-capping strategies are another dimension of performance tuning; in the end, it is end-to-end performance that counts. Thermal and electrical-power constraints add a further set of parameters that determine how efficiently HPC workloads can be run within a time-limited and power-capped computing infrastructure. Because the trade-offs vary so much between workloads and machines, the optimal thermal and power-capping configuration is rarely intuitive. For repeatedly run workloads (e.g. weather modelling), these settings are therefore typically adjusted iteratively as operational experience accumulates, since sufficiently extensive prior testing is usually not feasible.

1502 questions
15 votes, 3 answers
Containerize a conda environment in a Singularity container
I've come across several instances where it would be really helpful to containerize a conda environment for long-term reproducibility. As I'm normally running in high-performance computing systems, they need to be Singularity containers for security…
LucasBoatwright

15 votes, 3 answers
Unable to use all cores with mpirun
I'm testing a simple MPI program on my desktop (Ubuntu LTS 16.04/ Intel® Core™ i3-6100U CPU @ 2.30GHz × 4/ gcc 4.8.5 /OpenMPI 3.0.0) and mpirun won't let me use all of the cores on my machine (4). When I run: $ mpirun -n 4 ./test2 I get the…
James Smith

15 votes, 2 answers
HPC programming language relying on implicit vectorization
Are there programming languages or language extensions that rely on implicit vectorization? I would need something that make aggressive assumptions to generate good DLP/vectorized code, for SSE4.1, AVX, AVX2 (with or without FMA3/4) in single/double…
diegor

15 votes, 4 answers
When is a program limited by the memory bandwidth?
I want to know if a program that I am using and which requires a lot of memory is limited by the memory bandwidth. When do you expect this to happen? Did it ever happen to you in a real-life scenario? I found several articles discussing this…
hanno

14 votes, 1 answer
How Do I Attain Peak CPU Performance With Dot Product?
Problem I have been studying HPC, specifically using matrix multiplication as my project (see my other posts in profile). I achieve good performance in those, but not good enough. I am taking a step back to see how well I can do with a dot product…
matmul

14 votes, 6 answers
Using celery to process huge text files
Background I'm looking into using celery (3.1.8) to process huge text files (~30GB) each. These files are in fastq format and contain about 118M sequencing "reads", which are essentially each a combination of header, DNA sequence, and quality…
Chris F.

14 votes, 2 answers
How to pin threads to cores with predetermined memory pool objects? (80 core Nehalem architecture 2Tb RAM)
I've run into a minor HPC problem after running some tests on a 80core (160HT) nehalem architecture with 2Tb DRAM: A server with more than 2 sockets starts to stall a lot (delay) as each thread starts to request information about objects on the…
root-11

14 votes, 4 answers
Submit jobs to a slave node from within an R script?
To get myscript.R to run on a cluster slave node using a job scheduler (specifically, PBS) Currently, I submit an R script to a slave node using the following command qsub -S /bin/bash -p -1 -cwd -pe mpich 1 -j y -o output.log ./myscript.R Are…
David LeBauer

14 votes, 1 answer
Shared Library bottleneck on NUMA machine
I'm using a NUMA machine (an SGI UV 1000) to run a large number of numerical simulations at the same time, each of which is an OpenMP job using 4 cores. However, running more than around 100 of these jobs results in a significant performance hit.…
acroz

13 votes, 1 answer
What is the recommended compression for HDF5 for fast read/write performance (in Python/pandas)?
I have read several times that turning on compression in HDF5 can lead to better read/write performance. I wonder what ideal settings can be to achieve good read/write performance at: data_df.to_hdf(..., format='fixed', complib=..., complevel=...,…
Mark Horvath

13 votes, 4 answers
Do multiple CPUs compete for the same memory bandwidth?
In a multi-CPU machine, do the different CPUs compete for the same memory bandwidth, or do they access DRAM independently? In other words, if a program is memory bandwidth limited on, say, a 1-CPU 8-core system, would moving to a 4-CPU 4*8-core…
MWB

12 votes, 3 answers
Tips and tricks on improving Fortran code performance
As part of my Ph.D. research, I am working on development of numerical models of atmosphere and ocean circulation. These involve numerically solving systems of PDE's on the order of ~10^6 grid points, over ~10^4 time steps. Thus, a typical model…
milancurcic

12 votes, 1 answer
Python: How to profile code written with numba.njit() decorators
I have a fairly complex computational code that I'm trying to speed up and multi-thread. In order to optimize the code, I'm trying to work out which functions are taking the longest or being called the most. I haven't really profiled code before, so…
Yoshi

12 votes, 7 answers
F# as a HPC language
I develop a Lattice Boltzmann (Fluid dynamics) code using F#. I am now testing the code on a 24 cores, 128 GB memory server. The code basically consists of one main recursive function for time evolution and inside a…
Oldrich Svec

12 votes, 2 answers
Log files in massively distributed systems
I do a lot of work in the grid and HPC space and one of the biggest challenges we have with a system distributed across hundreds (or in some case thousands) of servers is analysing the log files. Currently log files are written locally to the disk…
John Channing