Questions tagged [hpc]

High Performance Computing (HPC) refers to the use of supercomputers and computer clusters to solve a wide range of computationally intensive problems.

Systems with benchmark performance of hundreds of teraflops are usually considered to be supercomputers. A typical feature of these supercomputers is their large number of computing nodes, typically in the range of O(10^3) to O(10^6). This distinguishes them from small-to-midsize computing clusters, which usually have O(10) to O(10^2) nodes.

When writing software that aims to make effective use of these resources, a number of challenges arise that are usually not present when working on single-core systems or even small clusters:


Higher degree of parallelization required

According to the original 1967 formulation of Amdahl's Law, the maximum speedup achievable on a parallel computer is limited by the fraction of serial work in your code (i.e. the parts that cannot be parallelized). That means the more processors you have, the better your parallelization concept has to be. Contemporary, overhead-aware reformulations of the law additionally account for the add-on costs of parallel execution: process-spawning overheads, serialization/deserialization of parameters and results, inter-process communication, and atomicity-of-work effects on shared resources. Once these add-on costs are included, the comparison reflects the actual net speedup of parallel code execution far more closely.
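
The classical bound can be sketched in a few lines of Python (an illustration of the formula, not a benchmark): with serial fraction f and N processors, the speedup is at most 1 / (f + (1 - f) / N).

```python
def amdahl_speedup(serial_fraction, n_procs):
    """Upper bound on speedup per Amdahl's Law (classical formulation)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_procs)

# Even a 1% serial fraction caps the speedup near 100x,
# no matter how many processors you add:
for n in (10, 100, 10_000, 1_000_000):
    print(f"{n:>9} procs -> speedup <= {amdahl_speedup(0.01, n):.1f}")
```

Overhead-aware variants subtract further from this bound, since spawning, communication, and SER/DES costs themselves grow with the process count.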


Specialized hardware and software

Most supercomputers are custom-built and use specialized hardware and/or software components, which means you have to learn a lot about new types of architectures if you want to get maximum performance. Typical examples are the network hardware, the file system, and the available compilers (including their optimization options).


Parallel file I/O becomes a serious bottleneck

Good parallel file systems handle multiple requests in parallel rather well. However, there is a limit to it, and most file systems do not support the simultaneous access of thousands of processes. Thus reading/writing to a single file internally becomes serialized again, even if you are using parallel I/O concepts such as MPI I/O.


Debugging massively parallel applications is a pain

If you have a problem in your code that only appears when you run it with a certain number of processes, debugging can become very cumbersome, especially if you are not sure where exactly the problem arises. Examples of process-count-dependent problems are domain decomposition and the establishment of communication patterns.


Load balancing and communication patterns matter (even more)

This is similar to the first point. Assume that one of your computing nodes takes a little bit longer (e.g. one millisecond) to reach a certain point where all processes have to be synchronized. If you have 101 nodes, you only waste 100 * 1 millisecond = 0.1 s of computational time. However, if you have 100,001 nodes, you already waste 100 s. If this happens repeatedly (e.g. every iteration of a big loop) and if you have a lot of iterations, using more processors soon becomes non-economical.
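
The arithmetic above fits in a one-liner (a minimal sketch; the 1 ms skew is the example figure from the text): at every synchronization point, all other nodes idle until the straggler arrives.

```python
def sync_waste_seconds(n_nodes, skew_s=1e-3, iterations=1):
    """Aggregate compute time lost when one node lags by `skew_s`
    seconds at each of `iterations` synchronization points."""
    return (n_nodes - 1) * skew_s * iterations

print(sync_waste_seconds(101))                       # ~0.1 s per barrier
print(sync_waste_seconds(100_001))                   # ~100 s per barrier
print(sync_waste_seconds(100_001, iterations=1000))  # ~100,000 s over a run
```
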


Last but not least, the power

Thermal ceilings and power-capping strategies add yet another dimension to performance tuning, and end-to-end performance is what counts. Thermal and electrical-power constraints impose an extra set of parameters that determine how efficiently HPC workloads can be computed within a time- and power-constrained physical infrastructure. Because of the many differences between scenarios, the optimal thermal- and power-capping configuration for a given workload distribution is rarely obvious and is often counter-intuitive, and sufficiently extensive prior testing to settle it is usually impossible. For repeated workloads (as in weather modelling), these settings are therefore typically adapted over time as operational experience accumulates.

1502 questions
11
votes
2 answers

UPC in HPC - experience and suggestions

I am currently exploring some aspects of unified parallel C as an alternative to standard parallelization approaches in HPC (like MPI, OpenMP, or hybrid approaches). My question is: Does anyone have experience in UPC performance on large scale…
Mark
  • 1,333
  • 1
  • 14
  • 21
11
votes
14 answers

How to manipulate *huge* amounts of data

I'm having the following problem. I need to store huge amounts of information (~32 GB) and be able to manipulate it as fast as possible. I'm wondering what's the best way to do it (combinations of programming language + OS + whatever you think its…
11
votes
1 answer

Is "cudaMallocManaged" slower than "cudaMalloc"?

I downloaded CUDA 6.0 RC and tested the new unified memory by using "cudaMallocManaged" in my application. However, I found this kernel is slowed down. Using cudaMalloc followed by cudaMemcpy is faster (~0.56), compared to cudaMallocManaged…
Genutek
  • 387
  • 1
  • 5
  • 11
10
votes
7 answers

How to be able to "move" all necessary libraries that a script requires when moving to a new machine

We work on scientific computing and regularly submit calculations to different computing clusters. For that we connect using a Linux shell and submit jobs through SGE, Slurm, etc. (it depends on the cluster). Our codes are composed of python and…
Open the way
  • 26,225
  • 51
  • 142
  • 196
10
votes
2 answers

Using many mutex locks

I have a large tree structure on which several threads are working at the same time. Ideally, I would like to have an individual mutex lock for each cell. I looked at the definition of pthread_mutex_t in bits/pthreadtypes.h and it is fairly short,…
hanno
  • 6,401
  • 8
  • 48
  • 80
10
votes
2 answers

"WindowsError: [Error 206] The filename or extension is too long" after running a program very many times with subprocess

My python program prepares inputs, runs an external FORTRAN code, and processes the outputs in a Windows HPC 2008 environment. It works great, unless the code executes the external program between 1042-1045 times (Usually the problem converges…
partofthething
  • 1,071
  • 1
  • 14
  • 19
10
votes
2 answers

MPI + GPU : how to mix the two techniques

My program is well-suited for MPI. Each CPU does its own, specific (sophisticated) job, produces a single double, and then I use an MPI_Reduce to multiply the result from every CPU. But I repeat this many, many times (> 100,000). Thus, it occurred…
cmo
  • 3,762
  • 4
  • 36
  • 64
9
votes
1 answer

kubernetes with slurm, is this correct setup?

I saw that some people use Kubernetes co-existing with Slurm, and I was just curious as to why you would need Kubernetes with Slurm. What is the main difference between Kubernetes and Slurm?
zidni
  • 91
  • 1
  • 1
  • 2
9
votes
2 answers

How to ask GCC to completely unroll this loop (i.e., peel this loop)?

Is there a way to instruct GCC (I'm using 4.8.4) to unroll the while loop in the bottom function completely, i.e., peel this loop? The number of iterations of the loop is known at compilation time: 58. Let me first explain what I have tried. By…
Zheyuan Li
  • 71,365
  • 17
  • 180
  • 248
9
votes
2 answers

cannot send std::vector using MPI_Send and MPI_Recv

I am trying to send std::vector using MPI send and recv functions but I have gotten nowhere. I get errors like Fatal error in MPI_Recv: Invalid buffer pointer, error stack: MPI_Recv(186): MPI_Recv(buf=(nil), count=2, MPI_INT, src=0, tag=0,…
Abdullah
  • 314
  • 1
  • 3
  • 10
9
votes
4 answers

C++ programming for clusters and HPC

I need to write a scientific application in C++ doing a lot of computations and using a lot of memory. I have part of the job done, but due to high requirements in terms of resources I was thinking of starting to move to OpenMPI. Before doing that I have a…
Abruzzo Forte e Gentile
  • 14,423
  • 28
  • 99
  • 173
9
votes
2 answers

GNU parallel --jobs option using multiple nodes on cluster with multiple cpus per node

I am using gnu parallel to launch code on a high performance (HPC) computing cluster that has 2 CPUs per node. The cluster uses TORQUE portable batch system (PBS). My question is to clarify how the --jobs option for GNU parallel works in this…
Steve Koch
  • 912
  • 8
  • 21
9
votes
10 answers

MPI or Sockets?

I'm working on a loosely coupled cluster for some data processing. The network code and processing code are in place, but we are evaluating different methodologies in our approach. Right now, as we should be, we are I/O bound on performance issues,…
Nicholas Mancuso
  • 11,599
  • 6
  • 45
  • 47
9
votes
2 answers

Why would my parallel code be slower than my serial code?

In general, is it possible for a parallel code to be slower than the serial code? Mine is, and I am really frustrated by it! What can I do?
8
votes
10 answers

Are there clusters available to rent?

I am wondering if there are clusters available to rent. Scenario: We have a program that will take what we estimate a week to run(after optimization) on a given file. Quite possibly, longer. Unfortunately, we also need to do approximately 300+…
Paul Nathan
  • 39,638
  • 28
  • 112
  • 212