Questions tagged [hpc]

High Performance Computing (HPC) refers to the use of supercomputers and computer clusters to solve a wide range of computationally intensive problems.

Systems with benchmark performance of hundreds of teraflops are usually considered to be supercomputers. A typical feature of these supercomputers is their large number of computing nodes, typically in the range of O(10^3) to O(10^6). This distinguishes them from small-to-midsize computing clusters, which usually have O(10) to O(10^2) nodes.

When writing software that aims to make effective use of these resources, a number of challenges arise that are usually not present when working on single-core systems or even small clusters:


Higher degree of parallelization required

According to the original 1967 formulation of Amdahl's Law, the maximum speedup achievable on a parallel computer is limited by the fraction of serial work in your code (i.e. the parts that cannot be parallelized). That means the more processors you have, the better your parallelization concept has to be. Contemporary, overhead-aware reformulations of the law additionally account for the add-on costs of parallel execution: process-spawning overheads, serialization/deserialization of parameters and results, inter-process communication, and atomicity-of-work effects on shared resources. Once these add-on costs are included, the comparison reflects the actual net speedup of parallel code execution far more closely.
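
The classical bound can be sketched in a few lines of Python (an illustration of the formula, not a benchmark): with serial fraction f and N processors, the speedup is at most 1 / (f + (1 - f) / N).

```python
def amdahl_speedup(serial_fraction, n_procs):
    """Upper bound on speedup per Amdahl's Law (classical formulation)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_procs)

# Even a 1% serial fraction caps the speedup near 100x,
# no matter how many processors you add:
for n in (10, 100, 10_000, 1_000_000):
    print(f"{n:>9} procs -> speedup <= {amdahl_speedup(0.01, n):.1f}")
```

Overhead-aware variants subtract further from this bound, since spawning, communication, and SER/DES costs themselves grow with the process count.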


Specialized hardware and software

Most supercomputers are custom-built and use specialized hardware and/or software components, which means you have to learn a lot about new types of architectures if you want to get maximum performance. Typical examples are the network hardware, the file system, and the available compilers (including their optimization options).


Parallel file I/O becomes a serious bottleneck

Good parallel file systems handle multiple requests in parallel rather well. However, there is a limit to it, and most file systems do not support the simultaneous access of thousands of processes. Thus reading/writing to a single file internally becomes serialized again, even if you are using parallel I/O concepts such as MPI I/O.


Debugging massively parallel applications is a pain

If you have a problem in your code that only appears when you run it with a certain number of processes, debugging can become very cumbersome, especially if you are not sure where exactly the problem arises. Examples of process-count-dependent problems are domain decomposition and the establishment of communication patterns.


Load balancing and communication patterns matter (even more)

This is similar to the first point. Assume that one of your computing nodes takes a little bit longer (e.g. one millisecond) to reach a certain point where all processes have to be synchronized. If you have 101 nodes, you only waste 100 * 1 millisecond = 0.1 s of computational time. However, if you have 100,001 nodes, you already waste 100 s. If this happens repeatedly (e.g. every iteration of a big loop) and if you have a lot of iterations, using more processors soon becomes non-economical.
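
The arithmetic above fits in a one-liner (a minimal sketch; the 1 ms skew is the example figure from the text): at every synchronization point, all other nodes idle until the straggler arrives.

```python
def sync_waste_seconds(n_nodes, skew_s=1e-3, iterations=1):
    """Aggregate compute time lost when one node lags by `skew_s`
    seconds at each of `iterations` synchronization points."""
    return (n_nodes - 1) * skew_s * iterations

print(sync_waste_seconds(101))                       # ~0.1 s per barrier
print(sync_waste_seconds(100_001))                   # ~100 s per barrier
print(sync_waste_seconds(100_001, iterations=1000))  # ~100,000 s over a run
```
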


Last but not least, the power

Thermal ceilings and power-capping strategies add yet another dimension to performance tuning, and end-to-end performance is what counts. Thermal and electrical-power constraints impose an extra set of parameters that determine how efficiently HPC workloads can be computed within a time- and power-constrained physical infrastructure. Because of the many differences between scenarios, the optimal thermal- and power-capping configuration for a given workload distribution is rarely obvious and is often counter-intuitive, and sufficiently extensive prior testing to settle it is usually impossible. For repeated workloads (as in weather modelling), these settings are therefore typically adapted over time as operational experience accumulates.

1502 questions
11
votes
2 answers

UPC in HPC - experience and suggestions

I am currently exploring some aspects of unified parallel C as an alternative to standard parallelization approaches in HPC (like MPI, OpenMP, or hybrid approaches). My question is: Does anyone have experience in UPC performance on large scale…
Mark
  • 1,333
  • 1
  • 14
  • 21
11
votes
14 answers

How to manipulate *huge* amounts of data

I'm having the following problem. I need to store huge amounts of information (~32 GB) and be able to manipulate it as fast as possible. I'm wondering what's the best way to do it (combinations of programming language + OS + whatever you think its…
11
votes
1 answer

Is "cudaMallocManaged" slower than "cudaMalloc"?

I downloaded CUDA 6.0 RC and tested the new unified memory by using "cudaMallocManaged" in my application. However, I found this kernel is slowed down. Using cudaMalloc followed by cudaMemcpy is faster (~0.56), compared to cudaMallocManaged…
Genutek
  • 387
  • 1
  • 5
  • 11
10
votes
7 answers

How to be able to "move" all necessary libraries that a script requires when moving to a new machine

We work on scientific computing and regularly submit calculations to different computing clusters. For that we connect using a Linux shell and submit jobs through SGE, Slurm, etc. (it depends on the cluster). Our codes are composed of python and…
Open the way
  • 26,225
  • 51
  • 142
  • 196
10
votes
2 answers

Using many mutex locks

I have a large tree structure on which several threads are working at the same time. Ideally, I would like to have an individual mutex lock for each cell. I looked at the definition of pthread_mutex_t in bits/pthreadtypes.h and it is fairly short,…
hanno
  • 6,401
  • 8
  • 48
  • 80
10
votes
2 answers

"WindowsError: [Error 206] The filename or extension is too long" after running a program very many times with subprocess

My python program prepares inputs, runs an external FORTRAN code, and processes the outputs in a Windows HPC 2008 environment. It works great, unless the code executes the external program between 1042-1045 times (Usually the problem converges…
partofthething
  • 1,071
  • 1
  • 14
  • 19
10
votes
2 answers

MPI + GPU : how to mix the two techniques

My program is well-suited for MPI. Each CPU does its own, specific (sophisticated) job, produces a single double, and then I use an MPI_Reduce to multiply the result from every CPU. But I repeat this many, many times (> 100,000). Thus, it occurred…
cmo
  • 3,762
  • 4
  • 36
  • 64
9
votes
1 answer

kubernetes with slurm, is this correct setup?

I saw that some people use Kubernetes co-existing with Slurm, and I was just curious as to why you would need Kubernetes with Slurm. What is the main difference between Kubernetes and Slurm?
zidni
  • 91
  • 1
  • 1
  • 2
9
votes
2 answers

How to ask GCC to completely unroll this loop (i.e., peel this loop)?

Is there a way to instruct GCC (I'm using 4.8.4) to unroll the while loop in the bottom function completely, i.e., peel this loop? The number of iterations of the loop is known at compilation time: 58. Let me first explain what I have tried. By…
Zheyuan Li
  • 71,365
  • 17
  • 180
  • 248
9
votes
2 answers

cannot send std::vector using MPI_Send and MPI_Recv

I am trying to send std::vector using MPI send and recv functions but I have gotten nowhere. I get errors like Fatal error in MPI_Recv: Invalid buffer pointer, error stack: MPI_Recv(186): MPI_Recv(buf=(nil), count=2, MPI_INT, src=0, tag=0,…
Abdullah
  • 314
  • 1
  • 3
  • 10
9
votes
4 answers

C++ programming for clusters and HPC

I need to write a scientific application in C++ doing a lot of computations and using a lot of memory. I have part of the job done, but due to high requirements in terms of resources I was thinking of starting to move to OpenMPI. Before doing that I have a…
Abruzzo Forte e Gentile
  • 14,423
  • 28
  • 99
  • 173
9
votes
2 answers

GNU parallel --jobs option using multiple nodes on cluster with multiple cpus per node

I am using gnu parallel to launch code on a high performance (HPC) computing cluster that has 2 CPUs per node. The cluster uses TORQUE portable batch system (PBS). My question is to clarify how the --jobs option for GNU parallel works in this…
Steve Koch
  • 912
  • 8
  • 21
9
votes
10 answers

MPI or Sockets?

I'm working on a loosely coupled cluster for some data processing. The network code and processing code are in place, but we are evaluating different methodologies in our approach. Right now, as we should be, we are I/O bound on performance issues,…
Nicholas Mancuso
  • 11,599
  • 6
  • 45
  • 47
9
votes
2 answers

Why would my parallel code be slower than my serial code?

In general, is it possible for a parallel code to be slower than the serial code? Mine is, and I am really frustrated by it! What can I do?
8
votes
10 answers

Are there clusters available to rent?

I am wondering if there are clusters available to rent. Scenario: We have a program that will take what we estimate a week to run(after optimization) on a given file. Quite possibly, longer. Unfortunately, we also need to do approximately 300+…
Paul Nathan
  • 39,638
  • 28
  • 112
  • 212