Questions tagged [hpc]

High Performance Computing (HPC) refers to the use of supercomputers and computer clusters to solve a wide range of computationally intensive problems.

Systems with benchmark performance of hundreds of teraflops are usually considered to be supercomputers. A typical feature of these supercomputers is their large number of computing nodes, typically in the range of O(10^3) to O(10^6). This distinguishes them from small-to-midsize computing clusters, which usually have O(10) to O(10^2) nodes.

When writing software that aims to make effective use of these resources, a number of challenges arise that are usually not present when working on single-core systems or even small clusters:


Higher degree of parallelization required

According to the classical 1967 formulation of Amdahl's Law, the maximum speedup achievable on a parallel machine is limited by the fraction of the code that remains serial (i.e. the parts that cannot be parallelized). The more processors you have, the better your parallelization concept therefore has to be. Contemporary, overhead-aware reformulations of the law additionally account for the costs that parallel execution adds on top of the useful work: spawning processes, serializing and deserializing parameters and results, inter-process communication, and the minimum granularity (atomicity) of the work items. Once these add-on costs are included, the predicted speedup reflects the actual net benefit of true parallel execution much more closely, and it can even start to decrease again as more processors are added.
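As a minimal numerical sketch (the parallel fraction p and the linear overhead model o(N) = c * N below are illustrative assumptions, not measured values), the following C program contrasts the classical formula S(N) = 1 / ((1 - p) + p/N) with an overhead-aware variant:

    #include <stdio.h>

    /* Classical Amdahl speedup: p = parallel fraction, N = number of processors. */
    static double amdahl(double p, double N) {
        return 1.0 / ((1.0 - p) + p / N);
    }

    /* Overhead-aware variant: o(N) = c * N models per-run setup, SER/DES and
       communication costs that grow with the number of processes (an
       illustrative assumption, not a universal model). */
    static double amdahl_overhead(double p, double N, double c) {
        return 1.0 / ((1.0 - p) + p / N + c * N);
    }

    int main(void) {
        double p = 0.95;                  /* 95% of the work parallelizes */
        for (double N = 10; N <= 100000; N *= 10)
            printf("N=%8.0f  classic=%7.2f  with overhead=%7.2f\n",
                   N, amdahl(p, N), amdahl_overhead(p, N, 1e-6));
        return 0;
    }

With these assumed numbers the classical curve keeps rising towards 1/(1 - p) = 20, while the overhead-aware curve peaks and then falls again as N grows, which is exactly the effect the reformulated law captures.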


Specialized hardware and software

Most supercomputers are custom-built and use specialized hardware and/or software components, which means you have to learn a lot about new types of architectures if you want to get maximum performance. Typical examples are the network hardware, the file system, and the available compilers (including their optimization options).


Parallel file I/O becomes a serious bottleneck

Good parallel file systems handle multiple requests in parallel rather well. However, there is a limit, and most file systems do not support simultaneous access by thousands of processes. Reading from or writing to a single file therefore becomes internally serialized again, even if you are using parallel I/O concepts such as MPI I/O.
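For reference, a minimal MPI-IO sketch (file name, payload and block size are placeholders) in which every rank writes a disjoint block of one shared file through a collective call, so that the MPI library and the parallel file system at least get the chance to aggregate the requests:

    #include <mpi.h>

    #define N_LOCAL 1024                       /* doubles per rank (illustrative) */

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double buf[N_LOCAL];
        for (int i = 0; i < N_LOCAL; ++i) buf[i] = rank;   /* dummy payload */

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "output.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        /* Each rank writes its own region; the collective _all variant lets the
           implementation merge many small requests into fewer large ones. */
        MPI_Offset offset = (MPI_Offset)rank * N_LOCAL * sizeof(double);
        MPI_File_write_at_all(fh, offset, buf, N_LOCAL, MPI_DOUBLE,
                              MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        MPI_Finalize();
        return 0;
    }

Even with collective buffering like this, the number of I/O aggregators and file system servers stays finite, which is why single-file I/O eventually serializes at very large process counts.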


Debugging massively parallel applications is a pain

If you have a problem in your code that only appears when you run it with a certain number of processes, debugging can become very cumbersome, especially if you are not sure where exactly the problem arises. Typical examples of such process-count-dependent problems are the domain decomposition or the establishment of communication patterns (see the sketch below).
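A classic decomposition bug that only shows up for certain process counts is a naive 1-D block split: chunk = N / nprocs silently drops N % nprocs cells whenever nprocs does not divide N. A minimal sketch of a remainder-aware split (the global size N is an arbitrary illustrative value):

    #include <mpi.h>
    #include <stdio.h>

    #define N 1000003   /* global problem size, deliberately not a round number */

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* Spread the remainder over the first (N % nprocs) ranks. */
        int base = N / nprocs, rem = N % nprocs;
        int local_n = base + (rank < rem ? 1 : 0);
        int start   = rank * base + (rank < rem ? rank : rem);

        printf("rank %d owns cells [%d, %d)\n", rank, start, start + local_n);
        MPI_Finalize();
        return 0;
    }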


Load balancing and communication patterns matter (even more)

This is similar to the first point. Assume that one of your computing nodes takes a little bit longer (e.g. one millisecond) to reach a certain point where all processes have to be synchronized. With 101 nodes, you waste only 100 * 1 ms = 0.1 s of aggregate compute time; with 100,001 nodes, you already waste 100 s per synchronization. If this happens repeatedly (e.g. in every iteration of a long loop) and you have many iterations, using more processors quickly becomes uneconomical.
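A minimal sketch of how such a skew can be measured (the 1 ms delay is simulated with usleep purely for illustration): each rank times how long it spends waiting at a barrier, and the waits are summed on rank 0.

    #include <mpi.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) usleep(1000);        /* pretend rank 0 is 1 ms slower */

        double t0 = MPI_Wtime();
        MPI_Barrier(MPI_COMM_WORLD);        /* every other rank waits here */
        double waited = MPI_Wtime() - t0;

        double total_wait = 0.0;
        MPI_Reduce(&waited, &total_wait, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("aggregate wait per synchronization: %.4f s\n", total_wait);

        MPI_Finalize();
        return 0;
    }

Run with 101 ranks, the aggregate wait is roughly 0.1 s per synchronization point; with 100,001 ranks it would already be about 100 s, which is the scaling effect described above.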


Last but not least: power

Thermal ceilings and power-capping strategies add yet another dimension to performance tuning; in the end, end-to-end performance is what counts. Thermal and power-capping limits introduce an additional set of parameters that determine how efficiently HPC workloads can be scheduled within a time-constrained and power-capped computing infrastructure. Because the trade-offs differ widely between machines and workloads, the optimal thermal and power-capping configuration for distributing a workload over the infrastructure is rarely intuitive, and is often the opposite of what one would expect. In practice, repeated workloads (weather modelling is a typical example) adapt these settings as operational experience accumulates, because sufficiently extensive prior testing to settle them in advance is usually not possible.

1502 questions
8
votes
1 answer

Can you transpose array when sending using MPI_Type_create_subarray?

I'm trying to transpose a matrix using MPI in C. Each process has a square submatrix, and I want to send that to the right process (the 'opposite' one on the grid), transposing it as part of the communication. I'm using MPI_Type_create_subarray…
robintw
  • 27,571
  • 51
  • 138
  • 205
8
votes
1 answer

Can't run COMPSs application. ClassNotFoundException

I am learning COMPSs. Until now, everything has been working really well, but I only executed the examples given in the manual. Now that I want to run my own test application, I can't get it to work. I must be missing something, but I can't see what…
Victor Anton
  • 125
  • 4
8
votes
1 answer

Large-scale pseudoinverse

I would like to compute the Moore–Penrose pseudoinverse of an enormous matrix. Ideally, I would like to do it on a matrix that has 23 million rows and 1000 columns, but if necessary I can reduce the number of rows to 4 million by only running on…
Vebjorn Ljosa
  • 17,438
  • 13
  • 70
  • 88
8
votes
2 answers

How can I load a server's specific R installation (environment module) when launching a local installation of emacs?

I am using a cluster with environment modules. This means that I must specifically load any R version other than the default (2.13) so to load R 3.0.1, I have to specify module load R/3.0.1 R I have added module load R/3.0.1 to .bashrc, so that if…
David LeBauer
  • 31,011
  • 31
  • 115
  • 189
8
votes
1 answer

C/C++ Framework for distributed computing in a dynamic cluster

I am looking for a framework to be used in a C++ distributed number crunching application. The setup looks as follows: There is a master node which divides the problem domain into small independent tasks. The tasks are distributed to worker nodes…
Erik
  • 11,944
  • 18
  • 87
  • 126
7
votes
3 answers

CUDA, OpenCL, PGI, etc.... but what happened to GLSL and Cg?

CUDA, OpenCL, and the GPU options offered by the Portland Group are intriguing... Results are impressive (125-times speedup for some groups). It sounds like the next wave of GPGPU tools is poised to dominate the scientific computing world. …
Pete
  • 10,310
  • 7
  • 53
  • 59
7
votes
1 answer

"Cannot open the connection" - HPC in R with snow

I'm attempting to run a parallel job in R using snow. I've been able to run extremely similar jobs with no trouble on older versions of R and snow. R package dependencies prevent me from reverting. What happens: My jobs terminate at the parRapply…
Sarah
  • 1,614
  • 1
  • 23
  • 37
7
votes
3 answers

Strict load balancing of multiple .NET processes

I have a multi-process .NET (F#) scientific simulation running on Windows Server 2008 SE and 64 processors. Each time step of the simulation oscillates from 1.5 sec to 2 sec. As each process must wait for other processes, the overall speed is the…
Oldrich Svec
  • 4,191
  • 2
  • 28
  • 54
7
votes
3 answers

error: cgroup namespace 'freezer' not mounted. aborting

Trying to run slurmd: sudo systemctl start slurmd I display the status of the daemon and an error is displayed on the screen: >>sudo systemctl status slurmd ● slurmd.service - Slurm node daemon Loaded: loaded (/lib/systemd/system/slurmd.service;…
user8788726
7
votes
2 answers

Ensure hybrid MPI / OpenMP runs each OpenMP thread on a different core

I am trying to get a hybrid OpenMP / MPI job to run so that OpenMP threads are separated by core (only one thread per core). I have seen other answers which use numa-ctl and bash scripts to set environment variables, and I don't want to do this. I…
v2v1
  • 640
  • 8
  • 18
7
votes
1 answer

Best block size value for block matrix matrix multiplication

I want to do block matrix-matrix multiplication with the following C code. In this approach, blocks of size BLOCK_SIZE are loaded into the fastest cache in order to reduce memory traffic during calculation. void bMMikj(double **A , double **B , double…
7
votes
3 answers

Passing arguments to a python script in a SLURM batch script

I've written a python script that requires two arguments and works just fine when I run it on the command line with: pythonscript.py arg1 arg2 I need to run this in a SLURM batch script, but whenever I do I get an "illegal instruction" error and…
Jiffy
  • 73
  • 1
  • 2
  • 4
7
votes
4 answers

HPC (mainly on Java)

I'm looking for some way of using the number-crunching ability of a GPU (with Java perhaps?) in addition to using the multiple cores that the target machine has. I will be working on implementing (at present) the A* Algorithm but in the future I…
Insectatorious
  • 1,305
  • 3
  • 14
  • 29
7
votes
1 answer

hadoop/yarn and task parallelization on non-hdfs filesystems

I've instantiated a Hadoop 2.4.1 cluster and I've found that running MapReduce applications will parallelize differently depending on what kind of filesystem the input data is on. Using HDFS, a MapReduce job will spawn enough containers to maximize…
calvin
  • 75
  • 5
7
votes
1 answer

Serverless concurrent write access in Python

Are there any packages in Python that support concurrent writes on NFS using a serverless architecture? I work in an environment where I have a supercomputer, and multiple jobs save their data in parallel. While I can save the result of these…
Josh
  • 11,979
  • 17
  • 60
  • 96