Questions tagged [hpc]

High Performance Computing (HPC) refers to the use of supercomputers and computer clusters to solve a wide range of computationally intensive problems.

Systems with benchmark performance in the hundreds of teraflops are usually considered supercomputers. A typical feature of these supercomputers is their large number of compute nodes, usually in the range of O(10^3) to O(10^6). This distinguishes them from small-to-midsize computing clusters, which usually have O(10) to O(10^2) nodes.

When writing software that aims to make effective use of these resources, a number of challenges arise that are usually not present when working on single-core systems or even small clusters:


Higher degree of parallelization required

According to the original 1967 formulation of Amdahl's Law, the maximum speedup one can achieve using parallel computers is limited by the fraction of serial work in your code (i.e. the parts that cannot be parallelized). That means the more processors you have, the better your parallelization concept has to be. Contemporary, overhead-aware reformulations of the law additionally account for the add-on costs of parallel execution: process-spawning overhead, serialization/deserialization of parameters and results, inter-process communication, and the atomicity of work (indivisible chunks of work limit how finely it can be split). Once these costs are included, the predicted speedup reflects the actual net benefit of parallel execution far more closely, and can even decrease as more processors are added.
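A minimal sketch of the classical formula and an overhead-extended variant (the per-processor overhead term `o` is an illustrative assumption, not part of Amdahl's original law):

```python
# Classical Amdahl's Law: S(N) = 1 / (s + (1 - s) / N),
# where s is the serial fraction and N the number of processors.
def amdahl_speedup(s, n):
    return 1.0 / (s + (1.0 - s) / n)

# Overhead-aware variant (illustrative assumption): add a per-processor
# cost o for spawning, SER/DES and communication. Speedup now peaks
# and then *declines* as N grows.
def amdahl_with_overhead(s, n, o):
    return 1.0 / (s + (1.0 - s) / n + o * n)

# With 1% serial code, the classical limit is 1/s = 100x:
print(round(amdahl_speedup(0.01, 100), 1))        # ~50.3
print(round(amdahl_speedup(0.01, 1_000_000), 1))  # ~100.0

# With even a tiny per-process overhead, more processors eventually hurt:
print(round(amdahl_with_overhead(0.01, 100, 1e-6), 1))      # ~50.0
print(round(amdahl_with_overhead(0.01, 100_000, 1e-6), 1))  # ~9.1
```

This is why the text above says the parallelization concept has to improve with the processor count: the overhead term grows with N while the parallel term shrinks.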


Specialized hardware and software

Most supercomputers are custom-built and use specialized components for hardware and/or software, i.e. you have to learn a lot about new types of architectures if you want to get maximum performance. Typical examples are the network hardware, the file system, or the available compilers (including compiler optimization options).


Parallel file I/O becomes a serious bottleneck

Good parallel file systems handle multiple requests in parallel rather well. However, there is a limit to it, and most file systems do not support the simultaneous access of thousands of processes. Thus reading/writing to a single file internally becomes serialized again, even if you are using parallel I/O concepts such as MPI I/O.
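As a toy illustration of this saturation effect (the numbers and the `k_max` concurrency limit are modelling assumptions, not properties of any real file system):

```python
import math

# Toy model (assumption): the parallel file system can service at most
# k_max write streams concurrently; additional writers are serialized
# into extra "waves" of I/O.
def write_time_s(n_procs, bytes_per_proc, stream_bw_bps, k_max):
    waves = math.ceil(n_procs / min(n_procs, k_max))
    return waves * bytes_per_proc / stream_bw_bps

# 1 GiB per process at 1 GiB/s per stream; file system saturates at 64 streams:
GIB = 2**30
print(write_time_s(64, GIB, GIB, 64))    # 1.0 s  -- fully parallel
print(write_time_s(4096, GIB, GIB, 64))  # 64.0 s -- serialized into 64 waves
```

This is the motivation behind collective/aggregated I/O schemes (e.g. two-phase MPI I/O), which funnel data through a subset of ranks instead of letting every process hit the file system at once.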


Debugging massively parallel applications is a pain

If you have a problem in your code that only appears when you run it with a certain number of processes, debugging can become very cumbersome, especially if you are not sure where exactly the problem arises. Examples of process-count-dependent problems are domain decomposition and the establishment of communication patterns.


Load balancing and communication patterns matter (even more)

This is similar to the first point. Assume that one of your compute nodes takes a little bit longer (e.g. one millisecond) to reach a certain point where all processes have to be synchronized: every other node idles until it arrives. If you have 101 nodes, you only waste 100 * 1 millisecond = 0.1 s of aggregate computational time. However, if you have 100,001 nodes, you already waste 100 s. If this happens repeatedly (e.g. in every iteration of a big loop) and you have a lot of iterations, using more processors soon becomes uneconomical.
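The arithmetic above can be written out directly (a sketch; the function name is illustrative):

```python
# Aggregate time wasted at a synchronization point when a single
# straggler arrives delay_s seconds late: every other node idles
# for that long.
def wasted_core_seconds(n_nodes, delay_s, n_iterations=1):
    return (n_nodes - 1) * delay_s * n_iterations

print(wasted_core_seconds(101, 1e-3))      # 0.1 core-seconds
print(wasted_core_seconds(100_001, 1e-3))  # ~100 core-seconds
# Repeated every iteration of a long loop, the waste multiplies:
print(wasted_core_seconds(100_001, 1e-3, n_iterations=10_000))  # ~1e6
```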


Last but not least: power and thermal constraints

Thermal ceilings and power-capping strategies add another dimension to performance tuning, because end-to-end performance is what ultimately matters. Thermal and electrical power limits impose another set of parameters that determine how efficiently HPC workloads can run within a time- and power-constrained computing infrastructure. Because of the many factors involved, the optimal thermal and power-cap configuration for a given workload distribution is rarely obvious, and is often counter-intuitive. In practice, these settings are tuned iteratively across repeated runs of the same workload (as in operational weather modelling), since sufficiently extensive prior testing is usually infeasible.
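A rough sketch of the trade-off, assuming a simple DVFS model in which dynamic power scales roughly with the cube of the clock frequency and a compute-bound job's runtime scales inversely with frequency (both are modelling assumptions, not measurements):

```python
# Toy DVFS model (assumptions: power ~ f^3, runtime ~ 1/f for a
# compute-bound workload).
def runtime_under_power_cap(base_runtime_s, base_power_w, cap_w):
    # Frequency scale factor required to fit under the power cap:
    f_scale = min(1.0, (cap_w / base_power_w) ** (1.0 / 3.0))
    return base_runtime_s / f_scale

# Capping a 300 W node at 150 W slows a 100 s job by ~26%,
# while halving its instantaneous power draw:
print(round(runtime_under_power_cap(100.0, 300.0, 150.0), 1))  # ~126.0
```

Under this model a cap costs far less runtime than it saves in power, which is why the optimal capping configuration is non-obvious and workload-dependent, as noted above.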

1502 questions
4
votes
3 answers

Counting registers/thread in Cuda kernel

The nSight profiler tells me that the following kernel uses 52 registers per thread: //Just the first lines of the kernel. __global__ void voles_kernel(float *params, int *ctrl_params, float dt, float currTime, …
Matteo Fasiolo
  • 541
  • 6
  • 17
4
votes
2 answers

Is the PVM (parallel virtual machine) library widely used in HPC?

Has everyone migrated to MPI (message passing interface) or is PVM still widely used in supercomputers and HPC?
joemoe
  • 5,734
  • 10
  • 43
  • 60
4
votes
2 answers

Maximum number of resident threads per multiprocessor VS. Maximum number of resident blocks per multiprocessor

I'm running into an issue on my K20 about resources with concurrent kernel execution. My streams only got a little overlap and then I thought this might be because of a resource limitation. So I referred to the manual, and I found this: The maximum number…
Archeosudoerus
  • 1,101
  • 9
  • 24
4
votes
2 answers

Red-Black Gauss Seidel and OpenMP

I was trying to prove a point with OpenMP compared to MPICH, and I cooked up the following example to demonstrate how easy it was to do some high performance in OpenMP. The Gauss-Seidel iteration is split into two separate runs, such that in each…
Smidstrup
  • 81
  • 2
  • 10
4
votes
1 answer

Slurm: get limits for account

Our cluster is using SLURM to manage our job queue. Slurm is monitoring how many core hours each account has used, and will down-prioritize jobs submitted from an account that has used more than the allotted core hours. Is there a command in slurm…
dahlo
  • 35
  • 3
4
votes
1 answer

Alternative to macros for parallel iteration?

This is going to be a long story, but maybe some of you would like to study this case. I am working on parallel graph algorithm development. I've chosen a cutting-edge, HPC parallel graph data structure named STINGER. STINGER's mission statement…
clstaudt
  • 21,436
  • 45
  • 156
  • 239
4
votes
1 answer

Python 64-bit unable to start correctly (0xc00000cc) on Windows HPC

I'm trying to get my application ported over to 64-bit Python. Everything works fine on my 64-bit Windows 7 workstation (with a E8600 Core 2 Duo), but when I try to execute the same Python 2.7.2 64-bit program (which is stored in a network location)…
partofthething
  • 1,071
  • 1
  • 14
  • 19
4
votes
1 answer

reading file with UPC

I'm starting to learn UPC, and I have the following piece of code to read a file: upc_file_t *fileIn; int n; fileIn = upc_all_fopen("input_small", UPC_RDONLY | UPC_INDIVIDUAL_FP , 0, NULL); upc_all_fread_local(fileIn, &n, sizeof(int), 1,…
dx_mrt
  • 707
  • 7
  • 13
3
votes
1 answer

Configure OpenMPI to run on a single machine (Debian/Linux)

I've installed OpenMPI on my Ubuntu 11.04 machine. My understanding is that I type mpirun and magic happens. What I don't understand is how to configure mpirun to make this magic happen only on my machine's two cores. How do you configure OpenMPI to…
Richard
  • 56,349
  • 34
  • 180
  • 251
3
votes
3 answers

Condor central manager could not see the other computing nodes

I connect three servers to form an HPC cluster using Condor as middleware. When I run the command condor_status from the central manager, it does not show the other nodes. I can run jobs on the central manager and connect to the other nodes via SSH…
user1011891
  • 107
  • 1
  • 6
3
votes
3 answers

Usecases/experience of JavaScript for HPC (High Performance Computing)

Microsoft has announced preview version of Hadoop on Azure. JavaScript can also be used to write MapReduce jobs on Hadoop. I know that there had been a lot of work on JavaScript in the browsers for the last few years to improve the performance…
Praveen Sripati
  • 32,799
  • 16
  • 80
  • 117
3
votes
1 answer

Does CentOS support Condor?

I plan to make HPC cluster using Condor as middle-ware. Is CentOS a good choice to be the OS I mean does it support condor and is there any tutorial which could be helpful in installation process? Regards,
user1011891
  • 107
  • 1
  • 6
3
votes
3 answers

High Performance Computing Terminology: What's a GF/s?

I'm reading this Dr Dobb's Article on CUDA In my system, the global memory bandwidth is slightly over 60 GB/s. This is excellent until you consider that this bandwidth must service 128 hardware threads -- each of which can deliver a large…
andandandand
  • 21,946
  • 60
  • 170
  • 271
3
votes
0 answers

How to call system() within %dopar% iterations in R

How should I call external programs from sub-instances of parallelized R? The problem could occur also on other contexts, but I am using library(foreach) and library(doFuture) on slurm-based HPC. As an example, I have created a hello.txt that…
Imsa
  • 69
  • 4
3
votes
0 answers

NetLogo on HPC (Slurm) without BehaviorSpace

I want to change a little bit my workflow for running NetLogo on a HPC using Slurm. For context, I run around 360 simulations in parallel, each one can take from 5 to 7 days (I know, not efficient) and they write some outputs at the end of X ticks.…