Questions tagged [hpc]

High Performance Computing (HPC) refers to the use of supercomputers and computer clusters to solve a wide range of computationally intensive problems.

Systems with benchmark performance in the hundreds of teraflops are usually considered supercomputers. A typical feature of these machines is a large number of compute nodes, typically in the range of O(10^3) to O(10^6). This distinguishes them from small-to-midsize computing clusters, which usually have O(10) to O(10^2) nodes.

When writing software that aims to make effective use of these resources, a number of challenges arise that are usually not present when working on single-core systems or even small clusters:


Higher degree of parallelization required

According to the classical formulation of Amdahl's Law, the maximum speedup achievable on a parallel machine is limited by the serial fraction of your code, i.e. the parts that cannot be parallelized. The more processors you have, the better your parallelization concept therefore has to be. Overhead-aware reformulations of the law go further: they also account for the add-on costs of spawning processes, serializing and deserializing parameters and results, inter-process communication, and the minimum granularity (atomicity) of work units. Once these costs are included, the predicted net speedup of truly parallel execution is noticeably lower than the idealized bound, depending on how the parallel sections are prepared and executed.
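The relationship can be sketched in a few lines of Python; the `overhead` parameter is a deliberate simplification that lumps all add-on costs (spawning, SER/DES, communication) into one fixed fraction of the original runtime:

```python
def amdahl_speedup(serial_fraction, n_procs, overhead=0.0):
    """Speedup predicted by Amdahl's Law for n_procs processors.

    serial_fraction: part of the runtime that cannot be parallelized.
    overhead: illustrative simplification lumping all add-on costs
    (spawn, SER/DES, communication) into one fixed fraction.
    """
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_procs + overhead)

# With a 5 % serial fraction the classical limit is 1 / 0.05 = 20x,
# no matter how many processors you add; overheads lower it further.
ideal = amdahl_speedup(0.05, 1024)
realistic = amdahl_speedup(0.05, 1024, overhead=0.01)
```

Even a 1 % overhead fraction visibly erodes the attainable speedup, which is why the overhead-aware view matters most at large processor counts.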


Specialized hardware and software

Most supercomputers are custom-built and use specialized components for hardware and/or software, i.e. you have to learn a lot about new types of architectures if you want to get maximum performance. Typical examples are the network hardware, the file system, or the available compilers (including compiler optimization options).


Parallel file I/O becomes a serious bottleneck

Good parallel file systems handle multiple requests in parallel rather well. However, there is a limit to it, and most file systems do not support the simultaneous access of thousands of processes. Thus reading/writing to a single file internally becomes serialized again, even if you are using parallel I/O concepts such as MPI I/O.
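A common mitigation, which MPI-IO implementations apply internally under the name collective buffering, is to funnel I/O through a subset of "aggregator" ranks instead of letting every process touch the file system. The plain-Python sketch below (no real MPI calls; the group size of 64 is an illustrative assumption) only demonstrates the rank-to-aggregator mapping:

```python
def aggregator_of(rank, ranks_per_aggregator=64):
    """Map a rank to the one rank in its group that performs the I/O.

    Illustrative sketch of two-phase / collective-buffering I/O:
    every group of ranks_per_aggregator ranks forwards its data to a
    single aggregator, so only a few processes hit the file system.
    """
    return (rank // ranks_per_aggregator) * ranks_per_aggregator

n_ranks = 1000
writers = {aggregator_of(r) for r in range(n_ranks)}
# 1000 ranks funnel their data through only 16 writing processes.
```

With a thousand ranks, the file system then sees on the order of a dozen clients rather than a thousand, which is exactly the regime parallel file systems handle well.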


Debugging massively parallel applications is a pain

If you have a problem in your code that only appears when you run it with a certain number of processes, debugging can become very cumbersome, especially if you are not sure where exactly the problem arises. Examples for process number-dependent problems are domain decomposition or the establishment of communication patterns.
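A classic process-count-dependent bug is a naive 1-D domain decomposition that silently drops cells whenever the domain size is not divisible by the number of processes. The sketch below (illustrative, not taken from any real code) shows the bug next to a balanced fix:

```python
def naive_chunk(n_cells, n_procs, rank):
    """Buggy: integer division drops n_cells % n_procs trailing cells."""
    size = n_cells // n_procs
    return rank * size, rank * size + size  # half-open [start, end)

def balanced_chunk(n_cells, n_procs, rank):
    """Correct: the first n_cells % n_procs ranks take one extra cell."""
    size, rem = divmod(n_cells, n_procs)
    start = rank * size + min(rank, rem)
    return start, start + size + (1 if rank < rem else 0)

# With 100 cells on 7 processes the naive scheme covers only 98 cells --
# a bug that never appears when the cell count divides evenly.
covered_naive = sum(e - s for s, e in (naive_chunk(100, 7, r) for r in range(7)))
covered_fixed = sum(e - s for s, e in (balanced_chunk(100, 7, r) for r in range(7)))
```

Run the naive version with 4 or 10 processes and everything looks fine; run it with 7 and results are subtly wrong, which is precisely what makes such bugs painful to track down.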


Load balancing and communication patterns matter (even more)

This is similar to the first point. Assume that one of your computing nodes takes a little bit longer (e.g. one millisecond) to reach a certain point where all processes have to be synchronized. If you have 101 nodes, you only waste 100 * 1 millisecond = 0.1 s of computational time. However, if you have 100,001 nodes, you already waste 100 s. If this happens repeatedly (e.g. every iteration of a big loop) and if you have a lot of iterations, using more processors soon becomes non-economical.
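The arithmetic above reduces to a one-line formula; this sketch simply encodes it, using the same numbers as the example:

```python
def wasted_cpu_seconds(n_nodes, straggler_delay_s, iterations=1):
    """Aggregate CPU time the other n_nodes - 1 nodes spend idling at a
    barrier while one straggler arrives straggler_delay_s late, summed
    over `iterations` synchronization points."""
    return (n_nodes - 1) * straggler_delay_s * iterations

small = wasted_cpu_seconds(101, 0.001)      # 0.1 s of wasted compute
large = wasted_cpu_seconds(100_001, 0.001)  # 100 s of wasted compute
```

Because the waste grows linearly with both node count and iteration count, a one-millisecond imbalance that is invisible on a workstation can dominate the runtime at scale.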


Last but not least, power and thermal constraints

Thermal ceilings and power-capping strategies add yet another dimension to fine-tuning, because end-to-end performance is what ultimately counts. Thermal and power caps introduce an extra set of parameters that determine how efficiently HPC workloads can be computed within a time-constrained and power-capped computing infrastructure. The optimal thermal and power-capping configuration for distributing a workload across the machine is rarely obvious and often counter-intuitive. Since sufficiently extensive prior testing is usually impossible, recurring workloads (as in weather modelling) typically adapt these settings over time as operational experience accumulates.

1502 questions
4
votes
1 answer

Using Python multiprocessing on an HPC cluster

I am running a Python script on a Windows HPC cluster. A function in the script uses starmap from the multiprocessing package to parallelize a certain computationally intensive process. When I run the script on a single non-cluster machine, I obtain…
Trekkie
  • 964
  • 1
  • 9
  • 32
4
votes
1 answer

SCP failure running pyCOMPSs application

I have an error running an application implemented with pyCOMPSs. The application was working well, but I made some changes and it has stopped working. This is the stack trace I got from the application: --- START OF NESTED…
Jorge Ejarque
  • 269
  • 1
  • 8
4
votes
2 answers

COMPSs runcompss error. Can't run the matmul example app locally

I have recently started using COMPSs. I am following one of the documentation examples, but it doesn't seem to be working. I am trying to run the provided matmul example app and I am using this command: runcompss --classpath=./matmul.jar…
Victor Anton
  • 125
  • 4
4
votes
2 answers

MPI_Send proper way to send a matrix

I have to use MPI API for send/receive matrices in my programs. To send a matrix I used the below syntax: MPI_Send(matrix, ...) <- USE THIS MPI_Send(&matrix, ...) MPI_Send(&matrix[0][0], ...) Similar to the last one, but untested:…
bogdan.rusu
  • 901
  • 4
  • 21
  • 41
4
votes
1 answer

Barrier after MPI non-blocking call, without bookkeeping?

I'm doing a bunch of MPI_Iallreduce non-blocking communications. I've added these Iallreduce calls to several different places in my code. Every so often, I want to pause and wait for all the Iallreduce calls to finish. Version 1 with MPI_Request…
solvingPuzzles
  • 8,541
  • 16
  • 69
  • 112
4
votes
1 answer

Operating in parallel on a large constant datastructure in Julia

I have a large vector of vectors of strings: There are around 50,000 vectors of strings, each of which contains 2-15 strings of length 1-20 characters. MyScoringOperation is a function which operates on a vector of strings (the datum) and returns…
Frames Catherine White
  • 27,368
  • 21
  • 87
  • 137
4
votes
2 answers

What is scratch space /filesystem in HPC

I am studying about HPC applications and Parallel Filesystems. I came across the term scratch space AND scratch filesystem. I cannot visualize where this scratch space exists. Is it on the compute node as a mounted filesystem /scratch or on the…
RootPhoenix
  • 1,626
  • 1
  • 22
  • 40
4
votes
1 answer

How to replace an existing file in MPI with MPI_File_open

I am reading "Using MPI-2" and try to execute the code myself. I specified MPI_MODE_CREATE for MPI_File_open, but it actually does not create a new file, instead, it overwrites the previous file with the same name. I happen to find this out when…
Sean
  • 2,649
  • 3
  • 21
  • 27
4
votes
4 answers

Infiniband in Java

As you all know, OFED's Socket Direct protocol is deprecated and OFED's 3.x releases do not come with SDP at all. Hence, Java's SDP also fails to work. I was wondering what is the proper method to program infiniband in Java? Is there any portable…
RoboAlex
  • 4,895
  • 6
  • 31
  • 37
4
votes
2 answers

How does a machine with higher CPU performance (according to gprof) have worse real time performance?

Background I have a computationally intensive program that I am trying to run on a single supercomputer node. Here are the specs of one of the nodes on the supercomputer: OS: Redhat 6 Enterprise 64-bit CPU: Intel 2x 6-core 2.8GHz (12 cores) --…
Neal Kruis
  • 2,055
  • 3
  • 26
  • 49
4
votes
2 answers

MPI reuse MPI_Request

Is it safe to re-use a finished MPI_Request for another request? I have been using a pool of MPI_Request to improve performance and there is no error. But it would be good to know for sure.
w00d
  • 5,416
  • 12
  • 53
  • 85
4
votes
1 answer

Intel C++ and Microsoft Compiler

I am working on a high performance scientific application and found that pushing the computations through the Intel compiler gives a lot of speedup by generating fast code, vectorization and better auto-parallelization. But my main application is still in…
Sai Venkat
  • 1,208
  • 9
  • 16
4
votes
4 answers

Test MPI on a cluster

I am learning OpenMPI on a cluster. Here is my first example. I expect the output would show response from different nodes, but they all respond from the same node node062. I just wonder why and how I can actually get report from different nodes to…
Tim
  • 1
  • 141
  • 372
  • 590
4
votes
1 answer

Microsoft HPC: check if user is logged

When submitting a job to a Microsoft HPC Server by using the HPC's API, a job is submitted by calling the SubmitJob function: void SubmitJob (ISchedulerJob job, string username, string password); If the username is null, the system uses the…
Ron Teller
  • 1,880
  • 1
  • 12
  • 23
4
votes
1 answer

Is OpenMP and MPI hybrid program faster than pure MPI?

I am developing a program that runs on a 4-node cluster with 4 cores on each node. I have a quite fast OpenMP version of the program that only runs on one node and I am trying to scale it using MPI. Due to my limited experience I am…
Bob Fang
  • 6,963
  • 10
  • 39
  • 72