Questions tagged [hpc]

High Performance Computing (HPC) refers to the use of supercomputers and computer clusters to solve a wide range of computationally intensive problems.

Systems with benchmark performance in the hundreds of teraflops are usually considered supercomputers. A typical feature of these supercomputers is a large number of compute nodes, typically in the range of O(10^3) to O(10^6). This distinguishes them from small-to-midsize computing clusters, which usually have O(10) to O(10^2) nodes.

When writing software that aims to make effective use of these resources, a number of challenges arise that are usually not present when working on single-core systems or even small clusters:


Higher degree of parallelization required

According to Amdahl's Law (a law of diminishing returns for parallel computing, first formulated by Gene Amdahl in 1967), the maximum speedup you can achieve on a parallel computer is limited by the fraction of your code that must run serially (i.e. the parts that cannot be parallelized). The more processors you have, the better your parallelization concept therefore has to be. Contemporary, overhead-aware reformulations of the law go further: they also account for the add-on costs of spawning processes, serializing and deserializing parameters and results, and inter-process communication, as well as for the atomicity of work units. Comparisons adjusted for these costs reflect much more closely the net speedup that truly parallel code execution will actually deliver.
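In symbols, a minimal sketch (here p is the fraction of the runtime that can be parallelized, N is the number of processors, and o(N) is a lumped overhead term chosen for illustration, not standard notation):

    % Classical Amdahl's Law: upper bound on the speedup with N processors.
    S(N) = \frac{1}{(1 - p) + \frac{p}{N}}

    % Overhead-aware variant (illustrative): o(N) lumps together the add-on
    % costs of process spawning, parameter/result (de)serialization and
    % communication, expressed as a fraction of the serial runtime.
    S_o(N) = \frac{1}{(1 - p) + \frac{p}{N} + o(N)}

For example, with p = 0.95 the classical bound is S(N) < 1/(1 - 0.95) = 20, no matter how many processors you add; any nonzero o(N) only lowers that ceiling.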


Specialized hardware and software

Most supercomputers are custom-built and use specialized hardware and/or software components, so you have to learn a lot about new types of architectures if you want to get maximum performance. Typical examples are the network hardware, the file system, and the available compilers (including their optimization options).


Parallel file I/O becomes a serious bottleneck

Good parallel file systems handle multiple requests in parallel rather well. However, there is a limit, and most file systems do not support simultaneous access by thousands of processes. Reading from or writing to a single file therefore becomes internally serialized again, even if you are using parallel I/O concepts such as MPI I/O.
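As an illustration, here is a minimal MPI-IO sketch in C in which every rank writes one disjoint block of a shared file with a collective call; the file name and block size are made up for the example:

    /* Minimal MPI-IO sketch: each rank writes its own disjoint block of a
     * single shared file using a collective write. Compile with an MPI
     * compiler wrapper, e.g. mpicc. */
    #include <mpi.h>

    #define BLOCK 1024  /* doubles per rank; illustrative value */

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double buf[BLOCK];
        for (int i = 0; i < BLOCK; ++i)
            buf[i] = (double)rank;  /* dummy payload */

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "output.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY,
                      MPI_INFO_NULL, &fh);

        /* Collective write: ranks target disjoint offsets, so the MPI
         * library can aggregate requests. At very large rank counts even
         * this becomes serialized inside the file system. */
        MPI_Offset offset = (MPI_Offset)rank * BLOCK * sizeof(double);
        MPI_File_write_at_all(fh, offset, buf, BLOCK, MPI_DOUBLE,
                              MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        MPI_Finalize();
        return 0;
    }

Even a textbook pattern like this one funnels all requests into a single file, which is exactly where the serialization described above shows up at scale.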


Debugging massively parallel applications is a pain

If you have a problem in your code that only appears when you run it with a certain number of processes, debugging can become very cumbersome, especially if you are not sure where exactly the problem arises. Typical examples of problems that depend on the number of processes are domain decomposition and the establishment of communication patterns.
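One common low-tech workaround is to have every rank print its host name and PID and then spin, so that a debugger can be attached to the offending process; a minimal sketch in C (the volatile hold flag and the choice to hold only rank 0 are one common idiom, not a fixed recipe):

    /* Sketch of the classic "print PID, then spin" trick for attaching a
     * debugger (e.g. gdb -p <pid>) to a specific MPI rank. Once attached,
     * set `hold = 0` from the debugger to let the process continue. */
    #include <mpi.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        char host[256];
        gethostname(host, sizeof(host));
        printf("rank %d: host %s, pid %d\n", rank, host, (int)getpid());
        fflush(stdout);

        volatile int hold = (rank == 0);  /* hold only rank 0 here */
        while (hold)
            sleep(1);  /* in gdb: set var hold = 0, then continue */

        /* ... rest of the application ... */

        MPI_Finalize();
        return 0;
    }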


Load balancing and communication patterns matter (even more)

This is similar to the first point. Assume that one of your compute nodes takes a little bit longer (e.g. one millisecond) than the others to reach a point where all processes have to be synchronized. With 101 nodes you only waste 100 * 1 millisecond = 0.1 s of aggregate compute time, since the other 100 nodes each wait 1 ms. With 100,001 nodes, however, you already waste 100 s. If this happens repeatedly (e.g. in every iteration of a big loop) and you have a lot of iterations, using more processors soon becomes uneconomical.
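A quick way to quantify such an imbalance is to time each rank's work and compare the slowest rank against the average; a minimal sketch in C, where do_work() is a hypothetical stand-in for one iteration of the real computation:

    /* Sketch: measure per-iteration load imbalance across MPI ranks. */
    #include <mpi.h>
    #include <stdio.h>

    /* Hypothetical stand-in for the real work; rank 0 is deliberately
     * a bit slower so that the imbalance is visible. */
    static void do_work(int rank)
    {
        volatile double x = 0.0;
        long n = 10000000L + (rank == 0 ? 2000000L : 0);
        for (long i = 0; i < n; ++i)
            x += 1e-9;
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        double t0 = MPI_Wtime();
        do_work(rank);
        double my_time = MPI_Wtime() - t0;

        double max_time, sum_time;
        MPI_Reduce(&my_time, &max_time, 1, MPI_DOUBLE, MPI_MAX, 0,
                   MPI_COMM_WORLD);
        MPI_Reduce(&my_time, &sum_time, 1, MPI_DOUBLE, MPI_SUM, 0,
                   MPI_COMM_WORLD);

        if (rank == 0) {
            double avg = sum_time / size;
            /* max/avg near 1.0 means good balance; max - avg is roughly
             * the time every other rank spends waiting at the sync point. */
            printf("avg %.6f s, max %.6f s, imbalance %.2f\n",
                   avg, max_time, max_time / avg);
        }

        MPI_Finalize();
        return 0;
    }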


Last but not least: power and thermal constraints

Thermal ceilings and power-capping strategies add yet another dimension to performance tuning; in the end, end-to-end performance is what counts. Thermal and power limits introduce an extra set of parameters that determine how efficiently HPC workloads can be run within a given time budget on a power-capped computing infrastructure. Because the trade-offs differ so much between scenarios, the optimal thermal and power-capping configuration for distributing a workload across the infrastructure is rarely obvious, and is often outright counter-intuitive. For workloads that are run repeatedly (such as weather modelling), these settings are therefore typically adapted over time as experience accumulates, since sufficiently extensive testing in advance is usually not feasible.

1502 questions
5
votes
1 answer

Spark and InfiniBand

I am trying to use Spark in an HPC-focused cluster that has InfiniBand interconnects. This cluster does not provide support for IPoIB. I saw the Spark-RDMA project from Ohio State University here. I cannot find anyone else working on this, or…
M.Rez
  • 1,802
  • 2
  • 21
  • 30
5
votes
1 answer

Shuffle AVX 256 Vector elements by 1 position left/right - C intrinsics

I'm trying to find a more efficient way to "rotate" or shift the 32-bit floating-point values within an AVX __m256 vector to the right or left by one place, such that a7, a6, a5, a4, a3, a2, a1, a0 becomes 0, a7, a6, a5, a4, a3, a2, a1 (I don't mind…
MishMash95
  • 753
  • 7
  • 9
5
votes
1 answer

Desynchronized traces in COMPSs

I am generating traces of my executions using COMPSs 1.4. I have noticed that some tasks with data dependencies among them overlap in the tracefile. This shouldn't be possible. I also checked the dependencies graph and they seem to be correct. I…
5
votes
0 answers

Hierarchical clustering parallel processing in R

Is there a straightforward method to take advantage of parallel processing in R within an HPC cluster to make my hierarchical clustering computations faster? Right now, the average utilization of processors is just 1, though I can…
Kraamed
  • 51
  • 2
5
votes
1 answer

How to check which MCA parameters are used in OpenMPI?

In the OpenMPI codebase, each module has multiple variants. When calling mpirun, you can select the modules from the Modular Component Architecture (MCA) that you would like to use. The options include... collective algorithms (coll): basic, tuned,…
solvingPuzzles
  • 8,541
  • 16
  • 69
  • 112
5
votes
1 answer

File not found in task defined in COMPSs

I have implemented an application with COMP Superscalar and a task failed. Looking at the standard error file (job1_NEW.err) I got a File Not Found exception, but the file exists on my computer. Any idea what the error could be? EDIT: Added…
Jorge Ejarque
  • 269
  • 1
  • 8
5
votes
0 answers

OpenMP and C++11 multithreading

I am currently working on a project that mixes high-performance computing (HPC) and interactivity. As such, the HPC part relies on OpenMP (mainly for-loops with lots of identical computations) but it is included in a larger framework with a GUI and…
oLen
  • 5,177
  • 1
  • 32
  • 48
5
votes
2 answers

automatically retrieve results of bsub

I am looking for some general advice rather than a coding solution. Basically when submitting a job via bsub I can retrieve a log of the Stdin/Stdout by specifying any of the following: bsub -o log.txt % sends StdOut to log.txt bsub -u me@email…
brucezepplin
  • 9,202
  • 26
  • 76
  • 129
5
votes
1 answer

Mvapich2 buffer aliasing

I launched an MPI program with MVAPICH2 and got this error: Fatal error in PMPI_Gather: Invalid buffer pointer, error stack: PMPI_Gather(923): MPI_Gather() failed PMPI_Gather(857): Buffers must not be aliased There are two ways I think I could…
xijhoy
  • 167
  • 2
  • 5
  • 10
5
votes
2 answers

Profiling distributed systems

I was wondering about possible ways to track down performance bottlenecks in distributed systems. I am aware of tools like X-Trace and its offspring (e.g. Dapper) but I am more curious about the methodology rather than specific tools. In other…
5
votes
4 answers

STL containers speed vs. arrays

I just started working on a scientific project where speed really matters (HPC). I'm currently designing the data structures. The core of the project is a 3D grid of double values, used to solve a partial differential equation. Since speed here…
Chris
  • 2,030
  • 1
  • 16
  • 22
5
votes
4 answers

What are the advantages and disadvantages of GPGPU (general-purpose GPU) development?

I am wondering what the key thing is that helps you in GPGPU development and, of course, what constraints you find unacceptable. What comes to mind for me: Key advantage: the raw power of these things. Key constraint: the memory model. What's…
Fabien Hure
  • 644
  • 3
  • 7
  • 17
5
votes
2 answers

What is the most efficient (yet sufficiently flexible) way to store multi-dimensional variable-length data?

I would like to know the best practice for efficiently storing (and subsequently accessing) sets of multi-dimensional data arrays with variable length. The focus is on performance, but I also need to be able to handle changing the length of an…
5
votes
2 answers

Advantages of Tesla over GeForce

I've read some information that I could find over the Internet about the differences between those 2 series of cards, but I can't help the feeling that they are somehow advertisements. While the most powerful GeForce costs roughly $700, starting prices for…
Raven
  • 4,783
  • 8
  • 44
  • 75
5
votes
5 answers

Dearth of CUDA 5 Dynamic Parallelism Examples

I've been googling around and have only been able to find a trivial example of the new dynamic parallelism in Compute Capability 3.0 in one of their Tech Briefs linked from here. I'm aware that the HPC-specific cards probably won't be available…
maxywb
  • 2,275
  • 1
  • 19
  • 25