Questions tagged [hpc]

High Performance Computing encompasses the use of "supercomputers", systems with large numbers of CPUs, large parallel storage systems and advanced networks, to perform time-consuming calculations. Parallel algorithms and parallel storage are essential to the field, as are the challenges of complex, fast networking fabrics such as InfiniBand.

High Performance Computing (HPC) encompasses many aspects of traditional computing and is used by a variety of fields, including but not limited to particle physics, computer animation/CGI for major films, cancer and genomics research, and climate modeling. HPC systems, sometimes called 'supercomputers', are typically large numbers of high-performance servers with many CPUs and cores, interconnected by a high-speed fabric or network.

The TOP500 project maintains a list of the 500 fastest computers on the planet, as well as a list of the 500 most energy-efficient systems (the Green500). The performance of these systems is measured using the LINPACK benchmark, though a newer benchmark based on a conjugate gradient method (HPCG), which is more representative of modern HPC workloads, has also been introduced. IBM, Cray and SGI are major manufacturers of HPC systems and software, though it should be noted that over 70% of the systems on the TOP500 list are based on Intel platforms.

Interconnect fabric technology is also crucial to HPC systems, many of which rely on internal high-speed networks built from InfiniBand or similar low-latency, high-bandwidth interconnects. In addition to interconnect technology, GPUs and coprocessors have recently been gaining popularity for their ability to accelerate certain types of workloads.
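As a concrete illustration, the state of an InfiniBand fabric can be inspected from any node with the standard diagnostic tools; a minimal sketch, assuming the infiniband-diags and libibverbs utilities are installed:

```
ibstat          # HCA state, link width/speed and LID of the local ports
ibv_devinfo     # verbs-level view of the same adapter
iblinkinfo      # walk the whole fabric and report the state of every link
```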

Software is an additional concern for HPC systems, as typical programs are not written to run at such a scale. Many hardware manufacturers also produce their own software stacks for HPC systems, which include compilers, drivers, parallelization and math libraries, system-management interfaces and profiling tools designed specifically for the hardware they produce.
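In practice, users reach these stacks through environment modules and compiler/MPI wrappers; a minimal sketch of a typical build-and-run cycle, where the module names, node counts and the Slurm launcher are assumptions about the site:

```
module load gcc openmpi            # select a compiler and MPI library from the site's stack
mpicc -O2 -o hello_mpi hello_mpi.c # the MPI wrapper adds the right include and library paths
srun -N 2 -n 8 ./hello_mpi         # launch 8 ranks across 2 nodes under the scheduler
```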

Most HPC systems use a heavily modified Linux kernel that is stripped down to only the essential components required to run the software on the supplied hardware. Many modern HPC systems are set up in a 'stateless' manner, meaning that no OS data is stored locally on compute nodes; instead an OS image is loaded into RAM, typically over the network using PXE boot. This allows the nodes to be rebooted into a clean, known-good working state, which is desirable because it is sometimes difficult to cleanly terminate processes that were running across several nodes in parallel.
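As a sketch of what "rebooting into a clean state" looks like operationally (hostnames and credentials are placeholders, and it assumes the nodes' BMCs speak IPMI and a PXE/TFTP server already serves the stateless image):

```
# Force node01 to network-boot its clean image on the next power cycle
ipmitool -I lanplus -H node01-bmc -U admin -P secret chassis bootdev pxe
ipmitool -I lanplus -H node01-bmc -U admin -P secret power cycle
# The node PXE-boots, loads the OS image into RAM and rejoins the scheduler
# with no state left over from previously running jobs.
```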

116 questions
2
votes
1 answer

Is Ondemand Governor enabled in current HPC clusters?

Will enabling the ondemand governor on an HPC cluster help save power? Are sleep states (C-states) enabled on HPC platforms? If not, what is the reason behind this?
kashyapa
  • 337
  • 4
  • 17
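For the governor and C-state question above, the current settings can be read straight from sysfs; a minimal sketch using the standard cpufreq/cpuidle interfaces (the kernel parameter at the end is one common way sites cap C-states, not a universal recommendation):

```
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor   # e.g. ondemand, performance
cpupower frequency-info                                     # available governors and frequency range
cpupower idle-info                                          # which C-states the kernel exposes
# Many HPC sites pin the performance governor and limit deep C-states to avoid
# latency jitter, for example by booting with intel_idle.max_cstate=1.
```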
2
votes
2 answers

Web-based HPC cluster node management

I am working on my school diploma thesis. The main goal is to create a web-based application where logged-in users can see free and busy nodes, turn them on and off, see what processes they are running, etc. I figured out that I could do something like this…
Skuja
  • 25
  • 2
2
votes
2 answers

Windows HPC Server 2008: private network across VMs?

Windows HPC Server 2008 provides the option to automatically deploy OS images to new cluster nodes, using Windows Deployment Services. However, this requires the HPC cluster to be set up with a "private network" network topology. From HPC Cluster…
Max
  • 365
  • 2
  • 5
  • 17
2
votes
5 answers

setting up a cluster

I have 5 PCs connected over a LAN through a switch. I want to connect them to form an HPC cluster. The OS may be any Linux version (currently I have installed Ubuntu 8.10, 9.10 and Fedora 10). Purpose of the cluster: 1. To execute my C code developed…
Vaibhav
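A minimal sketch of running a C/MPI code across the five machines in the question above, assuming Open MPI is installed everywhere and passwordless SSH is configured (hostnames and slot counts are placeholders):

```
cat > hosts <<'EOF'
pc1 slots=4
pc2 slots=4
pc3 slots=4
pc4 slots=4
pc5 slots=4
EOF
mpicc -O2 -o mycode mycode.c
mpirun --hostfile hosts -np 20 ./mycode
```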
2
votes
0 answers

Infiniband fabric with 3 nodes - newbie

I am trying to connect 3 HP Z840 workstations using Mellanox ConnectX-3 VPI 40/56GbE Dual-Port QSFP Adapters (MCX354A-FCBT) and a Mellanox SX6005 12-port non-blocking unmanaged 56Gb/s switch. Description of machines to be connected: oak-rd0-linux (main node…
theenemy
  • 121
  • 2
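One detail worth noting for the setup above: the SX6005 is an unmanaged switch, so it has no built-in subnet manager and the ports will not become active until one of the hosts runs opensm. A minimal sketch (package and service names vary by distribution):

```
systemctl enable --now opensm     # start a subnet manager on exactly one node
ibstat | grep -E 'State|Rate'     # ports should move from Initializing to Active
ibhosts                           # list the hosts the subnet manager has discovered
```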
2
votes
1 answer

How can I set up interactive-job-only or batch-job-only partition on a SLURM cluster?

I'm managing a PBS/torque HPC cluster, and now I'm setting up another cluster with SLURM. On the PBS cluster, I can set a queue to accept only interactive jobs by qmgr -c "set queue interactive_q disallowed_types = batch" and to accept only batch…
wdg
  • 153
  • 1
  • 5
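For reference, a hedged sketch of the Slurm side of this: partitions are defined in slurm.conf (node names and limits below are placeholders), but Slurm has no direct counterpart to PBS's disallowed_types, so restricting a partition to interactive-only or batch-only jobs is usually done with a job_submit/lua plugin or a QOS layered on top of the partitions:

```
# slurm.conf fragment (sketch)
PartitionName=interactive Nodes=node[01-04] MaxTime=08:00:00   State=UP
PartitionName=batch       Nodes=node[05-32] MaxTime=7-00:00:00 Default=YES State=UP
```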
2
votes
0 answers

Lustre glitch: latency of minutes

Using an HPC Lustre filesystem, we occasionally experience glitchiness where even simply opening a terminal and typing "ls" can take minutes to return. That is, any process that involves the filesystem has random massive latency (but generally…
benjimin
  • 121
  • 3
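When chasing this kind of stall, the usual first-pass client-side checks are worth recording; a minimal sketch, assuming a standard Lustre client install:

```
lfs df -h                      # queries every OST/MDT; a hang here points at a sick target
lfs check servers              # reports whether the client's connections to the servers are healthy
dmesg | grep -i lustre | tail  # client evictions and reconnects show up here
```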
2
votes
0 answers

Current single system image solutions

I'm designing a cluster for a small research institute. Since our computations require a large amount of memory, I'm looking for a solution that will allow our applications access to the whole memory distributed across different nodes. The access…
Piotr M
  • 33
  • 3
2
votes
1 answer

Considerations using consumer class (high-end) GPU in server?

Motivation: First of all, even though I have some knowledge of computer science, software development and Linux server administration, I have never looked into server hardware and I am a total "newbie" to it. Sorry if this question is trivial to most of…
Adrian Maire
  • 145
  • 1
  • 10
2
votes
2 answers

InfiniBand drivers: OFED or distro-included?

I'm setting up a Linux cluster with an InfiniBand network, and I'm quite a newbie in the InfiniBand world, so any advice is more than welcome! We are currently using Mellanox OFED drivers, but our InfiniBand cards are old and not recognized by the latest…
nirnaeth
  • 33
  • 6
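A quick way to see which stack is actually in use on a node before deciding; a minimal sketch (ConnectX-3 cards use the mlx4 driver family, and ofed_info only exists when a vendor OFED is installed):

```
ofed_info -s 2>/dev/null || echo "no vendor OFED installed"   # MLNX_OFED prints its version here
modinfo mlx4_core | grep -E '^(filename|version)'             # distro module vs. OFED-provided module
ibstat | grep -E 'CA type|Firmware'                           # adapter model and firmware level
```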
2
votes
1 answer

SLURM with "partial" head node

I am trying to install SLURM with NFS on a small Ubuntu 18.04 HPC cluster in a typical fashion, e.g. configure the controller (slurmctld) and clients (slurmd) and a shared directory, etc. What I am curious about is: is there a way to set it up such that…
rage_man
  • 123
  • 3
2
votes
1 answer

HTCondor high availability

I am currently trying to make the job queue and submission mechanism of a local, isolated HTCondor cluster highly available. The cluster consists of 2 master servers (previously 1) and several compute nodes and a central storage system. DNS, LDAP…
1
vote
1 answer

ifconfig apparently showing wrong RX/TX values for InfiniBand HCA

Recently, I executed a watch -n 1 ifconfig on one of our Linux cluster compute nodes while it was running a 48-process MPI run, distributed over several nodes. Oddly, while Ethernet packets seem to be counted correctly (a few kb/s due to the SSH…
andreee
  • 133
  • 1
  • 6
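The likely explanation for the observation above is that MPI traffic goes over native verbs/RDMA, which bypasses the IPoIB network interface entirely, so ifconfig/ip only sees IP traffic. The HCA's own port counters do include RDMA traffic; a minimal sketch (device name mlx4_0 and port 1 are placeholders):

```
cat /sys/class/infiniband/mlx4_0/ports/1/counters/port_rcv_data   # includes RDMA, unlike the ib0 interface counters
cat /sys/class/infiniband/mlx4_0/ports/1/counters/port_xmit_data
perfquery                                                         # same counters read via the fabric (infiniband-diags)
```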
1
vote
2 answers

Containers for HPC batch processing

We are facing the problem that a lot of people want to run different scientific software on our high performance computing cluster. Every user requires a different set of libraries and library versions and we do not want the administrator to deal…
J. Doe
  • 13
  • 3
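The approach most HPC sites take for this is an unprivileged container runtime such as Apptainer (formerly Singularity), so that each user ships their own library stack; a minimal sketch (image and script names are placeholders):

```
apptainer build mytool.sif docker://ubuntu:22.04              # users build images elsewhere, e.g. on a workstation
sbatch --wrap "apptainer exec mytool.sif ./run_analysis.sh"   # and run them as ordinary batch jobs
```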
1
vote
1 answer

Slurm: Have two separate queues for GPU and CPU-only jobs

At the moment, we have set up Slurm to manage a small cluster of six nodes with four GPUs each. That has been working great so far, but now we want to utilize the Intel Core i7-5820K CPUs for jobs which only require CPU processing power. Each CPU…
Micha
  • 121
  • 4
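A hedged sketch of one common way to do this in slurm.conf and gres.conf: declare the GPUs as generic resources and overlay a GPU partition and a CPU-only partition on the same nodes (all names, counts and limits below are placeholders):

```
# slurm.conf fragment
GresTypes=gpu
NodeName=node[1-6] CPUs=12 RealMemory=64000 Gres=gpu:4
PartitionName=gpu Nodes=node[1-6] MaxTime=2-00:00:00 State=UP
PartitionName=cpu Nodes=node[1-6] MaxTime=2-00:00:00 Default=YES State=UP

# gres.conf on each node
Name=gpu File=/dev/nvidia[0-3]

# users then request GPUs explicitly, e.g.:  sbatch -p gpu --gres=gpu:1 job.sh
```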