Questions tagged [hpc]

High Performance Computing (HPC) encompasses the use of "supercomputers" with large numbers of CPUs, large parallel storage systems, and advanced networks to perform time-consuming calculations. Parallel algorithms and parallel storage are essential to this field, as are the issues surrounding complex, fast networking fabrics such as InfiniBand.

High Performance Computing (HPC) encompasses many aspects of traditional computing and is used by a variety of fields, including but not limited to particle physics, computer animation/CGI for major films, cancer/genomics research, and climate modeling. HPC systems, sometimes called 'supercomputers', are typically large numbers of high-performance servers with many CPUs and cores, interconnected by a high-speed fabric or network.

A list of the 500 fastest computers on the planet (the TOP500) is maintained, as is a list of the 500 most energy-efficient systems. The performance of these systems is measured using the LINPACK benchmark, though a newer benchmark, HPCG, which uses a conjugate gradient method, is more representative of modern HPC workloads. IBM, Cray, and SGI are major manufacturers of HPC systems and software, though it should be noted that over 70% of the systems on the TOP500 list are based on Intel platforms.

Interconnect fabric technology is also crucial to HPC systems, many of which rely on internal high-speed networks built on InfiniBand or similar low-latency, high-bandwidth fabrics. In addition to interconnect technology, GPUs and coprocessors have been gaining popularity for their ability to accelerate certain types of workloads.

Software is an additional concern for HPC systems, as typical programs are not equipped to run at such a large scale. Many hardware manufacturers also produce their own software stacks for HPC systems, which include compilers, drivers, parallelization and math libraries, system management interfaces, and profiling tools specifically designed to work with the hardware they produce.
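Workload managers such as Slurm (which features in many of the questions below) are a typical part of these software stacks: users describe a job's resource needs in a batch script and the scheduler allocates nodes for it. A minimal sketch of such a script — the partition name, node counts, and program path here are hypothetical:

```shell
#!/bin/bash
#SBATCH --job-name=example       # job name shown in the queue
#SBATCH --partition=compute      # hypothetical partition name
#SBATCH --nodes=2                # number of nodes requested
#SBATCH --ntasks-per-node=20     # tasks (e.g. MPI ranks) per node
#SBATCH --time=01:00:00          # wall-clock limit (hh:mm:ss)

# Launch the program across all allocated nodes
srun ./my_parallel_app
```

Submitted with `sbatch job.sh`, the job waits in the queue until the requested nodes are free, then runs without further user interaction.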

Most HPC systems use a highly modified Linux kernel that is stripped down to only the essential components required to run the software on the supplied hardware. Many modern HPC systems are set up in a 'stateless' manner, meaning no OS data is stored locally on compute nodes; instead, an OS image is loaded into RAM, typically over the network using PXE boot. This allows nodes to be rebooted into a clean, known-good working state, which is desirable because it is sometimes difficult to effectively clean up processes that were running in parallel across several nodes.
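As a sketch of that stateless boot flow, an iPXE script on a compute node might look like the following (the head-node address and image paths are hypothetical):

```
#!ipxe
dhcp                                     # obtain an address on the provisioning network
kernel http://10.0.0.1/boot/vmlinuz      # fetch the kernel from the head node
initrd http://10.0.0.1/boot/rootfs.img   # fetch the in-RAM OS image
boot                                     # boot entirely from RAM; no local OS state
```

Because the root filesystem lives only in RAM, a reboot discards all local state and returns the node to the known-good image served by the head node.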

116 questions
1
vote
1 answer

ssh port forwarding (tunneling in HPC)

I have an application server that runs on a compute node. The server opens a port (9000) and I then run a command for tunneling between my local machine and the server: ssh -N -f -L 9000:compute-node:9000 user@myhpc Once this is done I can…
moth
  • 111
  • 4
1
vote
1 answer

HPC cluster master node as virtual machine

For a given small HPC cluster (~16 nodes) a master node is used as a front-end for users to login and interact with SLURM, and not as a computing node. The master node is currently a bare-metal server. Since the cluster is so small, the idea came up…
1
vote
1 answer

Wrong LDAP user ID is mapped into Slurm account management service

I configured a Slurm head node as follows: sssd to contact openLDAP slurmctld/slurmdbd/slurmd/munged to act as the Slurm controller and compute node ...where ray.williams is an LDAP user. Its UID can be mapped on the node. SSH login works…
Nicolas De Jay
  • 209
  • 2
  • 11
1
vote
1 answer

Single-node SLURM server: restrict interactive CPU usage

I have SLURM set up on a single node, which is also a 'login node'. I would like to restrict interactive CPU usage, e.g. outside the scheduling system. I found the following article which suggests using cgroups for this:…
Compizfox
  • 384
  • 1
  • 6
  • 18
0
votes
1 answer

What does "CPU Minutes" mean exactly?

I'm actually trying to report cluster utilization in Slurm but I don't understand the metric CPU Minutes. [root@XXXX]# sreport cluster Utilization Start=2018-12-01…
m4hmud
  • 3
  • 3
0
votes
1 answer

Exascale Power Consumption

I have read a lot of articles about exascale and found that it may consume a power envelope of approximately 20MW. Is that on a daily basis, a yearly basis, or every second? Please enlighten me. Here are the papers I have…
alyssaeliyah
  • 81
  • 1
  • 8
0
votes
1 answer

Configure Singularity to do headless rendering / use OpenGL / glxgears / glxinfo

I want to do headless rendering on a server where I do not have root permissions. Therefore, I created a Singularity container like this: Bootstrap: docker From: nvidia/cuda:9.0-runtime-ubuntu16.04 %post apt-get update && apt-get -y install \ …
thigi
  • 101
  • 4
0
votes
0 answers

How to handle mpi head node failure?

There is an app which starts with mpirun. If a compute node fails then all processes crash, but if only the head node fails (for example, a reboot) then processes get stuck on the compute nodes. How do I get rid of these zombie processes automatically?
Severgun
  • 163
  • 2
  • 8
0
votes
0 answers

Ideal configuration for a head node?

Which hardware should I concentrate on, when assembling a head node for an HPC cluster? The main task for the head node is to relay instructions to the compute nodes which will be running artificial intelligence algorithms. Ubuntu 14.04 LTS will be…
Rushat Rai
  • 111
  • 4
0
votes
1 answer

SSH vs qlogin to use all processors of a computing node

I have an SGE cluster consisting of four computing nodes, each with 20 processors. I do not mind giving one particular user the full capabilities of one specific node, i.e. I do not mind if he/she uses all 20 processors. My question then is, should…
Paco el Cuqui
  • 199
  • 1
  • 1
  • 8
0
votes
0 answers

Deployment of Base-Node via iSCSI in Server 2012R2 HPC cluster fails (can not join domain)

We are currently evaluating Server 2012R2 with HPC Pack for an upcoming project. Sadly we are stuck at deploying the base node. The node boots via PXE (iPXE) and connects to iSCSI, installs Windows but then seems unable to join the domain. Once the…
0
votes
2 answers

Numerous pbs_server errors in /var/log/messages

On supercomputer's management node we receive numerous errors such as: pbs_server: LOG_ERROR::is_request, bad attempt to connect from 10.10.0.254:1023 (address not trusted - check entry in server_priv/nodes) And after them nearly every minute…
0
votes
1 answer

Running jobs in a HPC cluster

I'm quite new to the HPC environment. Is there any difference between running a job on one node utilizing 8 cores and running the same job on 8 nodes utilizing 1 core each, in terms of performance or walltime used? PS: I'm working on a project which involves…
Ashwin
  • 1
0
votes
0 answers

Microsoft HPC: mixing windows and linux blades

I have a working Windows HPC cluster with 32 blades, all of them are using Windows HPC. My question is: can I install Linux on 16 blades and keep the other 16 on Windows? Is there a specific version of Linux that I can use? update What would I like…
Delta
  • 189
  • 3
  • 9
0
votes
2 answers

Running ScaleMP on top of OpenStack

Looking for feedback from anyone who has already played with running ScaleMP Linux appliances in OpenStack (KVM). A short description of the setup (w/ or w/o InfiniBand, total amount of RAM, etc) and its performance for matrix vector multiplication…