Questions tagged [hpc]

High Performance Computing encompasses the use of "supercomputers" with large numbers of CPUs, large parallel storage systems and advanced networks to perform time-consuming calculations. Parallel algorithms and parallel storage are essential to the field, as are the issues that come with complex, fast networking fabrics such as InfiniBand.

High Performance Computing (HPC) encompasses many aspects of traditional computing and is used in a wide variety of fields, including but not limited to particle physics, computer animation/CGI for major films, cancer and genomics research, and climate modeling. HPC systems, sometimes called 'supercomputers', typically consist of large numbers of high-performance servers with many CPUs and cores, interconnected by a high-speed fabric or network.

A list of the 500 fastest computers on the planet, the TOP500, is maintained, as is a list of the 500 most energy-efficient systems. The performance of these systems is measured using the LINPACK benchmark, though a newer benchmark based on a conjugate gradient method (HPCG), which is more representative of modern HPC workloads, has been introduced. IBM, Cray and SGI are major manufacturers of HPC systems and software, though over 70% of the systems on the TOP500 list are based on Intel platforms.
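
LINPACK itself measures the achieved rate (Rmax) by timing the solution of a large dense linear system; the sketch below only illustrates the arithmetic behind the theoretical peak (Rpeak) that Rmax is compared against, and every hardware figure in it is invented for the example.

    /*
     * Back-of-the-envelope Rpeak calculation; all hardware figures are
     * hypothetical and chosen purely for illustration.
     */
    #include <stdio.h>

    int main(void)
    {
        double nodes            = 1000;  /* hypothetical cluster size       */
        double sockets_per_node = 2;
        double cores_per_socket = 16;
        double clock_ghz        = 2.5;
        double flops_per_cycle  = 16;    /* e.g. wide-vector FMA throughput */

        /* Theoretical peak in GFLOP/s */
        double rpeak = nodes * sockets_per_node * cores_per_socket
                     * clock_ghz * flops_per_cycle;

        /* LINPACK reports Rmax, the achieved rate; assume 75% efficiency */
        double rmax = rpeak * 0.75;

        printf("Rpeak: %.1f TFLOP/s\n", rpeak / 1000.0);
        printf("Rmax at 75%% efficiency: %.1f TFLOP/s\n", rmax / 1000.0);
        return 0;
    }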

Interconnect fabric technology is also crucial to HPC systems, many of which rely on internal high-speed networks built on InfiniBand or similar low-latency, high-bandwidth technologies. In addition to the interconnect, GPUs and coprocessors have been gaining popularity for their ability to accelerate certain types of workloads.

Software is an additional concern for HPC systems, as typical programs are not equipped to run at such a large scale. Many hardware manufacturers also produce their own software stacks for HPC systems, which include compilers, drivers, parallelization and math libraries, system management interfaces and profiling tools designed specifically for the hardware they produce.
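
The wiki does not name a particular library, but MPI is the de facto standard parallelization library shipped with most of these stacks; a minimal sketch of an MPI program that reports where each process runs, compiled with whatever MPI wrapper the vendor provides (e.g. mpicc), looks like this:

    /* Minimal MPI "hello" showing the kind of parallel runtime an HPC
     * software stack provides. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's ID          */
        MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total processes in the job */

        char host[MPI_MAX_PROCESSOR_NAME];
        int len;
        MPI_Get_processor_name(host, &len);    /* node the rank landed on    */

        printf("rank %d of %d on %s\n", rank, size, host);

        MPI_Finalize();
        return 0;
    }

Launched under a site's scheduler or directly (for example with mpirun -np 64 ./hello), the same binary runs as many cooperating processes spread across the cluster's nodes.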

Most HPC systems use a highly modified Linux kernel, stripped down to only the essential components required to run the software on the supplied hardware. Many modern HPC systems are set up in a 'stateless' manner, meaning that no OS data is stored locally on the compute nodes; instead, an OS image is loaded into RAM, typically over the network using PXE boot. This allows the nodes to be rebooted into a clean, known-good working state, which is desirable because it is sometimes difficult to cleanly tear down processes that were running in parallel across several nodes.
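
As a rough sketch of the stateless approach (file names below are placeholders and sites differ in the details), a PXELINUX menu entry served over TFTP that pulls a kernel and an in-RAM OS image across the network might look like:

    # pxelinux.cfg/default served over TFTP; names are placeholders
    DEFAULT stateless
    LABEL stateless
        KERNEL vmlinuz-compute
        APPEND initrd=compute-node.img ip=dhcp

Because the image lives only in RAM, a reboot discards all local state and returns the node to the known-good image.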

116 questions
1
vote
2 answers

Task spooler for computing server on Debian

Recently our university bought a computing server with one multi-core Xeon and 4 powerful GeForce video cards for a course on "High performance computing with CUDA". It runs Debian Squeeze. I'm trying to find a solution for…
Kirill
  • 143
  • 1
  • 6
1
vote
4 answers

A dozen Mac Minis vs a Dell rack server for parallel image processing

I need to do some large-scale image processing in parallel. I was thinking of running a dozen Mac Minis in parallel to do the data processing. I need to run Microsoft Windows on the machines so I can pull the data from the network using an ActiveX…
Naveen
  • 121
  • 1
  • 5
1
vote
1 answer

Can't force HP Z600 workstation into PXE

I've got an HPC 2008 cluster of HP Z600 workstations, and though my head node can add them to the cluster (using node.xml files) and can reboot them, when a Z600 powers on the PXE attempt just times out. When the cluster node boots up the head…
nick3216
  • 213
  • 3
  • 10
1
vote
4 answers

high level server design/programming question

I am interested in designing a simple LAN-based server which accepts and services a limited number of connections (< 25) from within the LAN at any time. The server generates images dynamically and transmits them to the clients at speeds of ~40-50…
John Qualis
  • 125
  • 4
1
vote
1 answer

What cluster management software exists for Windows, other than MS HPC Server?

Are there any open-source solutions? Or perhaps cheaper commercial ones? I'm surprised that googling does not yield anything meaningful for me.
jkff
  • 293
  • 1
  • 5
  • 10
1
vote
1 answer

Which compute node did a Sun Grid Engine job execute on?

What is the easiest way to determine which node a compute job was executed on, using Sun Grid Engine? qstat seems to list only running/queued jobs.
pufferfish
  • 2,830
  • 11
  • 39
  • 40
1
vote
2 answers

Windows HPC Server 2008 suitability for MATLAB

I want to set up another Hyper-V VM for installing MATLAB and doing some compute-intensive programming using C. I keep thinking that Windows HPC Server 2008 is designed for this sort of work. Would I be on the right track to set up a single VM with this…
GurdeepS
  • 1,646
  • 5
  • 26
  • 33
1
vote
1 answer

Rocks Clusters partitioning error when reinstalling nodes

I have an HPC cluster based on Rocks Clusters. When I added a new roll (Torque), I sent a kickstart command to all nodes to reinstall them. But after loading the X installer, all of the nodes showed me an error: Could not allocate requested partitions:…
Antiarchitect
  • 253
  • 2
  • 6
1
vote
2 answers

Security Concerns for High Performance Clusters

This is a VERY open question, since this is my first time creating a cluster. I'm just wondering what types of security concerns there will be and how to prevent them. Background information: Using SGE (currently installing and figuring out which…
user36181
1
vote
2 answers

Install software on Ubuntu using apt without root?

I’ve got accounts on many HPC clusters. The machines have a minimal install, and the admins won’t add much else. I need to install lots of typical software. Normally I’d do this with apt, but of course I don’t have root on these machines. Some…
projectshave
  • 154
  • 5
1
vote
1 answer

HPC master node without InfiniBand network: influence on compute nodes (Slurm management)

I'm writing because I'm facing an issue that I cannot solve while trying to configure a cluster whose master node (or frontend node) is a virtual machine managing nodes on an InfiniBand network. I use Slurm on these nodes; the frontend node is the Slurm…
SimoneM
  • 121
  • 1
1
vote
1 answer

Why can't the GPUs communicate in a multi-GPU server?

This is a Dell PowerEdge r750xa server with 4 Nvidia A40 GPUs, intended for AI applications. While the GPUs work well individually, multi-GPU training jobs or indeed any multi-GPU computational workload fails where at least 2 GPUs have to exchange…
isarandi
  • 341
  • 2
  • 11
1
vote
1 answer

Infiniband OpenSM N-to-N port routing configuration

I have 10 servers with two CPUs each and one Mellanox 100G InfiniBand NIC per CPU. Each NIC is connected to a single Mellanox 36-port 100G IB switch. My RDMA application runs as one process per NUMA node and binds to the local NIC to avoid cross-CPU…
Hugo Maxwell
  • 121
  • 4
1
vote
0 answers

Why can a non-root installation work across the whole cluster?

I recently installed Anaconda (which includes a new python3) locally in my account folder on a cluster with a dozen nodes (each node with several cores). I use it to install some package P that is used in my Python programs. --- In short, I…
xiaohuamao
  • 111
  • 2
1
vote
1 answer

Latency of memory accesses via interconnectors

I'm trying to compare the latencies of different node interconnects for a cluster. The goal is to minimize memory access latency. I have obtained some benchmarks for one of the hardware implementations of the NUMA architecture with many CPUs. This…
Piotr M
  • 33
  • 3