Questions tagged [hpc]

High Performance Computing involves using "supercomputers" with large numbers of CPUs, large parallel storage systems and advanced networks to perform time-consuming calculations. Parallel algorithms and parallel storage are essential to this field, as are complex, low-latency networking fabrics such as InfiniBand.

High Performance Computing (HPC) builds on many aspects of traditional computing and is used by a variety of fields, including but not limited to particle physics, computer animation/CGI for major films, cancer/genomics research and climate modeling. HPC systems, sometimes called 'supercomputers', typically consist of large numbers of high-performance servers with many CPUs and cores, interconnected by a high-speed fabric or network.

A list of the 500 fastest computers on the planet (the TOP500) is maintained, as is a list of the 500 most energy-efficient systems (the Green500). The performance of these systems is measured using the LINPACK benchmark, though a newer benchmark based on a conjugate gradient method (HPCG) is more representative of modern HPC workloads. IBM, Cray and SGI are major manufacturers of HPC systems and software, though it should be noted that over 70% of the systems on the TOP500 list are based on Intel platforms.

Interconnect fabric technology is also crucial to HPC systems, many of which rely on internal high-speed networks built from InfiniBand or similar low-latency, high-bandwidth links. In addition to interconnect technology, GPUs and coprocessors have been gaining in popularity for their ability to accelerate certain types of workloads.

Software is an additional concern for HPC systems, as typical programs are not equipped to run at such a large scale. Many hardware manufacturers also produce their own software stacks for HPC systems, which include compilers, drivers, parallelization and math libraries, system management interfaces and profiling tools specifically designed to work with the hardware they produce.
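
As a rough sketch (not tied to any particular vendor's stack), the parallelization libraries mentioned above usually centre on MPI; a minimal MPI program, in which every rank simply reports its identity, looks something like this:

    /* Minimal MPI sketch: each process (rank) reports its identity.
       Illustrative only; build with an MPI compiler wrapper such as
       mpicc and launch across nodes with mpirun or the scheduler
       (for example srun). */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);               /* start the MPI runtime */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank); /* this process's rank   */
        MPI_Comm_size(MPI_COMM_WORLD, &size); /* total number of ranks */
        printf("Hello from rank %d of %d\n", rank, size);
        MPI_Finalize();                       /* shut the runtime down */
        return 0;
    }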

Most HPC systems use a heavily modified Linux kernel that is stripped down to only the essential components required to run the software on the supplied hardware. Many modern HPC systems are set up in a 'stateless' manner, meaning that no OS data is stored locally on compute nodes; instead an OS image is loaded into RAM, typically over the network using PXE boot. This allows the nodes to be rebooted into a clean, known-good working state, which is desirable in HPC systems because it is sometimes difficult to effectively clean up processes that were running in parallel across several nodes.
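
As a sketch of how that network boot is commonly wired up (the file names below are hypothetical), a PXELINUX menu entry for a stateless compute node might look roughly like:

    # Hypothetical PXELINUX entry for a stateless compute node: the kernel
    # and an in-memory OS image are fetched over the network at boot, so
    # nothing has to be installed on the node's local disks.
    DEFAULT compute
    LABEL compute
        KERNEL vmlinuz-compute
        APPEND initrd=compute-image.img

The exact kernel arguments depend on the provisioning system in use (xCAT, Warewulf and similar tools each do this slightly differently), but the pattern of serving a kernel plus OS image over the network and running entirely from RAM is the same.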

116 questions
0
votes
1 answer

How do we configure Lustre to block client requests when under load, rather than failing?

We are using Lustre in a cluster with approximately 200TB of storage, 12 Object Storage Targets (that connect to a DDN storage system using QDR InfiniBand), and roughly 160 quad- and 8-core compute nodes. Most of the users of this system have no…
vy32
  • 2,088
  • 2
  • 17
  • 21
0
votes
2 answers

Low performance on HPC cluster (SGE) when running multiple jobs

I know this is a long shot but I'm clueless here. I'm running several computer simulations on a High Performance Computing (HPC) cluster running Oracle Grid Engine (SGE). A single job runs at a certain speed (roughly 80 steps per second); when I add jobs…
Yotam
  • 101
0
votes
1 answer

Windows HPC Server 2008 R2 SP3 upgrade

It's not clear to me from reading the documentation if I must update the clients, or whether I can just update the head node and compute nodes. Does anyone have any experience of this? I don't want to update my cluster and find that my customers can…
nick3216
  • 213
  • 3
  • 10
0
votes
3 answers

Building Windows Clusters

I'm a research student and I want to build a Windows cluster at home with my laptops to test my parallel codes. The problem is I'm using Windows 7 Home Premium, not a server edition. I'm using Visual Studio 2010 Ultimate and I installed Microsoft…
0
votes
2 answers

NFS denies mount, even though the client is listed in exports

We have a couple of servers (part of an HPC cluster) in which we're currently seeing some NFS behavior which is not making sense to me. node1 exports its /lscratch directory via NFS to node2, mounted at /scratch/node1. node2 also exports its own…
ajdecon
  • 1,301
  • 4
  • 14
  • 21
0
votes
1 answer

Python Version for HPC with Numpy/Scipy

I am looking to set up an HPC cluster so that it has a modern installation of Python with NumPy/SciPy on the compute nodes. The version of Linux we are using has Python 2.4 installed by default. I know there have been a number of new features and…
dtlussier
  • 103
  • 3
0
votes
0 answers

HPC node, Infiniband is DOWN

I have an HPC cluster with 17 nodes running CentOS 7 and a dedicated Mellanox SX6036 InfiniBand switch; each node has an InfiniBand FDR interface. Recently one node started giving errors and a quick look showed that the ib0 IPoIB interface was down. 4: ib0:…
Chris Woelkers
  • 298
  • 2
  • 11
0
votes
1 answer

Why does the login node connect to external networks but the allocated compute node fail in Slurm-GCP?

I've noticed that connecting to the internet from the allocated compute node via Slurm-GCP keeps failing. For example, using wget from the login node works successfully: [me@gcp-login0 ~]$ wget…
0
votes
0 answers

Setting up a RHEL cluster for high performance and load balancing

I have 7 servers with RHEL and I want to set up a cluster for them. We need the cluster for: 1. high performance 2. load balancing. I have a NAS for shared storage. I want to set up 1 server as a visualization node, one will be a master node and the remaining 5 will…
biplab
  • 5
  • 2
0
votes
1 answer

Linux missing LVM

Hello, so I have an Ubuntu HPC cluster and I've got a problem with storage. Whenever I try to access the storage from my compute nodes I can't; I keep getting this error: mount: mounting 192.168.100.211:/cm/node-installer on /installer_root failed: operation…
0
votes
0 answers

Singularity vs. Podman for HPC workloads?

Singularity is said to work well for HPC workloads. Red Hat is making an effort to make Podman more usable for HPC; e.g., it (and I presume this is with their ubi8 image) is said to work with MPI. That's about all I know. Does anyone have an opinion…
Cavalcade
  • 9
  • 2
0
votes
0 answers

Not able to SSH into 2 compute nodes on HPE cluster

I recently added two new compute nodes to an HPE cluster, but surprisingly, I am unable to SSH into the new compute nodes from the head node. [Unable to SSH to new Compute Nodes][1] (base) [root@hn001 ~]# su harender [harender@hn001…
0
votes
1 answer

Single SSH login for multiple machines?

I have a number of physical (desktop) machines running at the office as part of a new network to handle processing & serving Open Source data; some of these machines also house VMs. At the moment, if an employee wants admin access to this network, I…
0
votes
0 answers

Setting up Slurm on a cluster

My IT admin has set up a cluster with 3 nodes, which is administered via Windows Server. VMs are hosted via Hyper-V, including an Ubuntu VM to which a substantial portion of the cluster's resources has been allocated. Does anyone have any…
0
votes
1 answer

Changing the subnet on which my BeeGFS cluster operates

I've added some fiber channels between the machines that constitute my BeeGFS cluster in an effort to increase throughput. However, I have to leave the old copper Ethernet in place with its addressing intact for backwards compatibility. Is there a…
Jarmund
  • 535
  • 2
  • 6
  • 17