Questions tagged [hpc]

High Performance Computing encompasses the use of "supercomputers" with large numbers of CPUs, large parallel storage systems, and advanced networks to perform time-consuming calculations. Parallel algorithms and parallel storage are essential to this field, as are issues surrounding complex, fast networking fabrics such as InfiniBand.

High Performance Computing (HPC) encompasses many aspects of traditional computing and is used by a variety of fields, including but not limited to particle physics, computer animation/CGI for major films, cancer and genomics research, and climate modeling. HPC systems, sometimes called 'supercomputers', typically consist of large numbers of high-performance servers with many CPUs and cores, interconnected by a high-speed fabric or network.

A list of the top 500 fastest computers on the planet (the TOP500) is maintained, as is a companion list of the 500 most energy-efficient systems (the Green500). The performance of these systems is measured using the LINPACK benchmark, though a newer benchmark based on a conjugate gradient method (HPCG), which is more representative of modern HPC workloads, has also been introduced. IBM, Cray, and SGI are major manufacturers of HPC systems and software, though it should be noted that over 70% of the systems on the TOP500 list are based on Intel platforms.
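As a rough illustration of how theoretical peak performance (Rpeak) relates to a system's configuration, here is a back-of-the-envelope calculation; the figures are made up for the example, not taken from any real system:

    Rpeak = nodes × cores per node × clock rate × FLOPs per cycle
          = 100   × 16             × 2.5 GHz    × 8
          = 32 TFLOPS

The Rmax that LINPACK actually measures is always some fraction of this peak, and the ratio of the two is one common measure of how efficiently a system is put together.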

Interconnect fabric technology is also crucial to HPC systems, many of which rely on internal high-speed networks built from InfiniBand or similar low-latency, high-bandwidth fabrics. In addition to interconnect technology, GPUs and coprocessors have recently been gaining popularity for their ability to accelerate certain types of workloads.

Software is an additional concern for HPC systems, as typical programs are not equipped to run at such a large scale. Many hardware manufacturers also produce their own software stacks for HPC systems, which include compilers, drivers, parallelization and math libraries, system management interfaces, and profiling tools designed specifically to work with the hardware they produce.
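A minimal sketch of what such parallel software looks like, written in C against the standard MPI API (the program and launch parameters shown are illustrative, not tied to any particular vendor stack):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size, len;
        char name[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);                 /* start the MPI runtime          */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's id (rank)       */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes      */
        MPI_Get_processor_name(name, &len);     /* which node this rank landed on */

        printf("rank %d of %d running on %s\n", rank, size, name);

        MPI_Finalize();                         /* shut the runtime down cleanly  */
        return 0;
    }

Compiled with a wrapper such as mpicc and launched with something like mpiexec -n 64 ./hello, the same binary runs as 64 cooperating processes spread across the cluster's nodes by the resource manager.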

Most HPC systems use a highly modified Linux kernel that is stripped down to only the essential components required to run the software on the supplied hardware. Many modern HPC systems are set up in a 'stateless' manner, meaning that no OS data is stored locally on compute nodes; instead, an OS image is loaded into RAM, typically over the network using PXE boot. This allows the nodes to be rebooted into a clean, known-good working state, which is desirable in HPC systems because it is sometimes difficult to cleanly terminate processes that were running in parallel across several nodes.
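As a sketch of how such a stateless boot might be wired up, here is a hypothetical PXELINUX menu entry; the file names and paths are invented for the example:

    # /tftpboot/pxelinux.cfg/default -- served to compute nodes over TFTP
    DEFAULT compute

    LABEL compute
        KERNEL vmlinuz-compute                         # kernel fetched over the network
        APPEND initrd=compute-node.img root=/dev/ram0  # OS image unpacked into RAM, no local disk

Because the root filesystem lives entirely in RAM, rebooting a node simply re-fetches the same known-good image.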

116 questions
1 vote, 0 answers

Torque queue issue

I am having trouble with Torque + Maui. The problem is the following: I have 2 queues, each queue has 10 associated nodes. If I submit 10k jobs to the first queue and I submit 1 job to the second one, the job in the second one remains in Q…
Andrea
1 vote, 1 answer

deploying base node for HPC cluster in Server 2012 R2 hangs

We are in the process of evaluating Server 2012 R2 Standard with HPC Pack for a small cluster of nodes (about 40 to start with; the current setup has only one compute node). For the moment we use old hardware to try things out and get a feeling for…
Holly
1 vote, 1 answer

No service for subscription

I am trying to set up bursting to Azure with a Windows HPC cluster. The cluster already works fine and I can start jobs on the machines that are on the local network. When I try to create a node template for Azure nodes, I enter my subscription…
1 vote, 1 answer

Parallel filesystem which schedules simultaneous file requests to mutually exclusive sets of OSSs

My environment is RHEL based; the interconnect is InfiniBand. I have some experience with Lustre. What I want to know is: is there a parallel file system where, if simultaneous write requests arrive, they are scheduled on mutually exclusive sets of…
hrs
1 vote, 1 answer

Windows VMs on ScaleMP cluster?

I was wondering whether the following stack would work, and if so, how well, and what sort of problems I might expect to encounter when setting it up?
Hardware layer - lots of cheap servers
SMP layer - ScaleMP
OS - Linux 64-bit (e.g. Red Hat…
user3490
1 vote, 1 answer

HPC / EC2 - optimizing NFS for reliability

In AWS EC2, I've set up a cluster of Linux virtual machines made up of an NFS file server and many clients. If the number of clients is above ~20, under heavy I/O I am experiencing loss of file integrity: e.g. gzipped files written by a client to the…
1 vote, 3 answers

Appropriate network file system for huge (5+ GB) files

I've got a number of servers used for HPC / cluster computing, and I've noticed that because some of the computations they run use huge files over NFS, this causes significant bottlenecks. I'm wondering how to address the issue. The…
Einar
1 vote, 1 answer

Open MPI can't launch remote nodes via SSH

I am trying to set up Open MPI between a few machines on our network. Open MPI works fine locally, but I just can't get it to work on a remote node. I can ssh into the remote machine (without a password) just fine, but if I try something like mpiexec…
oceanhug
1 vote, 1 answer

HPC OSS node issue with unreadable local HDD error

We have an HPC setup with four OSS servers (OSS1 to OSS4) and two MDS nodes (MDS1 and MDS2). It had been running without any problem until yesterday. This morning I found that OSS4 was shut down. I checked the OSS3 logs and found that it…
Newton
1 vote, 1 answer

Torque and Maui node status

I am new to Torque and Maui. I was checking node states to find out which nodes are free and which are in use. For Torque, one command is pbsnodes, which gives status and other info related to each node. When I checked Maui, I…
Nilesh
1 vote, 0 answers

Programs running in an NX session seem to pause when session disconnects

We are currently running an interactive HPC application which presents a graphical interface to the user, attaches to an HPC cluster and allows them to run and observe some computation. The user logs in to a front-end node via NoMachine NX Server…
ajdecon
1 vote, 1 answer

Non-exclusive job scheduling in PBS/Torque

The cluster resource manager Torque typically allocates compute nodes on an exclusive basis. However, when you have a lot of small jobs (like we do) running against multi-core compute nodes, this can result in a lot of wasted resources. Is there…
ajdecon
1 vote, 2 answers

HPC 2008 Time Share Single Core (Fractional Resource Scheduling)

Windows HPC 2008 appears to be restricted to one task per core. Is there any way to time-share multiple tasks (or jobs) on a single core?
TownCube
1 vote, 2 answers

hardware support for parallel file system

I am looking to use a parallel file system with MPI on a Linux cluster. I am wondering if parallel file systems like Lustre/Parallel Virtual File System require special hardware support (special hard disks).
justin waugh
1 vote, 1 answer

Optimizing Linux Compute Cluster

I am setting up a supercomputing Linux cluster at work. We ran the most recent HPCC benchmarks using OpenMPI and GotoBLAS2 but got really bad results. When I ran the benchmarks using one process for every core in the cluster, the results were much…
Zhehao Mao