I have a cluster whose nodes are connected in a fat-tree InfiniBand (IB) topology. The switches are Qlogic 12300.

The problem is that certain pairs of nodes can't talk to each other, even though there are other nodes that can talk to both of the affected nodes.

I used ibtracert to diagnose the problem. Surprisingly, if I run the command on a separate node that can talk to both affected nodes, it succeeds and reports a feasible route.

However, ibtracert fails with an error if I run it from either of the two affected nodes.

What is the likely reason for this?

Thanks.

Wei

1 Answer

The two HCAs cannot talk to each other because that's how the routing in your subnet is configured. The fact that a third machine can talk to both of the "problematic" machines indicates that this is not a host problem but a subnet problem.
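To make that reasoning concrete, one option is to tabulate which source/destination pairs fail when the connectivity test is run from each source host. Below is a minimal Python sketch; the `results` mapping and the node names are hypothetical and would have to be collected by hand (for example, by running your test of choice on each source node). If no host fails against every other host, the failures are pair-specific, which points at routing rather than at a single machine.

```python
def failing_pairs(results):
    """Return the (src, dst) pairs whose connectivity test failed, sorted."""
    return sorted(pair for pair, ok in results.items() if not ok)


def isolated_hosts(hosts, results):
    """Return hosts that fail in both directions against every other host.

    A missing entry in `results` is treated as a success, so only recorded
    failures count against a host.
    """
    bad = {pair for pair, ok in results.items() if not ok}
    isolated = []
    for host in hosts:
        others = [h for h in hosts if h != host]
        if others and all(
            (host, o) in bad and (o, host) in bad for o in others
        ):
            isolated.append(host)
    return isolated


# Hypothetical data: nodeA and nodeB cannot reach each other,
# while nodeC reaches (and is reached by) both.
hosts = ["nodeA", "nodeB", "nodeC"]
results = {
    ("nodeA", "nodeB"): False,
    ("nodeB", "nodeA"): False,
    ("nodeA", "nodeC"): True,
    ("nodeC", "nodeA"): True,
    ("nodeB", "nodeC"): True,
    ("nodeC", "nodeB"): True,
}

print("failing pairs:", failing_pairs(results))
print("isolated hosts:", isolated_hosts(hosts, results))
```

An empty list of isolated hosts together with a non-empty list of failing pairs matches the situation described in the question.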

InfiniBand routing is a complicated topic, and from your description alone I can't tell how to fix it.

In general, the Subnet Manager calculates and configures routing on all switches. Which Subnet Manager are you using? Is it OpenSM running on some host, or Qlogic's SM running embedded on one of the switches?

If it's Qlogic, you need to go to their management UI and change or fix the routing algorithm. If it's OpenSM, you can run it with "minhop" routing (run "opensm -h" to see usage); this should make the nodes reachable. However, it won't really FIX the problem: you probably have something bad in the subnet topology, and that is where you need to focus if/once minhop routing resolves the symptom.

kliteyn
  • Thanks for the reply, very helpful. It is Qlogic's SM, running embedded on one of the core switches. I am using fat-tree routing. I noticed that between one of the core switches and a leaf switch there are two cables whose state shows "link up" but whose status is "initialisation" rather than active. Does this suggest the two cables are bad? If so, why do they show link up? – Wei Feb 28 '14 at 00:40
  • I'm not sure what you're referring to by "state" and "status". Each port has two types of states: physical and logical. "State" and "status" are probably Qlogic's vocabulary. I'm guessing that you see physical state as "link up" and logical as "init". The fact that this is the state that you're seeing on ports that are connected to switches hints that either your SM is down/stuck or there is some problem on one of the switches. Cables would be my very last suspect here. Check your SM (perhaps restart it), or reboot the core switch that has the problem, or the leaf switch. – kliteyn Mar 01 '14 at 21:02
  • I've seen links stuck at initialization before, and traced it down to a bad port on an IB switch. Because of the number of things going through that switch I was unable to do a test reboot, so I just bypassed the port. – MrBooks Apr 15 '14 at 15:29
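
The physical/logical distinction discussed in the comments can be checked mechanically. The following is a minimal Python sketch that scans port-status text for ports whose physical state is up while the logical state is still initializing. The sample text is made up and only loosely modelled on `ibstat`-style output; the exact field names on Qlogic switches are an assumption and will likely differ.

```python
import re

# Made-up sample in an ibstat-like layout (an assumption, not real output).
SAMPLE = """\
Port 1:
    State: Active
    Physical state: LinkUp
Port 2:
    State: Initializing
    Physical state: LinkUp
Port 3:
    State: Down
    Physical state: Polling
"""


def parse_ports(text):
    """Parse 'Port N:' sections into {port_number: {field: value}}."""
    ports = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        match = re.match(r"Port (\d+):$", line)
        if match:
            current = int(match.group(1))
            ports[current] = {}
        elif current is not None and ":" in line:
            key, value = line.split(":", 1)
            ports[current][key.strip().lower()] = value.strip()
    return ports


def stuck_in_init(ports):
    """Return ports that are physically up but logically still initializing."""
    return [
        number
        for number, fields in sorted(ports.items())
        if fields.get("physical state") == "LinkUp"
        and fields.get("state", "").lower().startswith("init")
    ]


print(stuck_in_init(parse_ports(SAMPLE)))
```

Ports reported by a check like this have a working physical link but have not been brought to the active state, which is why the comments point at the SM or the switch before the cables.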