
I have a functional XenServer 6.5 pool with two nodes. It is backed by an iSCSI share on a Dell MD3600i SAN, and this works fine. It was set up before my time.

We've added three more nodes to the pool. However, these three new nodes will not connect to the storage.

Here's one of the original nodes, working fine:

[root@node1 ~]# iscsiadm -m session
tcp: [2] 10.19.3.11:3260,1 iqn.1984-05.com.dell:powervault.md3600i.6f01faf000eaf7f900000000531ae9bb (non-flash)
tcp: [3] 10.19.3.14:3260,2 iqn.1984-05.com.dell:powervault.md3600i.6f01faf000eaf7f900000000531ae9bb (non-flash)
tcp: [4] 10.19.3.12:3260,1 iqn.1984-05.com.dell:powervault.md3600i.6f01faf000eaf7f900000000531ae9bb (non-flash)
tcp: [5] 10.19.3.13:3260,2 iqn.1984-05.com.dell:powervault.md3600i.6f01faf000eaf7f900000000531ae9bb (non-flash)

Here's one of the new nodes. Notice the corruption in the address?

[root@vnode3 ~]# iscsiadm -m session
tcp: [1] []:-1,2 ▒A<g▒▒▒-05.com.dell:powervault.md3600i.6f01faf000eaf7f900000000531ae9bb (non-flash)
tcp: [2] 10.19.3.12:3260,1 iqn.1984-05.com.dell:powervault.md3600i.6f01faf000eaf7f900000000531ae9bb (non-flash)
tcp: [3] 10.19.3.11:3260,1 iqn.1984-05.com.dell:powervault.md3600i.6f01faf000eaf7f900000000531ae9bb (non-flash)
tcp: [4] 10.19.3.14:3260,2 iqn.1984-05.com.dell:powervault.md3600i.6f01faf000eaf7f900000000531ae9bb (non-flash)

The missing IP address here is .13, but another node is missing .12 instead.

Comments:

I have live running production VMs on the existing nodes and nowhere to move them, so rebooting the SAN is not an option.

Multipathing is disabled on the original nodes, despite the SAN having 4 interfaces. This seems suboptimal, so I've turned on multipathing on the new nodes.
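
For reference, multipathing on a XenServer 6.x host is normally toggled through the host's other-config keys rather than by editing multipath.conf directly. A minimal sketch, assuming the host UUID is filled in and the host is in maintenance mode with its SRs unplugged:

xe host-list params=uuid,name-label                                   # find the host UUID
xe host-param-set uuid=<host-uuid> other-config:multipathing=true     # enable multipathing
xe host-param-set uuid=<host-uuid> other-config:multipathhandle=dmp   # use the dm-multipath handler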

The three new nodes have awfully high system loads. The original boxes have a load average of 0.5 to 1, while the three new nodes are sitting at about 11.1 with no VMs running. top shows no high-CPU processes, so it's something kernel-related? There are no processes locked in state D (uninterruptible sleep).
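
For anyone chasing a similar load figure: high load with idle CPUs usually means tasks blocked inside the kernel, so it's worth double-checking for D-state processes and hung-task warnings. A quick sketch (nothing here is node-specific):

cat /proc/loadavg                                        # confirm the load average
ps -eo state,pid,wchan:30,comm | awk '$1=="D"'           # any tasks stuck in uninterruptible sleep?
dmesg | grep -iE "blocked for more than|hung_task"       # kernel hung-task warnings often implicate iSCSI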

If I tell XenCenter to "repair" those Storage Repositories it sits spinning its wheels for hours until I hit cancel. The message is "Plugging PBD for node5".
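
The same repair can be attempted from the command line, which at least shows which PBD is stuck. A sketch, with the SR and PBD UUIDs as placeholders:

xe sr-list type=lvmoiscsi params=uuid,name-label                        # find the iSCSI SR
xe pbd-list sr-uuid=<sr-uuid> params=uuid,host-uuid,currently-attached  # one PBD per host
xe pbd-plug uuid=<pbd-uuid>                                             # CLI equivalent of XenCenter's repair for one host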

Question: How do I get my new XenServer pool members to see the pool storage and work as expected?

EDIT Further information

  • None of the new nodes will do a clean reboot either - they get wedged in "stopping iSCSI" on a reboot and I have to use the DRAC to remotely repower them.
  • XenCenter is adamant that the nodes are in maintenance mode and that they haven't finished booting.

Good pool node:

[root@node1 ~]# multipath -ll
36f01faf000eaf7f90000076255c4a0f3 dm-36 DELL,MD36xxi
size=3.3T features='3 queue_if_no_path pg_init_retries 50' hwhandler='1 rdac' wp=rw
|-+- policy='round-robin 0' prio=12 status=enabled
| |- 14:0:0:6 sdg 8:96  active ready running
| `- 15:0:0:6 sdi 8:128 active ready running
`-+- policy='round-robin 0' prio=11 status=enabled
  |- 12:0:0:6 sdc 8:32  active ready running
  `- 13:0:0:6 sdh 8:112 active ready running
36f01faf000eaf6fd0000098155ad077f dm-35 DELL,MD36xxi
size=917G features='3 queue_if_no_path pg_init_retries 50' hwhandler='1 rdac' wp=rw
|-+- policy='round-robin 0' prio=14 status=enabled
| |- 12:0:0:5 sdb 8:16  active ready running
| `- 13:0:0:5 sdd 8:48  active ready running
`-+- policy='round-robin 0' prio=9 status=enabled
  |- 14:0:0:5 sde 8:64  active ready running
  `- 15:0:0:5 sdf 8:80  active ready running

Bad node:

[root@vnode3 ~]# multipath
Dec 24 02:56:44 | 3614187703d4a1c001e0582691d5d6902: ignoring map
[root@vnode3 ~]# multipath -ll
[root@vnode3 ~]#                           (i.e. no output at all; the exit code was 0)
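
If it helps anyone else reading, multipath can usually be asked why it is skipping a map; "ignoring map" may just mean the device is blacklisted (the WWID it prints here doesn't match any of the SAN volumes shown above). A sketch of the checks:

multipath -v3 2>&1 | grep -iE "blacklist|ignoring"       # verbose reasons for skipped maps
multipathd -k"show config"                               # effective blacklist and device sections
multipathd -k"show paths"                                # paths multipathd can actually see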

Same bad node:

[root@vnode3 ~]# iscsiadm -m session
tcp: [1] []:-1,2 ▒A<g▒▒▒-05.com.dell:powervault.md3600i.6f01faf000eaf7f900000000531ae9bb (non-flash)
tcp: [2] 10.19.3.12:3260,1 iqn.1984-05.com.dell:powervault.md3600i.6f01faf000eaf7f900000000531ae9bb (non-flash)
tcp: [3] 10.19.3.11:3260,1 iqn.1984-05.com.dell:powervault.md3600i.6f01faf000eaf7f900000000531ae9bb (non-flash)
tcp: [4] 10.19.3.14:3260,2 iqn.1984-05.com.dell:powervault.md3600i.6f01faf000eaf7f900000000531ae9bb (non-flash)

[root@vnode3 ~]# iscsiadm -m node --loginall=all
Logging in to [iface: default, target: iqn.1984-05.com.dell:powervault.md3600i.6f01faf000eaf7f900000000531ae9bb, portal: 10.19.3.13,3260] (multiple)
^C iscsiadm: caught SIGINT, exiting...

So it tries to log into an IP on the SAN, but spins its wheels for hours until I hit ^C.
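
One way to narrow this down is to target just the problem portal and drop the login timeout so the attempt fails quickly instead of hanging. A sketch using the standard open-iscsi timeout parameter (the 15-second value is arbitrary):

iscsiadm -m node -T iqn.1984-05.com.dell:powervault.md3600i.6f01faf000eaf7f900000000531ae9bb \
         -p 10.19.3.13:3260 --op=update -n node.conn[0].timeo.login_timeout -v 15
iscsiadm -m node -T iqn.1984-05.com.dell:powervault.md3600i.6f01faf000eaf7f900000000531ae9bb \
         -p 10.19.3.13:3260 --login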

Criggie

2 Answers


If the iSCSI discovery doesn't work, it's probably a matter of the IQNs on the XenServer hosts, the MD3600i, or both not recognizing each other. Make sure the MD3600i allows access from all of your XenServer hosts' IQNs using Dell's MDSM utility, and then try to redo the iSCSI discovery:

iscsiadm -m discovery -t st -p (MD3600i-primary-controller-IP-address)

iscsiadm -m node --loginall=all

iscsiadm -m session

You should at least be able to ping the primary IP address of the MD3600i from your XenServers if you have network access.

Also note that you'll first need to set up separate iSCSI interfaces on the NICs associated with each new XenServer host, and assign them static IP addresses that are unique and on the same subnets as your other hosts' iSCSI connections.
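
On a XenServer host that might look roughly like the following. This is only a sketch: the host and PIF UUIDs and the 10.19.3.103 address are placeholders, and the other-config key is where XenServer 6.x keeps the IQN to the best of my knowledge, so verify on your version.

cat /etc/iscsi/initiatorname.iscsi                                              # IQN the open-iscsi initiator presents
xe host-param-get uuid=<host-uuid> param-name=other-config param-key=iscsi_iqn  # IQN XenServer advertises
xe pif-list host-name-label=vnode3 params=uuid,device,IP                        # find the storage PIF on the new host
xe pif-reconfigure-ip uuid=<pif-uuid> mode=static IP=10.19.3.103 netmask=255.255.255.0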

I hope that helps, --Tobias

Tobias K
  • On the two working hosts, that command completes quickly, listing all four IPs on the SAN. On the three new nodes, it simply sits and does nothing - 15 minutes later I abort it with a ^C. Curiously, in another session tcpdump shows no traffic to the SAN when running the discovery command. tcpdump does show ping traffic, so I know the capture itself is working. – Criggie Dec 24 '15 at 03:58
  • Verify that the IQNs of your new hosts are allowed access to your MD3600i by checking the settings with Dell's MDSM storage management utility. Unless those new servers are allowed to connect to the LUN(s) in question, you will not be able to access the storage from your new XenServers. – Tobias K Dec 24 '15 at 09:15
  • Excellent thought - it's all there in the admin software. Note one new node has problems with one IP, and the others have problems with a different IP. I'm flummoxed. – Criggie Dec 24 '15 at 09:51
  • Verifying that the IQNs on the XenServer hosts match would be the next step. Make sure each NIC is configured with a correct IP address. A good test is to see whether the NICs used for the iSCSI connection can ping their counterparts on the other XenServer hosts, as well as the primary controller on the MD3600i. Also -- perhaps too obvious -- make sure multipathing is enabled on the new servers! This sort of thing involves a lot of small steps and they all have to be "right" for things to function properly. :-) – Tobias K Dec 24 '15 at 15:02

For closure, there were multiple things wrong.

  1. The hosts were configured for a 1500-byte MTU, whereas the SAN was using a 9216-byte MTU (quick checks for this and item 2 are sketched after this list).
  2. One of the hosts had a subtly different IQN from the one actually in use - the SAN listed the correct IQN as "unassigned" even though it looked visually identical to the IQN in use.
  3. My original two nodes had management IPs configured on their on-board 1 Gbit card. The three new nodes had an acceptable management IP configured on the bonded interface, in a VLAN. This was too different and mostly stopped the new hosts from coming out of maintenance mode after a boot.
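
A couple of quick checks that would have caught items 1 and 2 far earlier. A sketch only: eth2 stands in for whichever NIC carries the iSCSI traffic, and the network UUID is a placeholder.

ip link show eth2 | grep mtu                        # current MTU on the storage NIC
ping -M do -s 8972 -c 3 10.19.3.11                  # don't-fragment ping; fails unless jumbo frames work end to end
xe network-param-set uuid=<network-uuid> MTU=9000   # raise the MTU on the XenServer storage network (PIFs usually need replugging)
cat /etc/iscsi/initiatorname.iscsi                  # compare character-for-character with what MDSM lists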

Multipath seemed to have no bearing on the problem at all.

Deleting and fiddling around with files in /var/lib/iscsi/* on the XenServer nodes had no impact on the problem.

I had to use other means to reboot these newer boxes too - they would wedge up trying to stop the iSCSI service.

And finally, the corruption in the IQN name visible in iscsiadm -m session has vanished completely. It was possibly related to the MTU mismatch.

For future internet searchers - good luck!


Edit: in September 2021, I had exactly the same issue, with a Dell MD3800 SAN and some XCP-ng servers. Again, it was caused by a mismatched MTU. And Google just happened to serve up this question, which I had completely forgotten. Just goes to show how important it is to provide closure for future readers... that reader might be you.

Criggie
  • I'm not going to edit, but yes, a year later the same biscuit reappeared and I'm chasing down MTU issues screwing over iSCSI. – Criggie Sep 30 '22 at 07:30