PXE - LiveCD suddenly not loading for many units, works for many others?

Question

I'm at a loss as to what's going wrong for me here. I have a few dozen units that work, and a few dozen units that don't, and they all vary by hardware and platform.

I have a CentOS 7.3 PXE server running cobbler with a number of CentOS-based LiveCD options on it. They worked fine up until this morning, and suddenly now we are seeing the following behavior when trying to load the vanilla CentOS LiveCD from PXE:

1. Hit enter
2. Kernel downloads
3. Initrd downloads, but silently (only 3 "."s show up, but I can tell it's downloading by watching tcpdump on the server)
4. The download finishes, the screen flashes, and the PXE menu comes back up
5. Subsequent retries result in the menu flashing and coming back up with an "invalid kernel parameter" error, shown so briefly that I had to record it with screencap software to even see it. Additionally, only 1 packet is actually sent to the client; it's as if it doesn't even attempt the download on the second try.

The PXE menu entry for the vanilla CentOS LiveCD looks like this:

/images/centos_livecd/centos_vmlinuz initrd=/images/centos_livecd/centos_livecd_initrd.img ksdevice=bootif lang= root=live:/centos_livecd.iso kssendmac text ks=http://10.101.24.21/cblr/svc/op/ks/profile/centos_livecd BOOTIF=<MAC>
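
Since the client reports an "invalid kernel parameter" error, it may help to see exactly how this entry breaks down into a kernel path and its arguments. The following is a minimal Python sketch, not anything Cobbler or Syslinux provides: the function name `parse_pxe_entry` is hypothetical, and it assumes arguments are separated by plain whitespace with no quoting.

```python
# The menu entry quoted above, unchanged (BOOTIF=<MAC> is the placeholder from the post).
ENTRY = (
    "/images/centos_livecd/centos_vmlinuz "
    "initrd=/images/centos_livecd/centos_livecd_initrd.img "
    "ksdevice=bootif lang= root=live:/centos_livecd.iso kssendmac text "
    "ks=http://10.101.24.21/cblr/svc/op/ks/profile/centos_livecd BOOTIF=<MAC>"
)


def parse_pxe_entry(entry):
    """Split a one-line PXE entry into (kernel, key=value params, bare flags).

    Hypothetical helper: assumes whitespace-separated tokens and no quoting.
    """
    tokens = entry.split()
    kernel, args = tokens[0], tokens[1:]
    params = {}
    flags = []
    for tok in args:
        if "=" in tok:
            # Split on the first "=" only, so values may contain "=" themselves.
            key, value = tok.split("=", 1)
            params[key] = value
        else:
            flags.append(tok)
    return kernel, params, flags


if __name__ == "__main__":
    kernel, params, flags = parse_pxe_entry(ENTRY)
    print("kernel:", kernel)
    for key, value in params.items():
        print(f"  {key} = {value!r}")
    print("flags:", flags)
```

Printing the values with `repr` makes empty arguments such as `lang=` stand out, which is handy when comparing the entry against one from a working profile.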

Again - I have about 20 units of varying motherboard and platform NOT working, and about 40 or so units of varying motherboard and platform that ARE working with the exact same menu entry.

Regular installer menu entries work great - CentOS, Ubuntu, etc.

So far I've tried:

- Using a vmlinuz from a CentOS install ISO
- Monitoring xinetd with "watch -n 1 systemctl status xinetd" and seeing the requests come in
- Monitoring tcpdump with "tcpdump -vvi |grep "

I'm at a loss, and I'm desperate. Does anyone have any ideas?

If I can gather more information using a different utility somehow on a system that is loading from PXE I would love to know how.

More information:

While tailing /var/log/messages, I noticed that the first try of loading the LiveCD appears to go swimmingly according to the network, but nothing happens on the client once the initrd.img is downloaded:

Jul 28 15:10:30 jarvis in.tftpd[12496]: RRQ from 10.101.26.176 filename /images/centos_livecd/centos_vmlinuz
Jul 28 15:10:30 jarvis in.tftpd[12496]: Client 10.101.26.176 finished /images/centos_livecd/centos_vmlinuz
Jul 28 15:10:30 jarvis in.tftpd[12501]: RRQ from 10.101.26.176 filename /images/centos_livecd/centos_livecd_initrd.img
Jul 28 15:11:39 jarvis in.tftpd[12501]: Client 10.101.26.176 finished /images/centos_livecd/centos_livecd_initrd.img
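
To check many clients at once, log lines like these can be paired up to show which transfers completed and how long each took. Below is a rough Python sketch under a few assumptions: the log format is exactly the one shown above, each request is handled by its own in.tftpd process (so the PID in brackets pairs an RRQ with its "finished" line), and the year has to be supplied because syslog timestamps omit it. The name `transfer_durations` is made up for illustration.

```python
import re
from datetime import datetime

# The four log lines quoted above.
SAMPLE = [
    "Jul 28 15:10:30 jarvis in.tftpd[12496]: RRQ from 10.101.26.176 filename /images/centos_livecd/centos_vmlinuz",
    "Jul 28 15:10:30 jarvis in.tftpd[12496]: Client 10.101.26.176 finished /images/centos_livecd/centos_vmlinuz",
    "Jul 28 15:10:30 jarvis in.tftpd[12501]: RRQ from 10.101.26.176 filename /images/centos_livecd/centos_livecd_initrd.img",
    "Jul 28 15:11:39 jarvis in.tftpd[12501]: Client 10.101.26.176 finished /images/centos_livecd/centos_livecd_initrd.img",
]

# Matches either "RRQ from <ip> filename <path>" or "Client <ip> finished <path>".
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+in\.tftpd\[(?P<pid>\d+)\]:\s+"
    r"(?:RRQ from (?P<rrq_ip>\S+) filename (?P<rrq_file>\S+)"
    r"|Client (?P<fin_ip>\S+) finished (?P<fin_file>\S+))"
)


def transfer_durations(lines, year=2017):
    """Return (finished, unfinished) transfers found in in.tftpd log lines.

    finished:   list of (client_ip, filename, seconds)
    unfinished: list of (client_ip, filename) with an RRQ but no "finished" line
    Assumption: one in.tftpd PID per request, as in the sample above.
    """
    started = {}
    finished = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # not an in.tftpd RRQ/finished line
        ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
        pid = m["pid"]
        if m["rrq_ip"]:
            started[pid] = (m["rrq_ip"], m["rrq_file"], ts)
        elif pid in started:
            ip, name, t0 = started.pop(pid)
            finished.append((ip, name, int((ts - t0).total_seconds())))
    unfinished = [(ip, name) for ip, name, _ in started.values()]
    return finished, unfinished


if __name__ == "__main__":
    done, pending = transfer_durations(SAMPLE)
    for ip, name, secs in done:
        print(f"{ip} {name}: {secs}s")
    for ip, name in pending:
        print(f"{ip} {name}: never finished")
```

On the sample, the kernel transfer completes within the same second and the initrd takes 69 seconds, which matches the timestamps above. As noted in the comments, this only shows that the server finished sending; it says nothing about what the client did with the file.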

Anything change in the environment since the last time it worked? Package updates, configuration changes etc. — Joe, Jul 28 '17 at 20:14
Is there anything common about the failing systems separate from the working systems? I'm thinking network path, hardware drivers or somesuch — Joe, Jul 28 '17 at 20:20
Thanks for both comments - the environment wasn't changed, and if it was, I'm not sure why it would work on some units but not others. @Joe, I don't think so - the working units are Intel NUCs and some Supermicro-based motherboards, and the non-working units are a mish-mash of Supermicro gear. — Locane, Jul 28 '17 at 20:32
Is there some kind of logging I can turn on to track what's going on while the kernel loads the initrd from the PXE menu? — Locane, Jul 28 '17 at 20:32
With my second question I was mainly wondering if the failed units all go through a switch that the working units don't, or something like that. It's clear that something is different now than before, right? Otherwise there wouldn't be failures. The trick is figuring out what that something is. All logs for cobbler go into /var/log/cobbler. Anything in there? tftp, as you've found, goes into /var/log/messages. The problem is that the tftp server usually doesn't know of an issue; it's just sending files. — Joe, Jul 28 '17 at 21:22
The trouble is that things are constantly running on here, and the hardware is constantly changing. I ran livemedia creator to generate a new livecd, which I haven't done for a while, but that shouldn't affect the system or the months-old-and-previously-working vanilla centos livecd. And why would it cause this strange behavior? — Locane, Jul 28 '17 at 21:49
I think driver problems with the kernel is a good guess too, but why suddenly would it change? And further, why wouldn't using the regular centos installer kernel work? — Locane, Jul 28 '17 at 21:50

score 0 · Answer 1 · answered Jul 28 '17 at 23:41

We were using files in /var/lib/tftpboot from Syslinux version 4.07, which is newer than the 4.05 that CentOS 7.3 ships with. We were using these files because 4.05 doesn't support PXE menu chaining, but 4.07 does.

Overwriting the files in /var/lib/tftpboot with the Syslinux 4.05 files found in /usr/share/syslinux resolved the issue, at the cost of losing PXE menu chaining.
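
For anyone wanting to check whether their own tftpboot directory has drifted from the packaged Syslinux files, a checksum comparison is one way to do it. This is only a sketch under assumptions: the function names are hypothetical, it looks only at regular files in the top level of each directory, and it matches files by name.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def compare_syslinux_files(deployed_dir, packaged_dir):
    """Compare top-level files in deployed_dir with same-named files in packaged_dir.

    Returns {filename: status}, where status is one of:
      "same"         - contents are identical to the packaged copy
      "differs"      - a packaged copy exists but the contents differ
      "not packaged" - no file of that name in packaged_dir
    Hypothetical helper; subdirectories are ignored.
    """
    deployed_dir, packaged_dir = Path(deployed_dir), Path(packaged_dir)
    report = {}
    for deployed in sorted(deployed_dir.iterdir()):
        if not deployed.is_file():
            continue
        packaged = packaged_dir / deployed.name
        if not packaged.is_file():
            report[deployed.name] = "not packaged"
        elif sha256_of(deployed) == sha256_of(packaged):
            report[deployed.name] = "same"
        else:
            report[deployed.name] = "differs"
    return report
```

With the paths from this answer, the call would be `compare_syslinux_files("/var/lib/tftpboot", "/usr/share/syslinux")`. Files reported as "differs" are the ones that do not come from the installed package, for example binaries copied in from another Syslinux version.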

Version 4.07 files worked fine for 2 weeks without issue; I'm still not sure why they suddenly stopped working for some units and not others.
