This is a follow-up to a previous question (Switch sending DHCP packets to wrong VLAN). It turns out the problem is not with the switch; I now believe it lies in the NIC hardware.

Basically, I'm seeing broadcast traffic leak across VLANs on an HP N40L with the Intel 82574L NIC.

First, DHCPDISCOVER appears in both VLANs (the untagged VLAN 1 and the tagged VLAN 10):

Jul 23 06:51:50 gateway dhcpd: DHCPDISCOVER from 90:84:0d:9c:13:df via eth0.10
Jul 23 06:51:50 gateway dhcpd: DHCPDISCOVER from 90:84:0d:9c:13:df via eth0: network 192.168.100.0/25: no free leases

DHCPOFFER is sent back on VLAN 10 only, because VLAN 1 has no free leases:

Jul 23 06:51:51 gateway dhcpd: DHCPOFFER on 192.168.100.207 to 90:84:0d:9c:13:df (iPhone) via eth0.10

DHCPREQUEST for the same address then appears in both VLANs again:

Jul 23 06:51:52 gateway dhcpd: DHCPREQUEST for 192.168.100.207 (192.168.100.200) from 90:84:0d:9c:13:df (iPhone) via eth0.10
Jul 23 06:51:52 gateway dhcpd: DHCPACK on 192.168.100.207 to 90:84:0d:9c:13:df (iPhone) via eth0.10
Jul 23 06:51:52 gateway dhcpd: DHCPREQUEST for 192.168.100.207 (192.168.100.200) from 90:84:0d:9c:13:df (iPhone) via eth0: wrong network.
Jul 23 06:51:52 gateway dhcpd: DHCPNAK on 192.168.100.207 to 90:84:0d:9c:13:df via eth0
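
To check how often this happens rather than eyeballing syslog, the log lines above can be grouped by client and message type. The following is a minimal sketch, assuming the dhcpd log format is exactly as shown in this question; the regular expression and the function name `interfaces_per_client_message` are illustrative, not part of dhcpd. It ignores timestamps, so it only flags that a client message was seen on more than one interface at some point in the log.

```python
import re
from collections import defaultdict

# Matches dhcpd syslog lines like the ones above and captures the message
# type, the client MAC address, and the interface the message arrived on.
LINE_RE = re.compile(
    r"dhcpd: (?P<msg>DHCP[A-Z]+) .*?"
    r"(?:from|to) (?P<mac>(?:[0-9a-f]{2}:){5}[0-9a-f]{2})"
    r".*? via (?P<iface>[\w.]+)"
)

# Only messages sent by the client are interesting for leak detection.
CLIENT_MESSAGES = ("DHCPDISCOVER", "DHCPREQUEST")


def interfaces_per_client_message(lines):
    """Return {(mac, msg): [ifaces]} for client messages seen on >1 interface."""
    seen = defaultdict(set)
    for line in lines:
        m = LINE_RE.search(line)
        if m and m.group("msg") in CLIENT_MESSAGES:
            seen[(m.group("mac"), m.group("msg"))].add(m.group("iface"))
    return {key: sorted(ifaces) for key, ifaces in seen.items() if len(ifaces) > 1}


# The log excerpt from this question.
LOG = """\
Jul 23 06:51:50 gateway dhcpd: DHCPDISCOVER from 90:84:0d:9c:13:df via eth0.10
Jul 23 06:51:50 gateway dhcpd: DHCPDISCOVER from 90:84:0d:9c:13:df via eth0: network 192.168.100.0/25: no free leases
Jul 23 06:51:51 gateway dhcpd: DHCPOFFER on 192.168.100.207 to 90:84:0d:9c:13:df (iPhone) via eth0.10
Jul 23 06:51:52 gateway dhcpd: DHCPREQUEST for 192.168.100.207 (192.168.100.200) from 90:84:0d:9c:13:df (iPhone) via eth0.10
Jul 23 06:51:52 gateway dhcpd: DHCPACK on 192.168.100.207 to 90:84:0d:9c:13:df (iPhone) via eth0.10
Jul 23 06:51:52 gateway dhcpd: DHCPREQUEST for 192.168.100.207 (192.168.100.200) from 90:84:0d:9c:13:df (iPhone) via eth0: wrong network.
Jul 23 06:51:52 gateway dhcpd: DHCPNAK on 192.168.100.207 to 90:84:0d:9c:13:df via eth0
"""

for (mac, msg), ifaces in interfaces_per_client_message(LOG.splitlines()).items():
    print(mac, msg, "seen on", ", ".join(ifaces))
```

For the excerpt above, both the DHCPDISCOVER and the DHCPREQUEST from 90:84:0d:9c:13:df are reported on eth0 and eth0.10.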

The switch has been replaced since my original question, where I suspected the switch. It was a Cisco; I've replaced it with an HP. I configure and manage dozens of HP switches, and I've triple-checked the config and am 100% sure it's correct. The relevant config (where port 25 is the N40L and port 26 is the WAP):

vlan 1 
   name "LAN" 
   untagged 1-25 
   ip address 192.168.100.99 255.255.255.128 
   no untagged 26 
   exit 
vlan 10 
   name "WLS" 
   untagged 26 
   no ip address 
   tagged 25 
   exit 

And the config on the server (CentOS 6):

gateway ~ # ip a s eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
    link/ether a0:b3:cc:e7:58:2e brd ff:ff:ff:ff:ff:ff
    inet 192.168.100.100/25 brd 192.168.100.127 scope global eth0
gateway ~ # ip a s eth0.10
4: eth0.10@eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP 
    link/ether a0:b3:cc:e7:58:2e brd ff:ff:ff:ff:ff:ff
    inet 192.168.100.200/25 brd 192.168.100.255 scope global eth0.10
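
The two interfaces sit in the two halves of 192.168.100.0/24, which is why the request for 192.168.100.207 is rejected as "wrong network" when it shows up on eth0. A small sketch using Python's standard `ipaddress` module makes the arithmetic explicit; the variable names are only illustrative, and the addresses are the ones from the output above.

```python
import ipaddress

# Interface addresses as shown by `ip a s` above.
eth0 = ipaddress.ip_interface("192.168.100.100/25")
eth0_10 = ipaddress.ip_interface("192.168.100.200/25")

# The address offered to the iPhone on VLAN 10.
offered = ipaddress.ip_address("192.168.100.207")

print(eth0.network)                # 192.168.100.0/25
print(eth0_10.network)             # 192.168.100.128/25
print(offered in eth0.network)     # False: hence the DHCPNAK via eth0
print(offered in eth0_10.network)  # True: hence the DHCPACK via eth0.10
```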

My questions are:
1) Has anyone else seen similar behaviour from this hardware? Google doesn't return any info.
2) What else can I do to confirm that this is a NIC issue?
3) Any magic solution? ;)

EDIT: VLAN 10 is tagged on the port to the N40L and untagged on only one other port, which definitely goes to the WAP (otherwise we wouldn't be seeing iPhones requesting leases), so the VLANs can't be accidentally cross-patched:

hp-switch# sho vlan 10

 Status and Counters - VLAN Information - Ports - VLAN 10

  802.1Q VLAN ID : 10          
  Name : WLS         
  Status : Port-based  

  Port Information Mode     Unknown VLAN Status    
  ---------------- -------- ------------ ----------
  25               Tagged   Learn        Up        
  26               Untagged Learn        Up        

Here is the full running config for the switch; it's quite a simple configuration: http://pastebin.com/5Zt76nAF

fukawi2
  • Do you have a DHCP relay configured on your switch? – Paul Gear Jul 23 '13 at 10:29
  • @PaulGear a request routed via a DHCP relay would have come in with the RelayAddress set accordingly so the lease would be given out of the scope matching the client's subnet. Plus, the log would have said so. – the-wabbit Jul 23 '13 at 11:54
  • @syneticon-dj Good point. – Paul Gear Jul 23 '13 at 20:55
  • @fukawi2 One way of confirming where the leak lies would be to change your switch config to limit the scope of VLAN 1 and then increase it again until you see the problem. Of course this might not be possible if you're running production traffic on it. – Paul Gear Jul 23 '13 at 20:57
  • @PaulGear Could you elaborate what you mean by changing the scope of VLAN 1? There is production traffic, but it's a small site with lots of 'dead' time when no-one uses the network (a fire station). – fukawi2 Jul 23 '13 at 23:25
  • I mean if you can get some downtime, change the VLAN so it's only on a limited range of ports, to see whether @syneticon-dj's suggestion is right; then you'll be able to track down where it's being bridged. – Paul Gear Jul 25 '13 at 02:05
  • Ah I see; they can't be accidentally patched to bridge the 2; there is only 1 untagged port for VLAN 10 and that DEFINITELY goes to the WAP. I'll update the question with the config output showing that. – fukawi2 Jul 25 '13 at 23:36
  • Build yourself an Ethernet tap. You will probably have to set the port speed to 100. Capture just one side to make sure that your server is sending the packet. Then you will know if you should investigate the server side or the network side. – longneck Jul 26 '13 at 00:28
  • @longneck Server is sending which packet? The DHCP duplicates are coming from the client to the server (as the server sees it). I can set up a mirror port and tcpdump that, which will be easier than building a tap ;) – fukawi2 Jul 26 '13 at 01:11

1 Answer

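Note that a DHCPDISCOVER is always wrapped in a UDP broadcast packet, whereas a DHCPREQUEST is not always: the request that follows an offer is broadcast too, but a renewal is unicast. So if renewals show up on both VLANs as well, you do not just have broadcast traffic "leaking".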

Based on your description, I would investigate further whether you have inadvertently bridged the VLANs (e.g. by misconfiguring software bridging on the N40L or by running a patch cord from an untagged port of VLAN 10 to an untagged port of VLAN 1). To rule this out first, run tcpdump on a different host and check whether a VLAN carries "foreign" traffic.
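
When looking at such a capture, the 802.1Q tag (if any) tells you which VLAN the switch placed a frame on. The sketch below shows how the tag is laid out in a raw Ethernet frame; `vlan_id` is a hypothetical helper, and it assumes you have the raw frame bytes from a capture point that preserves tags (for example a mirror of the tagged port, which depends on the switch). Frames captured on an untagged access port will carry no tag at all.

```python
import struct

TPID_8021Q = 0x8100  # EtherType value that marks an 802.1Q-tagged frame


def vlan_id(frame: bytes):
    """Return the 802.1Q VLAN ID of a raw Ethernet frame, or None if untagged."""
    if len(frame) < 14:
        raise ValueError("frame shorter than an Ethernet header")
    # Bytes 12-13 hold the EtherType, or the TPID if the frame is tagged.
    (tpid,) = struct.unpack_from("!H", frame, 12)
    if tpid != TPID_8021Q:
        return None
    if len(frame) < 16:
        raise ValueError("truncated 802.1Q tag")
    # Bytes 14-15 hold the TCI: 3 bits priority, 1 bit DEI, 12 bits VLAN ID.
    (tci,) = struct.unpack_from("!H", frame, 14)
    return tci & 0x0FFF


# Hand-built example headers: broadcast destination, the iPhone's MAC as source.
dst = bytes.fromhex("ffffffffffff")
src = bytes.fromhex("90840d9c13df")
tagged = dst + src + struct.pack("!HHH", TPID_8021Q, 10, 0x0800)
untagged = dst + src + struct.pack("!H", 0x0800)

print(vlan_id(tagged))    # 10
print(vlan_id(untagged))  # None
```

If the same DHCPDISCOVER from the client arrives at the server port both with tag 10 and without a tag, the duplication is happening before the NIC; if only the tagged copy is on the wire, the server side is the place to look.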

While this might also be a bug, contrary to widespread belief, VLAN functionality is not a NIC feature but is implemented in software in the Linux kernel. So if it were a bug, it would affect more than just one NIC or host type and would surely have been documented somewhere already.

the-wabbit
  • Thanks for your input :) There is definitely no patch cable joining the VLANs; the WAP port (26) is the only port untagged in VLAN 10. There is no bridging configured on the N40L; it is acting only as a router (in fact, `brctl` isn't even installed). I'll grab my laptop and tcpdump another VLAN 1 port to see if there's any VLAN 10 traffic on it. I agree that if it were a bug in the kernel's 802.1q code, it would be amazing for me to be the first to come across it, but given that it's happened with 2 different switches of different brands, it's quite a strange issue. – fukawi2 Jul 23 '13 at 23:23