Regarding ICMP "Fragmentation needed, DF bit set" or ICMP packet too big message

Question

I'm injecting an ICMP "Fragmentation needed, DF bit set" message into the server. Ideally, the server should then start sending packets no larger than the size given in the ICMP 'next-hop MTU' field, but this is not working.

Here is the server code:

#!/usr/bin/env python
import os
import socket
import time

HOST = '192.168.0.17'       # Address the server listens on
PORT = 12349                # Port the server listens on
SEND_COUNT = 9              # Number of payloads sent per connection

s = socket.socket()         # Create a TCP socket
s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
s.bind((HOST, PORT))        # Bind to the address and port
payload = os.urandom(1600)  # Larger than a single 1500-byte MTU

s.listen(5)                 # Wait for client connections
while True:
    c, addr = s.accept()    # Establish connection with client
    print('Got connection from %s:%d' % addr)
    for _ in range(SEND_COUNT):
        c.sendall(payload)
        time.sleep(5)
    c.close()

Here is the client code:

#!/usr/bin/env python
# This is the client.py file
import socket

HOST = '192.168.0.17'       # Server address
PORT = 12349                # Must match the port the server listens on

s = socket.socket()         # Create a TCP socket
s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
s.connect((HOST, PORT))
while True:
    data = s.recv(1024)
    if not data:            # Empty read: the server closed the connection
        break
    print(repr(data))
s.close()

Scapy to inject ICMP:

###[ IP ]###
  version= 4
  ihl= None
  tos= 0x0
  len= None
  id= 1
  flags= DF
  frag= 0
  ttl= 64
  proto= ip
  chksum= None
  src= 192.168.0.45
  dst= 192.168.0.17
  \options\
###[ ICMP ]###
  type= dest-unreach
  code= fragmentation-needed
  chksum= None
  unused= 1300

send(ip/icmp)

The unused field shows up as next-hop MTU in Wireshark. Is the server smart enough to notice that the DF bit was not set when it was communicating with the client, even though it still receives an ICMP "Fragmentation needed, DF bit set" message? If not, why is the server not reducing its packet size from 1500 to 1300?

score 6 · Accepted Answer · edited Oct 07 '21 at 05:54

First of all, let's answer your first question (is ICMP sent over TCP?).

ICMP runs directly over IP, as specified in RFC 792:

ICMP messages are sent using the basic IP header.

This can be a bit confusing, as ICMP is classified as a network layer protocol rather than a transport layer protocol, but it makes sense given that ICMP is merely an addition to IP that carries error, routing and control messages and data. Thus, it can't rely on the TCP layer to transfer itself, since the TCP layer depends on the IP layer, which ICMP helps to manage and troubleshoot.

Now, let's deal with your second question (How does TCP come to know about the MTU if ICMP isn't sent over TCP?). I've tried to answer this question to the best of my understanding, with reliance on official specifications, but perhaps the best approach would be to analyze some open source network stack implementation in order to see what's really going on...

The TCP layer may come to know of the path's MTU value even though the ICMP message is not layered upon TCP. It's up to the OS's network stack implementation to notify the TCP layer of the MTU, so that TCP can then use this value to update its MSS value.

RFC 1122 requires that the ICMP message includes the IP header as well as the first 8 bytes of the problematic datagram that triggered that ICMP message:

Every ICMP error message includes the Internet header and at least the first 8 data octets of the datagram that triggered the error; more than 8 octets MAY be sent; this header and data MUST be unchanged from the received datagram.

In those cases where the Internet layer is required to pass an ICMP error message to the transport layer, the IP protocol number MUST be extracted from the original header and used to select the appropriate transport protocol entity to handle the error.

This illustrates how the OS can pinpoint the TCP connection whose MSS should be updated, as these 8 bytes include the source and destination ports.
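As a rough illustration of that lookup, the sketch below (Python 3) pulls the addresses, protocol number, ports and sequence number out of the datagram quoted in an ICMP error. The function name `parse_quoted_datagram` and the returned dictionary are hypothetical, chosen only for this example; a real network stack does this in the kernel and in its own way.

```python
import socket
import struct

def parse_quoted_datagram(quoted):
    """Parse the datagram quoted in an ICMP error (illustrative helper only).

    `quoted` is the ICMP payload that follows the 8-byte ICMP header:
    the original IPv4 header plus at least 8 bytes of its payload.
    """
    ihl = (quoted[0] & 0x0F) * 4          # IPv4 header length in bytes
    proto = quoted[9]                     # IP protocol number (6 is TCP)
    src_ip = socket.inet_ntoa(quoted[12:16])
    dst_ip = socket.inet_ntoa(quoted[16:20])
    # The first 8 bytes of a TCP header hold both ports and the sequence number.
    src_port, dst_port, seq = struct.unpack("!HHI", quoted[ihl:ihl + 8])
    return {"proto": proto, "src": (src_ip, src_port),
            "dst": (dst_ip, dst_port), "seq": seq}
```

The protocol number selects the transport, and the address/port pairs are enough to identify one TCP connection on the host that sent the original datagram.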

RFC 1122 also states that there MUST be a mechanism by which the transport layer can learn the maximum transport-layer message size that may be sent for a given {source, destination, TOS} triplet. Therefore, I assume that once an ICMP Fragmentation needed and DF set error message is received, the MTU value is somehow made available to the TCP layer that can use it to update its MSS value.
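To make the MTU-to-MSS relationship concrete, here is a small arithmetic sketch. It assumes IPv4 with a 20-byte IP header and a 20-byte TCP header (no IP options); the function names are made up for this example and do not correspond to any particular stack.

```python
IPV4_HEADER_LEN = 20   # bytes, assuming no IP options
TCP_HEADER_LEN = 20    # bytes, fixed part of the TCP header

def mss_from_mtu(mtu):
    """Largest TCP segment payload (without TCP options) that fits in one datagram."""
    return mtu - IPV4_HEADER_LEN - TCP_HEADER_LEN

def payload_per_segment(mtu, tcp_options_len=0):
    """Payload bytes per segment once TCP options (e.g. timestamps) are subtracted."""
    return mss_from_mtu(mtu) - tcp_options_len

def split_sizes(total, mtu, tcp_options_len=0):
    """Sizes of the segments needed to carry `total` bytes of application data."""
    per_segment = payload_per_segment(mtu, tcp_options_len)
    sizes = [per_segment] * (total // per_segment)
    if total % per_segment:
        sizes.append(total % per_segment)
    return sizes
```

Under these assumptions an MTU of 1500 gives an MSS of 1460, and a reported next-hop MTU of 1300 gives 1260, so a 1600-byte write would be carried as 1460 + 140 bytes before the update and 1260 + 340 bytes after it.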

Furthermore, I think that the application layer that instantiated the TCP connection and is making use of it may handle such messages as well, and split the data into smaller pieces at a higher level. The application may open a socket that expects ICMP messages and act accordingly when such messages are received. However, splitting data at the application layer is totally transparent to the TCP & IP layers. Note that most applications would let the TCP & IP layers handle this situation by themselves.

However, once an ICMP Fragmentation needed and DF set error message is received by a host, its behavior as dictated by the lower layers is not conclusive.

RFC 5927, section 2.2 refers to RFC 1122, section 4.2.3.9 which states that TCP should abort the connection when an ICMP Fragmentation needed and DF set error message is passed up from the IP layer, since it signifies a hard error condition. The RFC states that the host should implement this behavior, but it is not a must (section 4.2.5). This RFC also states in section 3.2.2.1 that a Destination Unreachable message that is received MUST be reported to the TCP layer. Implementing both of these would result in the destruction of a TCP connection when an ICMP Fragmentation needed and DF set error message is received on that connection, which doesn't make any sense, and is clearly not the desired behavior.

On the other hand, RFC 1191 states this in regard to the required behavior:

RFC 1191 does not outline a specific behavior that is expected from the sending host, because different applications may have different requirements, and different implementation architectures may favor different strategies [This leaves a room for this method-OA].

The only required behavior is that a host must attempt to avoid eliciting more such messages in the near future. A host can either cease setting the Don't Fragment bit in the IP header (and allow fragmentation by the routers along the path) or reduce the datagram size. The better strategy would be to lower the message size, because fragmentation will cause more traffic and consume more Internet resources.

In conclusion, I think that the specification is not definitive in regard to the behavior required from a host upon receipt of an ICMP Fragmentation needed and DF set error message. My guess is that both layers (IP & TCP) are notified of the message in order to update their MTU & MSS values, respectively, and that one of them takes on the responsibility of retransmitting the problematic packet in smaller chunks.

Lastly, regarding your implementation, I think that for full compliance with RFC 1122, you should update the ICMP message to include the IP header of the problematic packet, as well as its next 8 bytes (though you may include more than just the first 8 bytes). Moreover, you should verify that the ICMP message is received before the corresponding ACK for the packet to which that ICMP message refers. In fact, just in order to be on the safe side, I would abolish that ACK altogether.

[Here](http://www.packetlevel.ch/html/scapy/download/pmtu.py) is a sample implementation of how the ICMP message should be built. If sending the ICMP message as a response to one of the TCP packets fails, I suggest you first try sending the ICMP message before even receiving the TCP packet to which it relates, in order to ensure it is received before the ACK. Only if that fails as well, try abolishing the ACK altogether.
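Separately from the sample referred to above, the byte layout of such a message can be sketched with only the Python 3 standard library. This is a minimal illustration, not a drop-in injector: `build_frag_needed` and `internet_checksum` are names invented for this example, the captured datagram is assumed to be a raw IPv4 packet starting at its IP header, and actually sending the result would still require wrapping it in an outer IP header (for example via a raw socket or Scapy).

```python
import struct

def internet_checksum(data):
    """16-bit one's-complement checksum as used by IP and ICMP."""
    if len(data) % 2:
        data += b"\x00"                       # pad to a whole number of 16-bit words
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    while total >> 16:                        # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def build_frag_needed(original_datagram, next_hop_mtu):
    """Build an ICMP type 3, code 4 message quoting the offending datagram."""
    ihl = (original_datagram[0] & 0x0F) * 4   # length of the original IP header
    quoted = original_datagram[:ihl + 8]      # unchanged IP header + first 8 payload bytes
    # type=3, code=4, checksum placeholder, 16 unused bits, 16-bit next-hop MTU
    header = struct.pack("!BBHHH", 3, 4, 0, 0, next_hop_mtu)
    checksum = internet_checksum(header + quoted)
    header = struct.pack("!BBHHH", 3, 4, checksum, 0, next_hop_mtu)
    return header + quoted
```

The quoted bytes are copied verbatim from the captured packet, which is what keeps the embedded addresses, ports and sequence number consistent with the live connection.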

answered Nov 15 '14 at 18:10

Yoel

Yoel, thank you very much. These details were very much needed. If IP handles fragmentation, then all the connections will have the same path MTU decided by the IP layer, and this would create chaos. As far as I know, path MTU is connection oriented and differs between connections. If this is the case, how would TCP extract the next-hop MTU reported by the ICMP packet too big message? – Aaron88 Nov 15 '14 at 19:55
Since the MTU may differ per destination (rather than per connection), it is up to the IP layer to handle it correctly for each destination. I'm unaware of the specific implementation details, but I suspect it stores the last known MTU (only if it differs from the default MTU) of each destination for which a packet was sent in the last X seconds. – Yoel Nov 15 '14 at 20:02
Thank you Yoel for this much needed answer. I agree it makes sense for path MTU discovery to be destination based than connection based. But according to RFC 5297 this is handled by TCP, TCP validates the received ICMP to test whether it is genuine ICMP or forged ICMP and if it is genuine it updates the path MTU according to the next-hop MTU reported by ICMP. – Aaron88 Nov 15 '14 at 20:27
Are you sure it's [RFC 5297](http://tools.ietf.org/html/rfc5297) (Synthetic Initialization Vector (SIV) Authenticated Encryption Using the Advanced Encryption Standard (AES))? – Yoel Nov 15 '14 at 20:30
I've edited my answer to the best of my understanding, with reliance on official specifications, but perhaps the best approach would be to analyze some open source network stack implementation in order to see what's really going on... – Yoel Nov 15 '14 at 22:12
Thank you for your time and thanks for your efforts to make me understand this. I have this problem in demonstrating this. I established TCP connection between server and client, and they are continuously sending data. Meanwhile I inject an ICMP fragmentation needed message towards server but server is still fragmenting the packets at 1500 bytes even though i'm reporting low MTU as 1300. Is it ignoring ICMP or I'm doing something wrong? – Aaron88 Nov 16 '14 at 01:41
I would guess there is something wrong with the injection process. Please update your question to show how you inject packets. – Yoel Nov 16 '14 at 06:34
Hi Yoel, I have edited my question and added all the details. Please have a look. – Aaron88 Nov 16 '14 at 19:16
Thank you for providing all the details. Do you know how would I make ICMP to include IP and include first 8 bytes of datagram (source port and destination port). If I have understood what you said: IP + (ICMP+ip+First8Bytes). Where IP is what router generating. ip-what router received from host. First8Bytes-source port and destination port. Is that what you meant? – Aaron88 Nov 17 '14 at 01:16
I have implemented whatever you have said so far. Do you know how to suppress the acknowledgement? I would really appreciate your help. – Aaron88 Nov 17 '14 at 04:23
[Here](http://www.packetlevel.ch/html/scapy/download/pmtu.py) is a sample implementation of how the ICMP message should be built. Note that the first 8 bytes of the datagram's data span over more than just the ports and that you may include more than 8 bytes. If sending the ICMP message as a response to one of the TCP packets fails, I suggest you try sending the ICMP message before even receiving the TCP packet to which it relates, in order to assure it is received before the ACK. To be honest, I'm not sure that's mandatory. Only if that fails, try abolishing the ACK altogether. – Yoel Nov 17 '14 at 06:17
I have implemented everything and its still not working. I have posted a new question with all my implementation details here http://stackoverflow.com/questions/27027206/how-to-build-forged-icmp-destination-unreachable-type-3-code-4-packet. – Aaron88 Nov 19 '14 at 21:25
I've taken a look at your implementation. As I suggested in my previous comment and in my answer as well, I would try setting the sequence number of the original TCP header that is encapsulated in the ICMP error message (i.e. `tcp_orig.seq`) since the specification requires that at least 8 bytes of the problematic packet's IP layer payload are included in the ICMP error message. – Yoel Nov 19 '14 at 23:08
I did use the sequence number as well, sorry I didn't mention it in my question. I used tcp_orig.seq=1, because that's the sequence number I received at the client when I checked in Wireshark. I doubt the sequence number is really 1. Is the sequence number of the first message always 1? – Aaron88 Nov 20 '14 at 02:26
I got one step closer: I found a way to find the real sequence number, but it seems like it is still not working. Does the IP header need to be exactly the same as what I have received, or is just changing the src IP and dst IP OK? – Aaron88 Nov 20 '14 at 02:52
I think it should be exactly as it was received. At least that's what the specification requires. – Yoel Nov 20 '14 at 06:38
I'm setting path MTU discovery on both server and client using: `sysctl -w net.ipv4.ip_no_pmtu_disc=1`. But whenever I capture the packets exchanged between the two, the IP header has no DF bit set. This might be causing the systems to not respond to the ICMP fragmentation needed message. What's your take on this? – Aaron88 Nov 21 '14 at 00:37
Actually, with `sysctl -w net.ipv4.ip_no_pmtu_disc=1`, you're not [enabling path MTU discovery](https://www.frozentux.net/ipsysctl-tutorial/chunkyhtml/variablereference.html#AEN276), but rather disabling it and that's why the `DF` bit is unset. Note the `no` in `ip_no_pmtu_disc`. Try setting it to 0. On an unrelated note, verify that there isn't any firewall or `iptables` rule that blocks ICMP messages. – Yoel Nov 21 '14 at 09:36
Enabled path MTU as you mentioned above but it is still now working. Thanks for all your help Yoel. I really appreciate. – Aaron88 Nov 23 '14 at 03:20
Awesome, I'll try writing a decent answer to your other question as well for future reference. Please consider [accepting](http://meta.stackexchange.com/q/5234) these answers by clicking the check-mark that is to the left of them. Note that there is absolutely no obligation to do this. – Yoel Nov 23 '14 at 09:02
It finally worked. Thanks for helping me. By the way, I don't know how to accept the answer; I tried to vote up but it shows I need a minimum of 15 reputation. – Aaron88 Nov 25 '14 at 22:50

score 1 · Answer 2 · answered Aug 04 '15 at 09:12

The way I understand it, the host receives an "ICMP Fragmentation needed and DF set" message, but the message can come from an intermediate device (router) in the path, so the host can't directly match the ICMP response to a current session; the ICMP message only contains the destination IP and the MTU limit.

The host then adds an entry to the routing table for the destination IP that records the route and MTU, with an expiry of 10 minutes.

This can be observed on Linux by asking for the specific route with `ip route get x.x.x.x` after doing a tracepath or ping that triggers the ICMP response.

$ ip route get 10.x.y.z
10.x.y.z via 10.a.b.1 dev eth0  src 10.a.b.100 
    cache  expires 598sec mtu 1300
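
If you want to check this from a script, the cached values can be read out of that output with a couple of regular expressions. The sketch below is only an assumption-laden convenience: `parse_route_cache` is a hypothetical helper, and it relies on the `expires ...sec` and `mtu ...` tokens appearing as in the sample above, which may differ between iproute2 versions.

```python
import re

def parse_route_cache(output):
    """Return (expires_seconds, mtu) found in `ip route get` text; None where absent."""
    expires = re.search(r"\bexpires\s+(\d+)sec", output)
    mtu = re.search(r"\bmtu\s+(\d+)", output)
    return (int(expires.group(1)) if expires else None,
            int(mtu.group(1)) if mtu else None)

sample = """10.x.y.z via 10.a.b.1 dev eth0  src 10.a.b.100
    cache  expires 598sec mtu 1300"""
print(parse_route_cache(sample))
```

A result with no MTU value would suggest that no path MTU has been cached for that destination, for instance because the ICMP message was dropped or ignored.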
