
At the moment I'm trying to write a quick Python program that reads in a .pcap file and writes out data about the various sessions that are stored in there.

The information I write out includes srcip, dstip, srcport and dstport etc.

However, even for a fairly small pcap this takes a lot of memory and runs for a very long time. We're talking 8GB+ of memory used for a 212MB pcap.

As usual I guess there might be a more efficient way of doing this that I'm just unaware of.

Here is a quick skeleton of my code - no vital parts missing.

import socket
from scapy.all import *


edges_file = "edges.csv"
pcap_file = "tcpdump.pcap"

try:
    print '[+] Reading and parsing pcap file: %s' % pcap_file
    a = rdpcap(pcap_file)

except Exception as e:
    print 'Something went wrong while opening/reading the pcap file.' \
          '\n\nThe error message is: %s' % e
    exit(0)

sessions = a.sessions()

print '[+] Writing to edges.csv'
f1 = open(edges_file, 'w')
f1.write('source,target,protocol,sourceport,destinationport,'
         'num_of_packets\n')
for k, v in sessions.iteritems():

    tot_packets = len(v)

    if "UDP" in k:
        proto, source, flurp, target = k.split()
        srcip, srcport = source.split(":")
        dstip, dstport = target.split(":")
        f1.write('%s,%s,%s,%s,%s,%s\n' % (srcip, dstip, proto, srcport,
                                          dstport, tot_packets))
        continue

    elif "TCP" in k:
        proto, source, flurp, target = k.split()
        srcip, srcport = source.split(":")
        dstip, dstport = target.split(":")
        f1.write('%s,%s,%s,%s,%s,%s\n' % (srcip, dstip, proto, srcport,
                                          dstport, tot_packets))
        continue

    elif "ICMP" in k:
        continue  # Not bothered about ICMP right now

    else:
        continue  # Or any other 'weird' packages for that matter ;)

print '[+] Closing the edges file'
f1.close()

As always - grateful for any assistance.

Swedish Mike
  • may be open/close file inside loop, with append flag – YOU Feb 10 '16 at 10:27
  • I'm sorry but I'm not following you here? Are you referring to the edges.csv file being opened/closed for each iteration in the `for k,v in sessions.iteritems()` loop? What benefit would that have? – Swedish Mike Feb 10 '16 at 10:32
  • I am guessing that f.write holds on to the memory until the file is closed – YOU Feb 10 '16 at 10:34
  • I don't think that's the issue - I removed all the bits and pieces in regards to that and just did the `a = rdpcap(pcap_file)` portion and that resulted in the massive memory usage too. I might have to look at reading the file 'line by line' and losing the ability to group it by sessions. – Swedish Mike Feb 10 '16 at 10:48
  • It seems to be the rdpcap function that generates the mahoosive memory usage. I've played around with different sizes of snaplength on the tcpdump that generates the .pcap file and running `-s 15` makes the file quite a bit smaller. That in turn uses up less memory. – Swedish Mike Feb 10 '16 at 11:20
  • oh i see, good to know that. – YOU Feb 10 '16 at 12:08

1 Answer


I know I'm late to the party, but hopefully this will be useful to future visitors.

rdpcap() dissects the entire pcap file and retains an in-memory representation of each and every packet, which explains why it eats up a lot of memory.
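For contrast, here is a minimal sketch of reading packets lazily instead. It assumes Scapy's `PcapReader` class, which can be iterated over one packet at a time; `iter_packets` and `count_packets` are hypothetical helper names, and the snippet targets Python 3. Note that this only avoids holding every packet in memory; it does not group packets into sessions.

```python
def iter_packets(pcap_path):
    """Yield packets from a pcap one at a time instead of loading them all."""
    # Imported lazily so the counting helper below works without Scapy.
    from scapy.all import PcapReader

    with PcapReader(pcap_path) as reader:
        for pkt in reader:
            yield pkt


def count_packets(packets):
    """Count items in any iterable without keeping them around."""
    return sum(1 for _ in packets)


# Usage (assumes a capture file exists at this path):
# print(count_packets(iter_packets("tcpdump.pcap")))
```

Because each packet goes out of scope once the loop moves on, memory use should stay roughly flat regardless of the capture size.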

As far as I am aware (I am a novice Scapy user myself), the only two ways you can invoke Scapy's session reassembly are:

  1. By calling scapy.plist.PacketList.sessions(). This is what you're currently doing (rdpcap(pcap_file) returns a scapy.plist.PacketList).
  2. By reading the pcap using sniff() in offline mode while also providing the function with a session decoder implementation. For example, for TCP reassembly you'd do sniff(offline='stackoverflow.pcap', session=TCPSession). (This was added in Scapy 2.4.3).

Option 1 is obviously a dead end (as it requires that we keep all packets of all sessions in memory at one time), so let's explore option 2...

Let's launch Scapy in interactive mode to access the documentation for sniff():

$ scapy
>>> help(sniff)

Help on function sniff in module scapy.sendrecv:

sniff(*args, **kwargs)
    Sniff packets and return a list of packets.
    
    Args:
        count: number of packets to capture. 0 means infinity.
        store: whether to store sniffed packets or discard them
        prn: function to apply to each packet. If something is returned, it
             is displayed.
             --Ex: prn = lambda x: x.summary()
        session: a session = a flow decoder used to handle stream of packets.
                 e.g: IPSession (to defragment on-the-flow) or NetflowSession
        filter: BPF filter to apply.
        lfilter: Python function applied to each packet to determine if
                 further action may be done.
                 --Ex: lfilter = lambda x: x.haslayer(Padding)
        offline: PCAP file (or list of PCAP files) to read packets from,
                 instead of sniffing them
        timeout: stop sniffing after a given time (default: None).
        L2socket: use the provided L2socket (default: use conf.L2listen).
        opened_socket: provide an object (or a list of objects) ready to use
                      .recv() on.
        stop_filter: Python function applied to each packet to determine if
                     we have to stop the capture after this packet.
                     --Ex: stop_filter = lambda x: x.haslayer(TCP)
        iface: interface or list of interfaces (default: None for sniffing
               on all interfaces).
        monitor: use monitor mode. May not be available on all OS
        started_callback: called as soon as the sniffer starts sniffing
                          (default: None).
    
    The iface, offline and opened_socket parameters can be either an
    element, a list of elements, or a dict object mapping an element to a
    label (see examples below).

Notice the store parameter. We can set this to False to make sniff() operate in a streamed fashion (read a single packet, process it, then release it from memory):

sniff(offline='stackoverflow.pcap', session=TCPSession, store=False)

I just tested this with a 193 MB pcap. For store=True (default value), this eats up about 1.7 GB of memory on my system (macOS), but only approximately 47 MB when store=False.
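If all that is needed is the per-flow packet count from the question, one possible approach is to keep only a counter per flow while streaming. The following is a sketch under assumptions, not a tested replacement for `sessions()`: `flow_key`, `write_edges` and `count_flows` are hypothetical names, only IPv4 TCP/UDP traffic is counted, no `session=` decoder is used, and the code is written for Python 3.

```python
import csv
from collections import Counter


def flow_key(proto, src, sport, dst, dport):
    """Build a hashable key for one direction of a flow."""
    return (src, dst, proto, sport, dport)


def write_edges(counts, out_path):
    """Write one CSV row per flow, using the question's column order."""
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["source", "target", "protocol", "sourceport",
                         "destinationport", "num_of_packets"])
        for key, n in counts.items():
            writer.writerow(list(key) + [n])


def count_flows(pcap_path):
    """Stream a pcap and count packets per flow without storing packets."""
    # Imported lazily so the helpers above work without Scapy installed.
    from scapy.all import sniff, IP, TCP, UDP

    counts = Counter()

    def on_packet(pkt):
        if IP not in pkt:
            return  # this sketch skips anything that is not IPv4
        for layer, name in ((TCP, "TCP"), (UDP, "UDP")):
            if layer in pkt:
                key = flow_key(name, pkt[IP].src, pkt[layer].sport,
                               pkt[IP].dst, pkt[layer].dport)
                counts[key] += 1
                return  # returning None keeps sniff() from printing anything

    sniff(offline=pcap_path, prn=on_packet, store=False)
    return counts


# Usage (assumes a capture file exists at this path):
# write_edges(count_flows("tcpdump.pcap"), "edges.csv")
```

With this design, memory grows with the number of distinct flows rather than with the number of packets.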

Processing the reassembled TCP sessions (open question)

So we managed to reduce our memory footprint - great! But how do we process the (supposedly) reassembled TCP sessions? The usage instructions indicate that we should use the prn parameter of sniff() to specify a callback function that will then be invoked with the reassembled TCP session (emphasis mine):

sniff() also provides Sessions, that allows to dissect a flow of packets seamlessly. For instance, you may want your sniff(prn=...) function to automatically defragment IP packets, before executing the prn.

The example is in the context of IP fragmentation, but I'd expect the TCP analog to be to group all packets of a session and then invoke prn once for each session. Unfortunately, that's not how it works: I tried this on my example pcap, and the callback is invoked once for every packet, exactly as indicated in sniff()'s documentation shown above.

The usage instructions linked above also state the following about using session=TCPSession in sniff():

TCPSession -> defragment certain TCP protocols*. Only HTTP 1.0 currently uses this functionality.

With the output of the experiment above in mind, I now interpret this to mean that whenever Scapy finds an HTTP (1.0) request/response that spans multiple TCP segments, it creates a single packet whose payload is the merged payload of those TCP segments (which in total is the full HTTP request/response). I'd appreciate it if anyone can help clarify the meaning of the above quote on TCPSession or, even better, clarify whether TCP reassembly is indeed possible this way and I'm just misunderstanding the API.
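To make that interpretation concrete, here is a toy sketch of the idea in plain Python. It is not Scapy's implementation and uses none of its API: `reassemble` and `http_message_complete` are hypothetical names, and the sketch ignores retransmissions, overlapping segments and sequence-number wraparound.

```python
def reassemble(segments):
    """Join TCP payloads in sequence-number order.

    segments: iterable of (seq, payload_bytes) pairs for one direction
    of one connection. Stops at the first gap or overlap.
    """
    stream = b""
    next_seq = None
    for seq, payload in sorted(segments):
        if next_seq is None:
            next_seq = seq
        if seq != next_seq:
            break  # missing or overlapping data: a real decoder would wait
        stream += payload
        next_seq += len(payload)
    return stream


def http_message_complete(stream):
    """Return True once the headers and Content-Length body bytes are present."""
    head, sep, body = stream.partition(b"\r\n\r\n")
    if not sep:
        return False  # header block has not fully arrived yet
    length = 0
    for line in head.split(b"\r\n")[1:]:
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"content-length":
            length = int(value.strip())
    return len(body) >= length
```

Under this reading, a decoder would keep buffering segments for a connection until a check like `http_message_complete` succeeds, and only then hand a single merged packet to the callback.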

Janus Varmarken
  • `TCPSession` is very specific. It only does something if the layer you are dissecting (for instance, an HTTP packet), has a `tcp_reassemble` special function. For each packet that comes in, this function will be called with the reassembled TCP byte stream (using the sequence numbers), meaning the total data sent on this session (identified by port + IP). The `tcp_reassemble` is then supposed to tell if the packet is full or if there are still TCP fragments missing, and either build the packet and return it or wait for them. – Cukic0d Aug 07 '20 at 09:44
  • @Cukic0d Thank you very much for this insight. Sounds like it's actually more application layer reassembly than TCP reassembly. Is the session _only_ identified by `(src_ip, dst_ip, src_port, dst_port)`, or are you simplifying things? That doesn't account for corner cases, e.g., if the client by chance chooses the same ephemeral port number for two (temporally non-overlapping) connections to the same server. – Janus Varmarken Aug 07 '20 at 16:58
  • I've never heard of such a corner case. A TCP client is supposed to randomize the source port, this shouldn't happen. – Cukic0d Aug 07 '20 at 23:13