
I have to read and parse .pcap files that are too large to load into memory. I am currently using sniff in offline mode

sniff(offline=file_in, prn=customAction, store=0)

with a customAction function that looks roughly like this:

def customAction(packet):
    global COUNT
    COUNT = COUNT + 1
    # do some other stuff that takes practically 0 time

Currently this processes packets too slowly. I am already using subprocess in a 'driver' program to run this script on multiple files simultaneously on different cores, but I really need to improve single-core performance.
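A driver along these lines (the file list and invocation are assumptions, since the actual driver isn't shown; the script name parser-3.py comes from the profile below) launches one worker per file:

    # Hypothetical driver sketch: one subprocess per .pcap file, so several
    # files are parsed in parallel on different cores.
    import subprocess, sys

    procs = [subprocess.Popen([sys.executable, 'parser-3.py', f])
             for f in sys.argv[1:]]        # .pcap paths passed on the command line
    for p in procs:
        p.wait()                           # block until every worker finishes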

I tried using pypy and was disappointed that performance using pypy was less than 10% better than using python3 (Anaconda).

Average time to run 50k packets using pypy is 52.54 seconds

Average time to run 50k packets using python3 is 56.93 seconds

Is there any way to speed things up?

EDIT: Below is the result of cProfile. As you can see, the code is a bit slower while being profiled, but all of the time is spent doing things in scapy.
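(For reference, a profile like this can be collected with cProfile's command-line interface, e.g.:

    python3 -m cProfile -s cumtime parser-3.py file.pcap

where file.pcap stands in for the actual capture.)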

66054791 function calls (61851423 primitive calls) in 85.482 seconds

Ordered by: cumulative time

ncalls            tottime  percall  cumtime  percall filename:lineno(function)
957/1             0.017    0.000    85.483   85.483  {built-in method builtins.exec}
    1             0.001    0.001    85.483   85.483  parser-3.py:1(<module>)
    1             0.336    0.336    83.039   83.039  sendrecv.py:542(sniff)
50001             0.075    0.000    81.693    0.002  utils.py:817(recv)
50001             0.379    0.000    81.618    0.002  utils.py:794(read_packet)
795097/50003      3.937    0.000    80.140    0.002  base_classes.py:195(__call__)
397549/50003      6.467    0.000    79.543    0.002  packet.py:70(__init__)
397545/50000      1.475    0.000    76.451    0.002  packet.py:616(dissect)
397397/50000      0.817    0.000    74.002    0.001  packet.py:598(do_dissect_payload)
397545/200039     6.908    0.000    49.511    0.000  packet.py:580(do_dissect)
199083            0.806    0.000    32.319    0.000  dns.py:144(getfield)
104043            1.023    0.000    22.996    0.000  dns.py:127(decodeRR)
397548            0.343    0.000    15.059    0.000  packet.py:99(init_fields)
397549            6.043    0.000    14.716    0.000  packet.py:102(do_init_fields)
6673299/6311213   6.832    0.000    13.259    0.000  packet.py:215(__setattr__)
3099782/3095902   5.785    0.000    8.197    0.000  copy.py:137(deepcopy)
3746538/2335718   4.181    0.000    6.980    0.000  packet.py:199(setfieldval)
149866            1.885    0.000    6.678    0.000  packet.py:629(guess_payload_class)
738212            5.730    0.000    6.311    0.000  fields.py:675(getfield)
1756450           3.393    0.000    5.521    0.000  fields.py:78(getfield)
49775             0.200    0.000    5.401    0.000  dns.py:170(decodeRR)
1632614           2.275    0.000    4.591    0.000  packet.py:191(__getattr__)
985050/985037     1.720    0.000    4.229    0.000  {built-in method builtins.hasattr}
326681/194989     0.965    0.000    2.876    0.000  packet.py:122(add_payload)
...

EDIT 2: Full code example:

from scapy.all import *
from scapy.utils import PcapReader
import os, time, sys, logging


COUNT    = 0
def customAction(packet):
    global COUNT
    COUNT = COUNT + 1

file_temp = sys.argv[1]
path      = '/'.join(file_temp.split('/')[:-2])
file_in   = '/'.join(file_temp.split('/')[-2:])
name      = file_temp.split('/')[-1:][0].split('.')[0]


os.chdir(path)
q_output_file = 'processed/q_' + name + '.csv'
a_output_file = 'processed/a_' + name + '.csv'
log_file      = 'log/' + name + '.log'

logging.basicConfig(filename=log_file, level=logging.DEBUG)

t0=time.time()
sniff(offline=file_in, prn=customAction, lfilter=lambda x:x.haslayer(DNS), store=0)
t1=time.time()

logging.info("File '{}' took {:.2f} seconds to parse {} packets.".format(name, t1-t0, COUNT))
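Run it as, for example (the path is hypothetical; note that the script chdirs two path components up from the file, so the processed/ and log/ directories must already exist there):

    python3 parser-3.py /data/captures/sample.pcap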
  • According to https://gist.github.com/dpifke/2244911 defaultdict should not be the reason that pypy is slower than python3. By process of elimination it seems that the obvious culprit is scapy + pypy. – deltap Jul 26 '16 at 23:04
  • Can you provide a concrete example for PyPy? If it is really slower (even for large files) it might be because of some missing optimization inside PyPy itself, which we could fix. – Armin Rigo Jul 27 '16 at 17:14
  • After some code tweaking I can now eke out a 10% improvement in execution speed using pypy. – deltap Jul 27 '16 at 21:15
  • @ArminRigo After some more testing and benchmarking I've determined that the code run in the customAction has practically no impact on the timing for pypy or for regular python3. I checked this by stripping it out and simply returning a counter of how many packets were processed. The culprit is the sniff() call in scapy. Whatever code lives behind that call doesn't do well with pypy. – deltap Jul 27 '16 at 21:45
  • Have you run this through a profiler to see where you're _actually_ spending time? – MatsLindh Jul 27 '16 at 21:49
  • @MatsLindh I have used the time library to implement timing. All of the time is being spent calling sniff(). Within sniff() I do not know where the bulk of the time is spent. The source for sniff() seems to be here: https://github.com/secdev/scapy/blob/81f10d9bb0e5360fb24366bb0813c5cf4c51c74c/scapy/sendrecv.py – deltap Jul 27 '16 at 21:55
  • Can you provide a concrete example? Like, a loop that calls sniff() with the same dummy arguments for about ten seconds, where sniff() can be installed from the github package. For us, it would be useful as a starting point to have something that runs. – Armin Rigo Jul 28 '16 at 21:39
  • I've provided example code; you need to find a .pcap file online or generate one. – deltap Jul 29 '16 at 00:11

2 Answers


It seems that scapy causes PyPy's JIT warm-up times to be high, but the JIT is still working if you run for long enough. Here are the results I got (on Linux 64):

size of .pcap        CPython time        PyPy time
2MB                  4.9s                7.3s
5MB                  15.3s               9.1s
15MB                 1m15s               21s
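Given that, one way to amortize the warm-up cost is to keep a single PyPy process alive and feed it many files, instead of launching one subprocess per file. A minimal sketch along those lines (the per-file handler mirrors the question's code; the file list is an assumption):

    # Sketch: pay the JIT warm-up once, then reuse the hot code for every file.
    import sys
    from scapy.all import sniff, DNS

    def count_dns(path):
        counter = [0]
        def on_packet(pkt):
            counter[0] += 1
        sniff(offline=path, prn=on_packet,
              lfilter=lambda p: p.haslayer(DNS), store=0)
        return counter[0]

    for path in sys.argv[1:]:   # every file after the first benefits from the warm JIT
        print(path, count_dns(path))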
  • How many packets are in those files? My files with 50,000 packets are approximately 8MB and my run times are significantly longer than yours. – deltap Jul 29 '16 at 16:00
  • I've seen numbers in the logs closer to a dozen "packets". Again, it means that you're not telling us everything we need to know. Please provide a complete example; it can be some public .pcap file available on the internet. – Armin Rigo Jul 29 '16 at 19:51

I think that the short answer is that Scapy is just slow as hell. I tried just scanning a pcap file with sniff() or PcapReader, and not doing anything with the packets. The process was reading less than 3MB/s from my SSD, and the CPU usage was 100%. There are other pcap reader libraries for Python out there. I'd suggest experimenting with one of those.
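One option is dpkt; here is a minimal sketch of counting DNS traffic with it (treat it as an assumption that dpkt is installed and that the capture uses Ethernet framing; the file name is hypothetical):

    # Sketch with dpkt: count UDP port-53 (DNS) packets in a capture file.
    import dpkt

    count = 0
    with open('capture.pcap', 'rb') as f:
        for ts, buf in dpkt.pcap.Reader(f):    # yields (timestamp, raw bytes)
            eth = dpkt.ethernet.Ethernet(buf)
            ip = eth.data
            if isinstance(ip, dpkt.ip.IP) and isinstance(ip.data, dpkt.udp.UDP):
                udp = ip.data
                if 53 in (udp.sport, udp.dport):
                    count += 1
    print(count)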