
I need to read about 600 pcap files, each about 100MB. I use gopacket to load each pcap file and check it.

Case 1: uses 1 goroutine to check.

Case 2: uses 40 goroutines to check.

I found that the time consumed by case 1 and case 2 is similar. The difference is that CPU usage of case 1 is only about 200%, while case 2 can reach 3000%. My question is: why don't multiple goroutines improve performance? There are some comments in the code; I hope they help.

package main

import (
    "flag"
    "fmt"
    "io/ioutil"
    "log"
    "os"
    "strings"
    "sync"

    "github.com/google/gopacket"
    "github.com/google/gopacket/layers"
    "github.com/google/gopacket/pcap"
)

func main() {
    var wg sync.WaitGroup

    var dir = flag.String("dir", "../pcap", "input dir")
    var threadNum = flag.Int("threads", 40, "input thread number")
    flag.Parse()
    fmt.Printf("dir=%s, threadNum=%d\n", *dir, *threadNum)

    pcapFileList, err := ioutil.ReadDir(*dir)
    if err != nil {
        panic(err)
    }

    log.Printf("start. file number=%d.", len(pcapFileList))

    fileNumPerRoutine := len(pcapFileList) / *threadNum
    lastFileNum := len(pcapFileList) % *threadNum

    // split the files across goroutines;
    // each goroutine only processes the files that belong to it
    if fileNumPerRoutine > 0 {
        for i := 0; i < *threadNum; i++ {
            start := fileNumPerRoutine * i
            end := fileNumPerRoutine * (i + 1)
            if lastFileNum > 0 && i == (*threadNum-1) {
                end = len(pcapFileList)
            }
            // fmt.Printf("start=%d, end=%d\n", start, end)
            wg.Add(1)
            go checkPcapRoutine(i, &wg, dir, pcapFileList[start:end])
        }
    }

    wg.Wait()
    log.Printf("end.")
}

func checkPcapRoutine(id int, wg *sync.WaitGroup, dir *string, pcapFileList []os.FileInfo) {
    defer wg.Done()

    for _, p := range pcapFileList {
        if !strings.HasSuffix(p.Name(), "pcap") {
            continue
        }
        pcapFile := *dir + "/" + p.Name()
        log.Printf("checkPcapRoutine(%d): process %s.", id, pcapFile)

        handle, err := pcap.OpenOffline(pcapFile)
        if err != nil {
            log.Printf("error=%s.", err)
            return
        }

        packetSource := gopacket.NewPacketSource(handle, handle.LinkType())

        // Per my test, if I don't parse the packets it is very fast, even with only 1 goroutine, so I/O should not be the bottleneck.
        // What puzzles me is that every goroutine has its own packets and each goroutine is independent, but it still seems to be processed serially.
        // This is the first time I use gopacket; maybe I used a wrong parameter?
        j := 0 // packet index within the current file
        for packet := range packetSource.Packets() {
            j++
            gtpLayer := packet.Layer(layers.LayerTypeGTPv1U)
            lays := packet.Layers()
            outerIPLayer := lays[1]
            outerIP := outerIPLayer.(*layers.IPv4)

            if gtpLayer == nil && (outerIP.Flags&layers.IPv4MoreFragments != 0) && outerIP.Length < 56 {
                log.Panicf("file:%s, idx=%d may leakage.", pcapFile, j)
            }
        }
        // Close explicitly: a defer inside a loop runs only when the
        // function returns, which would keep every handle open at once.
        handle.Close()
    }
}
Jacky
    Before trying to parallelize something, it's a good idea to know what the bottleneck is. In your case, it's likely reading from disk. If so, attempting to read files in parallel won't help. Profile your program first, then figure out what should be optimized. – Marc Aug 20 '20 at 09:28
  • Thanks for your reply. I tested on a physical RedHat server; each pcap file is only 100MB, so disk IO should not be the bottleneck. – Jacky Aug 20 '20 at 09:40
  • Did you profile your code? Did you figure out what was the bottleneck? If not, please do so. – Marc Aug 20 '20 at 09:42

1 Answer


To run two or more tasks in parallel, the operations that carry out those tasks must not depend on each other, or on external resources that the tasks would have to share.

In the real world, tasks which are truly and completely independent are rare (so rare that there is a dedicated name for this class of tasks: they are said to be embarrassingly parallel). But when the tasks' dependency on each other's progress, and their contention for shared resources, is below some threshold, adding more "workers" (goroutines) may improve the total time it takes to complete a set of tasks.

Notice "may" here: for instance, your storage device, the file system on it, and the kernel data structures and code that work with both form a shared medium which all your goroutines have to access. This medium has a certain limit on both its throughput and its latency; basically, you can only read some M bytes per second from it, and whether you have a single reader fully utilizing this bandwidth, or N readers each utilizing around M/N of it, does not matter: you physically cannot read faster than that limit of M bytes per second.

Moreover, most resources found in the real world degrade in performance when contended for: if a resource has to be locked to be accessed, then the more accessors actively trying to take the lock, the more CPU time is spent in the lock-management code. (When the resource is more complicated, such as the conglomerate of intricate machinery that "a file system on a storage device, all managed by the kernel" is, analyzing how it degrades under concurrent access becomes far harder.)

TL;DR

I can make an educated guess that your task is simply I/O-bound as the goroutines have to read the files.

You can verify that by modifying the code to first fetch all the files into memory and then hand the buffers to the parsing goroutines.

The drastic CPU usage you're observing is a red herring: contemporary systems take 100% CPU utilization to mean "full utilization of a single hardware processing thread", so if you have, say, 4 CPU cores with HyperThreading™ (or AMD's equivalent) enabled, the full capacity of your system is 4×2 = 8, or 800%.
The fact that you may be seeing more than the theoretical capacity (which we do not know) may be explained by the system presenting so-called "starvation" that way: you have many software threads wanting to be executed but waiting for their CPU time, and the system shows that as outlandish CPU utilization.

kostix