
I am a new-grad SWE learning Go (and loving it).

I am building a parser for Wikipedia dump files - each basically a huge bzip2-compressed XML file (~50 GB uncompressed).

I want to do both streaming decompression and parsing, which sounds simple enough. For decompression, I do:

inputFilePath := flag.Arg(0)
inputFile, err := os.Open(inputFilePath) // open the compressed dump (error handling omitted)
inputReader := bzip2.NewReader(inputFile)

And then pass the reader to the XML parser:

decoder := xml.NewDecoder(inputReader)

However, since both decompressing and parsing are expensive operations, I would like to run them on separate goroutines to make use of additional cores. How would I go about doing this in Go?

The only thing I can think of is wrapping the file in a chan []byte and implementing the io.Reader interface on top of it, but I presume there might be a built-in (and cleaner) way of doing it.
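For reference, here is a rough sketch of what I mean (chanReader is just a name I made up; one goroutine would send decompressed chunks into the channel, and the reader side would feed the XML decoder):

package main

import (
    "bytes"
    "fmt"
    "io"
)

// chanReader adapts a chan []byte to the io.Reader interface.
type chanReader struct {
    ch  <-chan []byte
    buf []byte
}

func (r *chanReader) Read(p []byte) (int, error) {
    if len(r.buf) == 0 {
        chunk, ok := <-r.ch
        if !ok {
            return 0, io.EOF // channel closed: no more data
        }
        r.buf = chunk
    }
    n := copy(p, r.buf)
    r.buf = r.buf[n:]
    return n, nil
}

func main() {
    ch := make(chan []byte)
    go func() {
        // this goroutine stands in for the decompressor
        for _, s := range []string{"hello ", "world"} {
            ch <- []byte(s)
        }
        close(ch)
    }()

    var buf bytes.Buffer
    io.Copy(&buf, &chanReader{ch: ch})
    fmt.Println(buf.String()) // prints "hello world"
}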

Has anyone ever done something like this?

Thanks! Manuel

Manuel Menzella

2 Answers


You can use io.Pipe, then use io.Copy to push the decompressed data into the pipe, and read it in another goroutine:

package main

import (
    "bytes"
    "encoding/json"
    "fmt"
    "io"
    "sync"
)

func main() {

    rawJson := []byte(`{
            "Foo": {
                "Bar": "Baz"
            }
        }`)

    bzip2Reader := bytes.NewReader(rawJson) // this stands in for the bzip2.NewReader

    var wg sync.WaitGroup
    wg.Add(2)

    r, w := io.Pipe()

    go func() {
        // write everything into the pipe. Decompression happens in this goroutine.
        io.Copy(w, bzip2Reader)
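        // close the write end so the decoder on the other side sees EOF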
        w.Close()
        wg.Done()
    }()

    decoder := json.NewDecoder(r)

    go func() {
        for {
            t, err := decoder.Token()
            if err != nil {
                break
            }
            fmt.Println(t)
        }
        wg.Done()
    }()

    wg.Wait()
}

http://play.golang.org/p/fXLnfnaWYA
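Adapted to the bzip2 + XML case from the question, the same pattern would look roughly like this (a sketch, with error handling kept minimal):

package main

import (
    "compress/bzip2"
    "encoding/xml"
    "fmt"
    "io"
    "os"
)

func main() {
    inputFile, err := os.Open(os.Args[1])
    if err != nil {
        panic(err)
    }
    defer inputFile.Close()

    bzip2Reader := bzip2.NewReader(inputFile)

    r, w := io.Pipe()
    go func() {
        // decompression runs in this goroutine
        _, err := io.Copy(w, bzip2Reader)
        w.CloseWithError(err) // a nil err simply closes the pipe, so the decoder sees EOF
    }()

    // parsing runs in the main goroutine
    decoder := xml.NewDecoder(r)
    for {
        t, err := decoder.Token()
        if err == io.EOF {
            break
        }
        if err != nil {
            panic(err)
        }
        _ = t // process the token here
    }
    fmt.Println("done")
}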

user1431317
  • This is exactly what I needed, thanks! Unfortunately, it seems the performance of the standard lib bzip2 decompressor is not great, so it is still the limiting factor. I may switch to this compressor: https://godoc.org/github.com/dsnet/compress/bzip2 However, it is still about 33% slower than something like pbzip2. – Manuel Menzella Mar 26 '16 at 09:27
  • How much of a speed-up did you get in the end, @ManuelMenzella? I like the look of this code - it seems like it should work, but in my testing it's only marginally faster than doing everything single-threaded (67 sec vs 72 sec on 1M records). Any idea what I could be doing wrong, @user1431317? – EM0 Jul 12 '17 at 20:07
  • Maybe it's still limited by how fast the bzip2 decompression can feed it data, and the xml decoding isn't taking up that much cpu power. The pipe probably adds some overhead, although io.Copy does have optimizations for when either end implements io.WriterTo/io.ReaderFrom. It's possible that it's allocating a lot of small temporary buffers, and that that's causing too much garbage. Maybe a buffered reader or writer would help. You should profile your app (both cpu and mem profile - mem profile can help you find lots of unnecessary allocations). – user1431317 Jul 13 '17 at 16:24
  • The bzip2 decompressor and parser take about the same amount of time - I tested that by running them separately. I've also tried buffering everything imaginable and it hasn't helped. Well, everything except the io.Pipe and I suspect this *could* be the problem. I've posted a separate question about that: https://stackoverflow.com/questions/45089248/buffered-version-of-go-io-pipe – EM0 Jul 13 '17 at 19:16
  • Buffering the PipeReader and PipeWriter didn't help, either. Hmm... Have you (or has anyone else) tried this and found a significant speed-up over the single-threaded version? – EM0 Jul 13 '17 at 20:08
  • io.Copy actually uses a 32k buffer. Until that buffer is full, it will block waiting for the bzip2 reader. Perhaps the problem is that that buffer is too big. You want to feed the xml decoder with data as soon as it is available. Here's a simulated example that uses a tiny buffer and actually shows a large improvement: https://play.golang.org/p/lkwwBBR8jx Of course you don't want the buffer to be too small, because that adds too much overhead (switching between goroutines too often). The example uses a tiny json, so you'll have to tune the buffer size to your actual data. – user1431317 Jul 13 '17 at 21:24
  • I tried that, but a smaller buffer makes it slower. 32K seems to be about optimal. Interestingly, the standard library bzip2 package is slower than the dsnet one on a single thread, but faster when using 2 goroutines like this! The stdlib one takes 44.6 seconds single-threaded, 34.6 with goroutines. The dsnet one takes 40.7 seconds single-threaded, 38.6 with goroutines. (Tested this multiple times.) – EM0 Jul 14 '17 at 11:34
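A small sketch of the buffer-size tuning discussed in the comments above, using io.CopyBuffer in place of io.Copy (the struct wrapper is only there so the stand-in bytes.Reader does not bypass the buffer via its WriteTo method; a real bzip2 reader would not need it):

package main

import (
    "bytes"
    "io"
    "os"
)

func main() {
    // stand-in for the bzip2 reader
    src := struct{ io.Reader }{bytes.NewReader([]byte("some decompressed data\n"))}

    r, w := io.Pipe()
    go func() {
        buf := make([]byte, 4*1024) // tune to your data; io.Copy's default is 32 KiB
        io.CopyBuffer(w, src, buf)
        w.Close()
    }()

    // the XML decoder would read from r here
    io.Copy(os.Stdout, r)
}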

An easy solution is to use a readahead package I created some time back: https://github.com/klauspost/readahead

inputReader := bzip2.NewReader(inputFile)
ra := readahead.NewReader(inputReader)
defer ra.Close()

And then pass the reader to the XML parser:

decoder := xml.NewDecoder(ra)

With default settings it will decode up to 4MB ahead of time in 4 buffers.
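For context, a minimal end-to-end sketch of this approach (using only the NewReader and Close calls shown above; error handling abbreviated):

package main

import (
    "compress/bzip2"
    "encoding/xml"
    "fmt"
    "io"
    "os"

    "github.com/klauspost/readahead"
)

func main() {
    inputFile, err := os.Open(os.Args[1])
    if err != nil {
        panic(err)
    }
    defer inputFile.Close()

    // the readahead reader pulls decompressed data on its own goroutine
    ra := readahead.NewReader(bzip2.NewReader(inputFile))
    defer ra.Close()

    decoder := xml.NewDecoder(ra)
    for {
        if _, err := decoder.Token(); err != nil {
            if err != io.EOF {
                fmt.Fprintln(os.Stderr, err)
            }
            break
        }
    }
}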

sh0dan