Compact data structure for storing parsed log lines in Go (i.e. compact data structure for multiple enums in Go)

Question

I'm working on a script that parses and graph information from a database logfile. Some examples loglines might be:

Tue Dec  2 03:21:09.543 [rsHealthPoll] DBClientCursor::init call() failed
Tue Dec  2 03:21:09.543 [rsHealthPoll] replset info example.com:27017 heartbeat failed, retrying
Thu Nov 20 00:05:13.189 [conn1264369] insert foobar.fs.chunks ninserted:1 keyUpdates:0 locks(micros) w:110298 110ms
Thu Nov 20 00:06:19.136 [conn1263135] update foobar.fs.chunks query: { files_id: ObjectId('54661657b23a225c1e4b00ac'), n: 0 } update: { $set: { data: BinData } } nscanned:1 nupdated:1 keyUpdates:0 locks(micros) w:675 137ms
Thu Nov 20 00:06:19.136 [conn1258266] update foobar.fs.chunks query: { files_id: ObjectId('54661657ae3a22741e0132df'), n: 0 } update: { $set: { data: BinData } } nscanned:1 nupdated:1 keyUpdates:0 locks(micros) w:687 186ms
Thu Nov 20 00:12:14.859 [conn1113639] getmore local.oplog.rs query: { ts: { $gte: Timestamp 1416453003000|74 } } cursorid:7965836327322142721 ntoreturn:0 keyUpdates:0 numYields: 15 locks(micros) r:351042 nreturned:3311 reslen:56307 188ms

Not every logline contains all fields, but some of the fields we parse out include:

Datetime
Query Duration
Name of Thread
Connection Number (e.g. 1234, 532434, 53433)
Logging Level (e.g. Warning, Error, Info, Debug etc.)
Logging Component (e.g. Storage, Journal, Commands, Indexin etc.)
Type of operation (e.g. Query, Insert, Delete etc.)
Namespace

The total logfile can often be fairly large (several hundred MBs up to a coupe of GBs). Currently the script is in Python, and as well as the fields, it's also storing the original raw logline as well as a tokenised version - the resulting memory consumption though is actually several multiples of the original logfile size. Hence, memory consumption is one of the main things I'd like to improve.

For fun/learning, I thought I might try re-doing this in Go, and looking at whether we could use a more compact data structure.

Many of the fields are enumerations (enums) - for some of them the set of values is known in advance (e.g. logging leve, logging component). For others (e.g. name of thread, connection number, namespace), we'll work out the set at runtime as we parse the logfile.

Planned Changes

Firstly, many of these enums are stored as strings. So I'm guessing one improvement will be move to using something like an uint8 to store it, and then either using consts (for the ones we know in advance), or having some kind of mapping table back to the original string (for the ones we work out.) Or are there any other reaosns I'd prefer consts versus some kind of mapping structure?

Secondly, rather than storing the original logline as a string, we can probably store an offset back to the original file on disk.

Questions

Do you see any issues with either of the two planned changes above? Are these a good starting point?
Do you have any other tips/suggestions for optimising the memory consumption of how we store the loglines?
I know for bitmaps, there's things like Roaring Bitmaps (http://roaringbitmap.org/), which are compressed bitmaps which you can still access/modify normally whilst compressed. Apparently the overall term for things like this is succinct data structures. However, are there any equivalents to roaring bitmaps but for enumerations? Or any other clever way of storing this compactly?
I also thought of bloom filters, and maybe using those to store whether each logline was in a set (i.e. logging level warning, logging level error) - however, it can only be in one of those sets, so I don't know if that makes sense. Also, not sure how to handle the false positives.

Thoughts?

It's unclear what you're asking Please clarify your specific problem or add additional details to highlight exactly what you need. As it's currently written, it’s hard to tell exactly what you're asking. For example, provide a sample of log file lines. — peterSO, Jan 11 '15 at 21:38
@peterSO Sure, I've added headings, and tried to clarify the question a bit more - let me know if that's helpful. — victorhooi, Jan 11 '15 at 23:59
Regarding enums, you may want to define a type (`type LogLevel int`) and then use `iota` in your const definitions (`const (INFO LogLevel = iota` then, `DEBUG`, `ERROR` etc.) See http://play.golang.org/p/PjrjSVIvrS https://golang.org/ref/spec#Iota — Intermernet, Jan 12 '15 at 00:00

Intermernet · Accepted Answer · 2015-01-12T06:47:19.897

Do you see any issues with either of the two planned changes above? Are these a good starting point?

No problems with either. If the logs are definitely line-delimited you can just store the line number, but it may be more robust to store the byte-offset. The standard io.Reader interface returns the number of bytes read so you can use that to gain the offset.

Do you have any other tips/suggestions for optimising the memory consumption of how we store the loglines?

It depends on what you want to use them for, but once they've been tokenized (and you've got the data you want from the line), why hold onto the line in memory? It's already in the file, and you've now got an offset to look it up again quickly.

are there any equivalents to roaring bitmaps but for enumerations? Or any other clever way of storing this compactly?

I'd tend to just define each enum type as an int, and use iota. Something like:

package main

import (
    "fmt"
    "time"
)

type LogLevel int
type LogComponent int
type Operation int

const (
    Info LogLevel = iota
    Warning
    Debug
    Error
)

const (
    Storage LogComponent = iota
    Journal
    Commands
    Indexin
)

const (
    Query Operation = iota
    Insert
    Delete
)

type LogLine struct {
    DateTime      time.Time
    QueryDuration time.Duration
    ThreadName    string
    ConNum        uint
    Level         LogLevel
    Comp          LogComponent
    Op            Operation
    Namespace     string
}

func main() {
    l := &LogLine{
        time.Now(),
        10 * time.Second,
        "query1",
        1000,
        Info,
        Journal,
        Delete,
        "ns1",
    }
    fmt.Printf("%v\n", l)
}

Produces &{2009-11-10 23:00:00 +0000 UTC 10s query1 1000 0 1 2 ns1}.

Playground

You could pack some of the struct fields, but then you need to define bit-ranges for each field and you lose some open-endedness. For example define LogLevel as the first 2 bits, Component as the next 2 bits etc.

I also thought of bloom filters, and maybe using those to store whether each logline was in a set (i.e. logging level warning, logging level error) - however, it can only be in one of those sets, so I don't know if that makes sense. Also, not sure how to handle the false positives.

For your current example, bloom filters may be overkill. It may be easier to have a []int for each enum, or some other master "index" that keeps track of line-number to (for example) log level relationships. As you said, each log line can only be in one set. In fact, depending on the number of enum fields, it may be easier to use the packed enums as an identifier for something like a map[int][]int.

Set := make(map[int][]int)
Set[int(Delete) << 4 + int(Journal) << 2 + int(Debug)] = []int{7, 45, 900} // Line numbers in this set.

See here for a complete, although hackish example.

Compact data structure for storing parsed log lines in Go (i.e. compact data structure for multiple enums in Go)

1 Answers1