File seek with two-byte characters

Question

I'm writing a small log parser that should find certain tags in files. The files are large (512 MB) and have the following structure:

[2018.07.10 00:30:03:125] VersionInfo\886
...some data...
[2018.07.10 00:30:03:109][TraceID: 8HRWSI105YVO91]->IncomingTime\16
...some data...
[2018.07.10 00:30:03:109][TraceID: 8HRWSI105YVO91]->IncomingData\397
...some data...
[2018.07.10 00:30:03:749][TraceID: 8HRWSI105YVO91]->OutgoingData\26651
...somedata...

Each block (IncomingTime, IncomingData, OutgoingData, etc.) has its size at the end of the header line, given as a character count, not a byte count: 886, 16, 397, 26651. Some blocks are very large and can't be read without a large buffer (if I use bufio). I want to skip unnecessary blocks using file.Seek.

The problem is that file.Seek needs a length in bytes and I only have a character count (a block may contain Unicode data with two-byte characters). Is there any way to skip blocks using the character count?

score 2 · Accepted Answer · answered Jul 28 '18 at 10:25

The problem is that file.Seek needs a length in bytes and I only have a character count (a block may contain Unicode data with two-byte characters). Is there any way to skip blocks using the character count?

That's actually impossible. As you've described the file format, both of the following are possible:

...VersionInfo\1
[ 20 ]
...VersionInfo\1
[ C2 A0 ]

If you've just read the newline and you know you need to read one character, you know it occupies either 1 or 2 bytes in the two cases above (and a UTF-8 character can take up to 4 bytes), but you don't know which. Blindly jumping forward some number of bytes without inspecting the intermediate data won't work. The pathological case is a larger block where the first half has many multi-byte characters and the last half has text that happens to look like one of your entry headers: seeking by the character count lands short of the real end of the block, and scanning forward from there finds the fake header first.
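To make the ambiguity concrete, here is a small Go sketch (assuming the file is UTF-8, as the byte sequences above suggest). The helper name sizes is made up for this illustration. It reports the byte length and the character count of a string, so you can see that a count of one character can correspond to 1, 2, 3, or 4 bytes.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// sizes is a hypothetical helper for this illustration: it returns the
// number of bytes s occupies and the number of characters (runes) in it.
func sizes(s string) (nBytes, nChars int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	// A plain space, a no-break space (U+00A0), a euro sign and an emoji:
	// each is a single character but a different number of UTF-8 bytes.
	for _, s := range []string{" ", "\u00a0", "\u20ac", "\U0001F600"} {
		b, c := sizes(s)
		fmt.Printf("% X: %d character(s), %d byte(s)\n", s, c, b)
	}
}
```

The character count in the header therefore only bounds the byte length (between n and 4n bytes for n characters); it does not determine it.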

With this file format you're forced to read it a character at a time.
