4

Let's say I have a .proto structured (simplified) like this:

message DataItem {
  required string name = 1;
  required int32 value = 2;
}

message DataItemStream {
  repeated DataItem items = 1;
}

The server builds the DataItemStream and writes it to disk. We load this file and everything works without issue.
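
For concreteness, a minimal sketch of that workflow in Python, assuming the module generated from the .proto above is called dataitem_pb2 (a placeholder name):

import dataitem_pb2  # hypothetical module generated from the .proto above

def write_stream(path, pairs):
    # Server side: build one big DataItemStream and serialize it in one shot.
    stream = dataitem_pb2.DataItemStream()
    for name, value in pairs:
        item = stream.items.add()
        item.name = name
        item.value = value
    with open(path, "wb") as f:
        f.write(stream.SerializeToString())

def load_stream(path):
    # Client side: the entire file is parsed back into memory at once.
    stream = dataitem_pb2.DataItemStream()
    with open(path, "rb") as f:
        stream.ParseFromString(f.read())
    return stream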

This worked pretty well for us, but our client base has grown, and so has the use of the software that generates these streams of files.

The problem arises because the repeated items field can have tens of thousands of items, but we're only interested in a subset of them. We've dug around a little bit and have only seen solutions that follow Google's streaming advice: add a size prefix to our stored DataItems and parse each message individually, use a CodedInputStream/CodedOutputStream, or encode the binary wire format (base64) and separate records by newline. With any of these we'd be able to very easily get just the subsets we're interested in.
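
For example, the last of those options (one DataItem per line, serialized and base64-encoded) would look roughly like this in Python; dataitem_pb2 is again the assumed name of the generated module:

import base64
import dataitem_pb2  # hypothetical generated module

def append_item(f, item):
    # One DataItem per line: serialize, base64-encode, newline-terminate.
    f.write(base64.b64encode(item.SerializeToString()) + b"\n")

def iter_items(path):
    # Records can now be read (or skipped) one line at a time.
    with open(path, "rb") as f:
        for line in f:
            item = dataitem_pb2.DataItem()
            item.ParseFromString(base64.b64decode(line))
            yield item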

Any of these would work for us, but they require changes to production code to alter the way the files are saved (server-based code that hasn't been changed in a long time and is deemed virtually untouchable by its maintainers' management; in their minds, don't fix it if it isn't broken...).

We've already re-created the module for the server that streams the messages differently, but we are receiving flak from those maintainers about pushing our changes. It's much easier (politically) for us to change our own code as needed, since we have full control over its development cycle.

Is there a way to still use this original stream of messages but be intelligent about loading only the subsets of messages we care about? (We really do not care what language we have to work in, if that matters; we have experience in C++, Python, Java and .NET, in that order of experience.)

g19fanatic
  • 10,567
  • 6
  • 33
  • 63
  • Our current workaround is to have our code pull in huge files (lots of them...) and essentially pray that we do not run out of RAM... RAM's cheap and we haven't hit a wall... but we will (if everything continues to scale/grow as it has) in roughly 2 months. – g19fanatic Dec 06 '12 at 14:01
  • How do you determine whether an item is in the subset or not? – James Kanze Dec 06 '12 at 14:26
  • The files are saved based upon a time interval and the file names are stamped as such. We know 'when' we're interested in the data so we know which file to look in and 'roughly' which section we would want in that file. – g19fanatic Dec 06 '12 at 14:30
  • After you read a file are you done with it, or might you read it again? If multiple reads, build an index on the file during the first read and use the index to speed up later reads. – brian beuning Dec 07 '12 at 04:26
  • It's a one-time thing, as the worker process enters its analysis into a database. After the files are parsed they are deleted. – g19fanatic Dec 07 '12 at 11:54

5 Answers

1

I would look at this as a database problem: You have a file representing a table (DataItemStream) with individual records (DataItems). You appear to want to pick contiguous ranges of DataItems from the table. This means the order of DataItems in the DataItemStream is important and in fact encodes a hidden primary key - the 'array' index aka row number of the DataItem in the DataItemStream.

In most databases, and in the array data structure, each row (or array item) occupies the same amount of space, so accessing the nth item is easy. However, the DataItems placed in the DataItemStream are of variable length, so this simple approach can't work.

Using the database metaphor, another way to seek records efficiently is to have an index - essentially another table, much smaller, that contains pointers into the main data structure. Indexes are normally structured as a table of (PK, pointer) tuples. In this case you could have an index file that is essentially a memory-mapped array of int32's. Each value in the index points to the byte offset in the data file where that DataItem record starts.

For example, if the data file were 1m records long, your index would be 4MB (1m records * len(int32) = 1m * 4 bytes). If you need to scan the data file for records 777777 to 888888, you:

  1. Read the index to get the byte range of interest in the DataItemStream. Note that the seek operation is very fast indeed:
    1. Open the index file
    2. Seek (e.g., RandomAccessFile.seek() in Java, fileObject.seek() in Python) to the starting index int32 (777777*4) and read it. This is the starting byte offset
    3. Seek to the ending index int32 (888888*4) and read it. This is the ending byte offset
    4. Close the index file
  2. Read the byte range of the DataItemStream file specified by the index:
    1. Open the DataItemStream file
    2. Seek the starting byte offset in the file
    3. Read the stream until the ending byte offset (remember to subtract 1)
    4. Close the DataItemStream file

A slightly different approach for step 2 above could be to first create a new file containing only the specified byte range. This file then consists of only those records of interest.
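
A rough sketch of those two steps in Python, assuming the index is a flat array of int32 offsets as described above (file names and little-endian byte order are assumptions):

import struct

def read_record_range(index_path, data_path, first, last):
    # Returns the raw bytes of records first .. last-1 from the data file.
    with open(index_path, "rb") as idx:
        idx.seek(first * 4)                          # each index entry is 4 bytes
        start = struct.unpack("<i", idx.read(4))[0]  # where record `first` begins
        idx.seek(last * 4)
        end = struct.unpack("<i", idx.read(4))[0]    # where record `last` begins
    with open(data_path, "rb") as data:
        data.seek(start)
        return data.read(end - start)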

How does the index file get created?

EDIT: a description of the PB wire format: the actual index file can be generated by a simple pass over the data file. Every field starts with a varint key encoding the field number and wire type, and each embedded DataItem message is preceded by its length, also as a varint. A varint is encoded in a 'special' way, using the MSB of each byte as a continuation signal, as described here. This means that almost all of the complexity of the data format can be avoided and the indexer can therefore be quite simple.
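
A sketch of such an indexer in Python, under the assumption that the data file consists solely of the unpacked repeated items field (i.e., each record is a key byte for field 1/wire type 2, a varint length, then the serialized DataItem); file names are placeholders:

import struct

def read_varint(f):
    # Base-128 varint: the MSB of each byte signals whether another byte follows.
    result, shift = 0, 0
    while True:
        b = f.read(1)
        if not b:
            return None                     # clean end of file
        result |= (b[0] & 0x7F) << shift
        if not (b[0] & 0x80):
            return result
        shift += 7

def build_index(data_path, index_path):
    with open(data_path, "rb") as data, open(index_path, "wb") as idx:
        while True:
            offset = data.tell()
            key = read_varint(data)         # field number << 3 | wire type
            if key is None:
                break                       # end of file
            if key != (1 << 3) | 2:         # expect field 1, length-delimited
                raise ValueError("unexpected field key at offset %d" % offset)
            length = read_varint(data)
            idx.write(struct.pack("<i", offset))   # int32, as described above
            data.seek(length, 1)            # skip over the DataItem payload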

You could treat the index file as a cache - your code library could use an up-to-date index if present, or automatically create it if it is missing.

This approach allows code that is index-aware to proceed efficiently, and does not change the data format for any legacy programs.

Andrew Alcock
  • 19,401
  • 4
  • 42
  • 60
  • The issue with this solution is that the index file doesn't exist. And to create this file I would need to load the whole file into memory (the only current method when using protocol buffers and streaming a repeated field to disk...), thus bringing me back to our current solution. – g19fanatic Dec 10 '12 at 21:19
  • 1
    Could you share a small section of the DataItem data file? If you do, I'll write an example that will create an index file but does not load all the records into memory. Thanks. – Andrew Alcock Dec 11 '12 at 00:59
  • This might be the first 1k bytes - don't worry about splitting the file at a record/message boundary – Andrew Alcock Dec 11 '12 at 02:24
  • Unfortunately I do not have permission to provide a sample (not our data, not our definition...). That being said, I'm awarding you the bounty for the closest answer (so far). At this point, we've pushed forward and received permission (after some long nights of integration testing in our staging branch) to push our updates to the production server code. Still keeping the question unanswered until a full and correct solution is provided. – g19fanatic Dec 16 '12 at 08:34
0

Given your response to my question in comments: what is the format of the file? Can you synchronize to the beginning of an item after seeking to an arbitrary location?

Faced with a similar situation (log files over a gigabyte, looking for log entries in a specific time interval), we ended up doing a binary search on the file. This is a bit tricky, and the code isn't really portable, but the basic idea is to determine the length of the file (using stat under Unix, or its equivalent under Windows), open the file (in binary mode under Windows), seek to the middle, scan ahead for the start of the next record, compare it with what we're looking for, and repeat, determining the next seek location by whether we were behind or in front of where we wanted to be. This works if you can find the start of a record when starting from an arbitrary location (which in turn depends on the format of the file).
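
To illustrate the general idea only (this is the sorted, newline-delimited log-file case described above, not the protobuf file; the key is assumed to be a sortable token at the start of each line):

import os

def find_first_at_or_after(path, target_key, key_of=lambda line: line.split(b" ", 1)[0]):
    # Returns the byte offset of the first line whose key >= target_key
    # (target_key is bytes), or the file size if no such line exists.
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        first = f.readline()
        if not first or key_of(first) >= target_key:
            return 0                        # the very first record already qualifies
        # Invariant: P(lo) is false and P(hi) is true, where P(x) means
        # "the first line starting strictly after offset x has key >= target".
        lo, hi = -1, size
        while hi - lo > 1:
            mid = (lo + hi) // 2
            f.seek(mid)
            f.readline()                    # skip the (partial) line we landed in
            line = f.readline()
            if not line or key_of(line) >= target_key:
                hi = mid
            else:
                lo = mid
        f.seek(hi)
        f.readline()                        # skip to the start of the answer line
        return f.tell()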

James Kanze
  • 150,581
  • 18
  • 184
  • 329
  • The format of the file is given in the question. It is a protocol buffer of the `DataItemStream` type defined above, serialized and saved. We're not looking to re-create our own message-packing scheme and would like to stick with protocol buffers. That being said, if you can add more info to your answer about how to 'skip' serialized repeated elements which are stored in binary format... it would probably be acceptable. – g19fanatic Dec 06 '12 at 18:03
  • Protocol buffers aren't really designed for this; they generally have to be read sequentially, since there's no way to synchronize at a record boundary. – James Kanze Dec 06 '12 at 20:08
  • I concur that they aren't really meant for this and our proposed 'fix' solves this issue (we did the base64 serialization of an item which is then split by newline) but we're trying to find a way to do this with the current production server and not the one that is going to be a pain to get pushed to production... – g19fanatic Dec 06 '12 at 20:25
0

If your main bottleneck is RAM rather than speed of disk access, why not insert a proxy/filter that reads through the entire file message by message (I mean DataItem messages), but only retains and forwards the parts you're interested in? Sounds like it could buffer the entire part you are interested in without risking an overflow.

You can additionally improve the proxy to seek into the buffer file, if you can figure out how to detect the start of a message mid-stream, but that's a performance improvement that won't affect the interface of the proxy to the rest of the pipeline, or the maximal memory footprint.

Your problem is that since the DataItem messages are embedded in a DataItemStream, the google API forces you to load the entire DataItemStream in one go. To avoid this, you could write some ugly code that skips the DataItemStream envelope and exposes the sequence of DataItem as if they were unembedded. It'll depend on the internals of the PB serialization, but since your client puts a premium on stability, you can count on it not changing any time soon. If and when it does change, it will be time to push your preferred message layout and switch to the solution you've already developed.

If the actual message format is not significantly more complex than what you show (e.g., not too many optional fields or multiple levels of embedded messages), it should be straightforward to navigate the file layout using CodedInputStream (without changing the current file layout). Reading the first DataItem should be just a matter of seeking to its start with skipRawBytes() and skipField(), then reading each message with readMessage().

But I understand that the google API is not designed to read sequences of messages from a single stream, so this is probably too simplistic. If so, you should still be able to find the offset to field 1 of the first DataItem, read it with readString(), advance as necessary to the start of field 2, read it with readInt32(), etc. Your proxy can then discard the message or reassemble it and pass it on to the rest of the code.
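
Lacking the real message definitions, here is only a sketch of that idea in Python, decoding the outer framing by hand rather than with CodedInputStream; it assumes the file contains nothing but the unpacked repeated DataItem field and that the generated module is named dataitem_pb2 (a placeholder):

import dataitem_pb2  # hypothetical generated module

def read_varint(f):
    # Base-128 varint: the MSB of each byte signals whether another byte follows.
    result, shift = 0, 0
    while True:
        b = f.read(1)
        if not b:
            return None                     # clean end of file
        result |= (b[0] & 0x7F) << shift
        if not (b[0] & 0x80):
            return result
        shift += 7

def iter_data_items(path):
    # Yields DataItem messages one at a time; only one is held in memory.
    with open(path, "rb") as f:
        while True:
            key = read_varint(f)
            if key is None:
                break
            if key != (1 << 3) | 2:         # expect field 1, length-delimited
                raise ValueError("unexpected wire-format key")
            length = read_varint(f)
            item = dataitem_pb2.DataItem()
            item.ParseFromString(f.read(length))
            yield item

The proxy/filter is then just a loop over iter_data_items() that keeps (or forwards) the items of interest and drops the rest.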

I imagine you've thought of something along these lines already, so perhaps this approach is unfeasible or undesirable for some reason? Or maybe it's just so ugly that you'd rather deal with the political cost of changing the file layout...

alexis
  • 48,685
  • 16
  • 101
  • 161
  • Our main concern is always performance. I mentioned the RAM issue in the comments as that is the current way that we are handling the problem (aka ignoring it and just loading the whole file and then finding the one we're interested in after it has been deserialized). Without a PB-specific solution to this issue, we're going to move forward through the political mess to get our module pushed to production... but as I said above, this is unwanted and ultimately not the way we'd like to handle the problem. – g19fanatic Dec 06 '12 at 18:39
  • A lil more info on our backend: we use a Django frontend that uses customized extensions to generate these very large files. When the server is ready to switch to a new file, it notifies a broker that this file is ready for analysis. This broker (zmq based PUSH/PULL fan-out distribution) takes the filename and passes it to one of a dozen (or so, depending on what's up at the moment) nodes for loading and processing. This happens this way so that we can very efficiently use the assets that we have almost immediately after the file is available. [cont'd] – g19fanatic Dec 06 '12 at 18:56
  • [cont'd] I'm not seeing how having a proxy that loads the file then spits out the relevant pieces would be any better than having the individual workers load and process them. Your method would be more serial and take less RAM, while our method is more parallel and takes more RAM. RAM is cheap, as I've said, and we can just increase the amount of RAM/number of nodes... But solving the root of the issue will eliminate all of these issues... – g19fanatic Dec 06 '12 at 18:59
  • The proxy shouldn't load the entire file: It should discard most of the file as it reads it off the disk, keeping just the useful part. So it would need no longer than the disk fetch, and you end up with a smaller object. – alexis Dec 06 '12 at 19:14
  • The issue is that the server is saving the file in one large chunk as a PB of the above defined `DataItemStream` type. This is atomic and serialized as one message. Unless you know how to split this message into parts (PB solution that I am looking for), I am unable to do what you say. – g19fanatic Dec 06 '12 at 19:36
  • I see now; you need a way to ignore the `DataItemStream` wrapper and expose the content as a stream of `DataItem` messages. If the client is so set on stability, it's probably safe to bypass PB and hack something based on the actual file layout. (I don't use PB so I can't say more). – alexis Dec 06 '12 at 21:05
  • PS. If you could post the top of an actual serialized file, it should be easier for people to see potential shortcuts. – alexis Dec 06 '12 at 21:08
  • I would if I could, but the actual serialized data contains some client-specific information which is not releasable. Besides, the files are in a binary format (not text). They follow the Protocol Buffer wire format for repeated fields as listed here https://developers.google.com/protocol-buffers/docs/encoding Also for reference, the repeated message is NOT packed, if that makes a difference. – g19fanatic Dec 06 '12 at 22:44
  • Well, it's not the actual data but the actual message definitions that I was wondering about. Perhaps my (extended) answer makes that clearer. – alexis Dec 07 '12 at 10:03
  • I applaud you for the tenacity in helping me. Unfortunately the actual message description is much more complex than what I have listed and I do not yet have approval from our client to make it public.... that being said, our server team is still resisting any movement towards accepting our resolution. I will be moving forward with trying to write a custom binary reader to parse this repeated field stream. We were hoping this wasn't the solution but no one really sees any alternative. – g19fanatic Dec 07 '12 at 11:51
  • There is no need to write a complete binary reader - the PB format is basically either object or int. Objects have their length specified in the stream, and ints are read simply by reading bytes until the MSB is not set. This means a top level index can be efficiently generated as described in my answer. – Andrew Alcock Dec 11 '12 at 13:20
0

I'm afraid you have to do the outer PB-stream decoding yourself. Once you've found the start of your items (hopefully not packed), you can use PB for the individual items. Then you can work on literally endless streams. Of course you have to do some serious testing, but I'd give it a try.

stefan
  • 3,681
  • 15
  • 25
0

If you're certain a real database engine running with a properly-normalized schema can't do the job (I'd test this myself; database-engine guys have been thinking about how to solve puzzles like this for a while now), then try this:

Batch up enough records that the data for the smallest field for the collection occupies at least a large-ish handful of sectors. Write the data for the fields in those records to separate contiguous chunks starting at 4KB boundaries, with a header at the front of each set saying where to find the chunk boundaries. Then read only the chunks for the fields you want.
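
As a toy sketch of writing one such batch in Python (the header format, one little-endian int64 offset per chunk, is an assumption; the 4 KB alignment is as described above):

import struct

CHUNK_ALIGN = 4096                          # the 4 KB boundary mentioned above

def write_batch(out, field_blobs):
    # field_blobs: one bytes object per field, holding that field's data
    # for every record in the batch.
    def align(off):
        return (off + CHUNK_ALIGN - 1) // CHUNK_ALIGN * CHUNK_ALIGN

    base = out.tell()
    header_size = 8 * len(field_blobs)      # one int64 offset per chunk
    offsets, cursor = [], base + header_size
    for blob in field_blobs:
        cursor = align(cursor)
        offsets.append(cursor)
        cursor += len(blob)

    out.write(struct.pack("<%dq" % len(field_blobs), *offsets))
    for off, blob in zip(offsets, field_blobs):
        out.write(b"\0" * (off - out.tell()))   # pad to the 4 KB boundary
        out.write(blob)

A reader then loads only the header plus the chunks for the fields it actually needs.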

If you really need performance but you still have spinny things, place the chunks in separate files on separate disks. Skipping's as slow as reading on those things until you're skipping really large chunks of data.

Edit: I see you don't want to change the on-disk format, but since you're trying to avoid reading unwanted data there really isn't a choice.

jthill
  • 55,082
  • 5
  • 77
  • 137