Storing dynamic objects with growing lists on disk

Question

Okay so I have been developing a system so far in main memory that has many different objects and each object stores lists of other objects in the system. Now I want to move this to persistent storage. I'm not looking for the obvious answer of using a DBMS because the point is that I'm writing a custom database for my system.

For each object I'm assigning an id. The ids can be looked up in a table to find the block and offset of that object's data. Each object has lists/sets that point to other objects in the system, so in storage they will be lists of 8-byte ids (I'm using longs for the ids) that can be used to find the other objects. My question is this: I know the lists will grow over time, so they need room to grow. My best thought so far for storing the lists, so that I won't need to move objects around when they grow, is to assign each list an id just like the objects, so that lists can be looked up in a table just like the objects to find them on disk.

Each list portion will have a fixed allocation for 10 object ids, followed at the end by the id of the next list portion if the list contains more objects. This seems like a decent way to deal with constantly growing objects, but I'm wondering if there are any better approaches. I would store the indexes in memory (space permitting), so given an object id the lookup is in memory, and then it takes 1 I/O to get its data and list ids from the disk. Then, for each list you want to traverse, it takes another lookup and I/O for every 10 objects in the list, or fewer if the block is cached.
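To make the layout above concrete, here is a minimal sketch of such a chain of fixed-size portions. Everything in it is an assumption for illustration rather than the asker's actual code: the class name `ChunkedIdList`, the 4-byte count at the start of each portion, the `-1` sentinel for "no next portion", and the plain `HashMap` standing in for the id-to-offset table.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: a list of 8-byte ids stored as a chain of fixed-size portions. */
class ChunkedIdList {
    static final int SLOTS = 10;                        // ids per portion, as in the question
    static final long NO_NEXT = -1L;                    // assumed marker for the last portion
    static final int PORTION_BYTES = 4 + SLOTS * 8 + 8; // count + id slots + next portion id

    private final RandomAccessFile file;
    // In-memory table: portion id -> file offset (stand-in for the lookup table).
    private final Map<Long, Long> offsets = new HashMap<>();
    private long nextPortionId = 0;

    ChunkedIdList(RandomAccessFile file) { this.file = file; }

    /** Appends an empty portion at the end of the file and returns its id. */
    long newPortion() throws IOException {
        long id = nextPortionId++;
        long offset = file.length();
        offsets.put(id, offset);
        file.seek(offset);
        file.writeInt(0);                                   // number of used slots
        for (int i = 0; i < SLOTS; i++) file.writeLong(0L); // empty slots
        file.writeLong(NO_NEXT);                            // no successor yet
        return id;
    }

    /** Adds an object id to the list whose first portion is headId. */
    void add(long headId, long objectId) throws IOException {
        long portion = headId;
        while (true) {
            long base = offsets.get(portion);
            file.seek(base);
            int count = file.readInt();
            if (count < SLOTS) {                            // room left: write in place
                file.seek(base + 4 + 8L * count);
                file.writeLong(objectId);
                file.seek(base);
                file.writeInt(count + 1);
                return;
            }
            file.seek(base + 4 + 8L * SLOTS);               // full: follow or create the link
            long next = file.readLong();
            if (next == NO_NEXT) {
                next = newPortion();
                file.seek(base + 4 + 8L * SLOTS);
                file.writeLong(next);
            }
            portion = next;
        }
    }

    /** Reads every id in the list, one portion (one seek) at a time. */
    List<Long> readAll(long headId) throws IOException {
        List<Long> ids = new ArrayList<>();
        for (long portion = headId; portion != NO_NEXT; ) {
            file.seek(offsets.get(portion));
            int count = file.readInt();
            long[] slots = new long[SLOTS];
            for (int i = 0; i < SLOTS; i++) slots[i] = file.readLong();
            for (int i = 0; i < count; i++) ids.add(slots[i]);
            portion = file.readLong();                      // id of the next portion
        }
        return ids;
    }

    /** Demo on a temporary file: stores 25 ids, which spans three portions. */
    static List<Long> demo() {
        try {
            File tmp = File.createTempFile("chunked-list", ".bin");
            tmp.deleteOnExit();
            try (RandomAccessFile raf = new RandomAccessFile(tmp, "rw")) {
                ChunkedIdList store = new ChunkedIdList(raf);
                long head = store.newPortion();
                for (long id = 100; id < 125; id++) store.add(head, id);
                return store.readAll(head);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Note that `add` walks the chain from the head on every call; remembering the id of the tail portion alongside the head would avoid that, at the cost of one more field per list.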

The number of I/Os is not terrible, and I would try to keep list portions close together to eliminate unnecessary I/Os, but is there a better way of doing this? Am I right to store the lists separately from the object, or should I consider storing them with the object's data? My worry about doing that is that as one list grows it will run into another list and then need to be fragmented, and this can get more complicated. Any suggestions are appreciated, and thanks in advance.

score 1 · Accepted Answer · answered Dec 14 '11 at 15:15

Your idea of having these expandable lists is good. I think your explanation is missing some details (e.g. whether the lists are ordered, and what you mean by storing lists separately from objects; a diagram of these lists might help).

I would keep a sorted index in memory for fast access. The index would hold the list id and its location on disk. If you're interested in range queries, go with a B-tree approach; otherwise you could use a hashmap to store these indices.
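A rough sketch of the two index options, under stated assumptions: `DiskLocation` and `ListIndex` are hypothetical names, and `java.util.TreeMap` (a red-black tree, not a B-tree) is used only as an in-memory stand-in for a sorted structure that supports range scans. In practice you would pick one of the two maps rather than keep both.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

/** Hypothetical location record: which block a list lives in and where inside it. */
final class DiskLocation {
    final long block;
    final int offset;

    DiskLocation(long block, int offset) {
        this.block = block;
        this.offset = offset;
    }
}

/** Hypothetical in-memory index from list id to disk location. */
class ListIndex {
    // Sorted index: point lookups and range scans, both O(log n).
    private final TreeMap<Long, DiskLocation> sorted = new TreeMap<>();
    // Hash index: point lookups only, expected O(1).
    private final Map<Long, DiskLocation> hashed = new HashMap<>();

    void put(long listId, DiskLocation loc) {
        sorted.put(listId, loc);
        hashed.put(listId, loc);
    }

    /** Point lookup; returns null if the list id is unknown. */
    DiskLocation lookup(long listId) {
        return hashed.get(listId);
    }

    /** Range query over list ids, which only the sorted index can answer efficiently. */
    SortedMap<Long, DiskLocation> range(long fromInclusive, long toExclusive) {
        return sorted.subMap(fromInclusive, toExclusive);
    }
}
```

If the index itself ever outgrows memory, that is the point where an on-disk B-tree becomes the more natural choice, since its nodes map onto disk blocks.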

A further improvement, if you're searching the lists, is to keep them sorted, or at least semi-sorted, so that you can group similar lists in the same chunk. This would speed up searching in the lists if you cache the boundaries of each chunk in memory every so often (nodes with values between 1-9, 10-25, etc.). Merge sort is probably the best sort for lists. Or even better, when you insert nodes into a list, insert them at the correct location so the list is always sorted, then look up with binary search. If the data is not indexed properly and not sorted, every query goes to disk multiple times, and any search effectively takes linear time because disk access dominates.
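The two ideas in that paragraph, inserting in sorted position and caching chunk boundaries, can be sketched in memory as follows. The class names `SortedIdList` and `ChunkDirectory` are made up for this example, and the directory assumes each chunk is registered under the smallest id it holds.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical list of ids that is kept sorted on every insert. */
class SortedIdList {
    private final List<Long> ids = new ArrayList<>();

    /** Inserts id at its sorted position; returns false if it is already present. */
    boolean insert(long id) {
        int pos = Collections.binarySearch(ids, Long.valueOf(id));
        if (pos >= 0) return false;          // found: nothing to do
        ids.add(-pos - 1, Long.valueOf(id)); // binarySearch encodes the insertion point
        return true;
    }

    /** Binary search lookup, O(log n) comparisons. */
    boolean contains(long id) {
        return Collections.binarySearch(ids, Long.valueOf(id)) >= 0;
    }

    List<Long> snapshot() {
        return new ArrayList<>(ids);
    }
}

/** Hypothetical cache of chunk boundaries: smallest id in a chunk -> chunk id. */
class ChunkDirectory {
    private final TreeMap<Long, Long> firstIdToChunk = new TreeMap<>();

    void register(long firstId, long chunkId) {
        firstIdToChunk.put(firstId, chunkId);
    }

    /** Returns the only chunk that can contain id, or -1 if id precedes every chunk. */
    long chunkFor(long id) {
        Map.Entry<Long, Long> entry = firstIdToChunk.floorEntry(id);
        return entry == null ? -1L : entry.getValue();
    }
}
```

With the boundaries from the answer (one chunk starting at 1, the next at 10), `chunkFor(9)` points at the first chunk and `chunkFor(25)` at the second, so a lookup reads a single chunk from disk instead of walking the whole chain.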

You can also cache data nodes of the 10% most looked up nodes/lists.
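One simple way to approximate this is a least-recently-used cache, sketched below. Note the substitution: LRU tracks recency rather than true lookup frequency, so it is a stand-in for "the 10% most looked up", not the same policy. The class name `LruCache` is an assumption; the `LinkedHashMap` access-order constructor and `removeEldestEntry` hook are standard JDK features.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical bounded cache that evicts the least recently used entry. */
class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    LruCache(int capacity) {
        super(16, 0.75f, true); // true = iterate in access order, oldest access first
        this.capacity = capacity;
    }

    // Called by LinkedHashMap after each put; returning true drops the eldest entry.
    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity;
    }
}
```

Sizing the capacity at roughly 10% of the number of nodes or list portions would match the figure in the answer; keys would be node or list ids and values the decoded data.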

Depending on the size of these lists (and how many chunks you have for them), you could use some RAID so you can get some parallel reads/writes.

I already finished most of my implementation. I ended up deciding to use sets instead of lists, and since I also needed hashmaps I created my own implementation of the `Map` interface, except with `K, V extends PersistentObject`. Then I made a set backed by that map. I then implemented PersistentObject as a class that uses reflection to find all non-transient fields of its subclasses and save them accordingly, so that each object didn't need to define its own way of saving. Everything turned out pretty well overall. I'll keep your answer in mind for caching later though. — user1084563, Dec 16 '11 at 00:00
@user1084563 I like it :) What do you use this system for, if you don't mind me asking. — Adrian, Dec 16 '11 at 02:42
It's actually a system for storing a semantic network built from information extracted from natural language input using the Stanford NLP toolkit. Obviously I could have used a database, but I decided to write the storage myself as a side project for my database systems course, and I wanted it to be as dynamic as possible so that the persistent datatypes could be swapped easily for the existing main memory datatypes. I also wanted changes to object definitions to be automatically reflected in the storage, so that the object doesn't then have to redefine how to save and load itself. — user1084563, Dec 16 '11 at 08:47
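
For readers curious about the reflection approach mentioned in the comments, here is a minimal sketch of how non-transient fields might be discovered. It is not the commenter's code: `PersistentFields`, `Base`, and `Concept` are hypothetical names, and actually serializing the field values is left out.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that finds the fields a generic saver would write out. */
class PersistentFields {
    /** Collects instance fields that are not transient, walking up the class hierarchy. */
    static List<Field> persistentFields(Class<?> type) {
        List<Field> result = new ArrayList<>();
        for (Class<?> c = type; c != null && c != Object.class; c = c.getSuperclass()) {
            for (Field f : c.getDeclaredFields()) {
                int mods = f.getModifiers();
                // Skip static state, fields marked transient, and compiler-generated fields.
                if (Modifier.isStatic(mods) || Modifier.isTransient(mods) || f.isSynthetic()) {
                    continue;
                }
                result.add(f);
            }
        }
        return result;
    }
}

/** Example classes, made up for this sketch. */
class Base {
    long id;
    transient int cachedHash; // excluded: marked transient
}

class Concept extends Base {
    String label;
    static int instances;     // excluded: static
}
```

For `Concept`, the helper returns the fields `label` and `id`. A saver built on top of it would call `setAccessible(true)` where needed and write each field by type, which is what lets a change to a class definition flow through to storage without a hand-written save method.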
