
I am attempting to create a wrapper around SQLite databases that would allow schema-less access from .NET applications. The idea is to have each entity as simply a collection of attribute-value pairs.

The priority is simplicity and ease of use for the developer rather than performance, but I don't want to completely disregard memory usage and disk read times.

The architecture of my wrapper means that, at a minimum, all primary keys will be read in on the first query (and then cached); the other attributes can be read in later as required.

The question is: is it better to create (at run time) a column per attribute name and pull in the columns corresponding to the attributes requested, or to have a record per attribute and pull in the records of the attributes requested?

By the very nature of the application, I won't know in advance what order attributes will be read and whether or not all attributes will be read.


1 Answer


Do not reinvent the wheel, especially when the one you are trying to replace is already round.

So let's see how the existing wheel compares to the one you are trying to reinvent.

1) Existing wheel (table with columns)

We have a table Entity with a structure like this:

CREATE TABLE Entity (
    id    INTEGER PRIMARY KEY,
    attr1 INTEGER,
    attr2 REAL, -- this holds a date
    attr3 TEXT,
    attr4 TEXT
);

2) Reinvented wheel (attributes table)

Say we create a table EntityAttributes that will hold all the attributes from Entity in 1). The question is what it should look like. Should we use TEXT as the type for all attribute values and handle the back-and-forth conversion in the application? Then the table could look like this:

CREATE TABLE EntityAttributes (
    entity_id  INTEGER,
    attr_name  TEXT,
    attr_value TEXT,
    PRIMARY KEY (entity_id, attr_name)
);

This is actually not such a big drama in SQLite, since its declared column types are mere type affinities rather than constraints, but with any normal RDBMS this is a huge loss.
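To see why it is not a big drama in SQLite specifically, here is a small demonstration (a sketch, not part of the proposed schema): a column declared with no type keeps each value's own storage class.

    CREATE TABLE demo (v);
    INSERT INTO demo VALUES (1), (1.5), ('abc');
    SELECT v, typeof(v) FROM demo;
    -- 1|integer
    -- 1.5|real
    -- abc|text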

The alternative is to go for a more complex attributes table that supports multiple types. It could look like this:

CREATE TABLE EntityAttributes (
    entity_id  INTEGER,
    attr_name  TEXT,
    attr_int   INTEGER,
    attr_real  REAL,
    attr_text  TEXT,
    PRIMARY KEY (entity_id, attr_name)
);
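With this layout, each row populates exactly one of the typed value columns and leaves the others NULL. For example (hypothetical sample data, using the attribute names from 1)):

    INSERT INTO EntityAttributes (entity_id, attr_name, attr_int)  VALUES (1, 'attr1', 42);
    INSERT INTO EntityAttributes (entity_id, attr_name, attr_real) VALUES (1, 'attr2', julianday('2011-09-03')); -- the date
    INSERT INTO EntityAttributes (entity_id, attr_name, attr_text) VALUES (1, 'attr3', 'foo');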

There is another possibility: use a master table for the pair (entity_id, attr_name) with a surrogate key, plus one table per data type (each having as its PK the surrogate key value from the master attributes table), but this is becoming too much for this discussion.
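For illustration only, a minimal sketch of that layout could look like this (all table and column names are hypothetical):

    -- Master table: one surrogate key per (entity_id, attr_name) pair.
    CREATE TABLE EntityAttributeKeys (
        attr_id   INTEGER PRIMARY KEY,
        entity_id INTEGER,
        attr_name TEXT,
        UNIQUE (entity_id, attr_name)
    );

    -- One value table per data type, keyed by the surrogate key.
    CREATE TABLE AttrIntValues  (attr_id INTEGER PRIMARY KEY REFERENCES EntityAttributeKeys(attr_id), value INTEGER);
    CREATE TABLE AttrRealValues (attr_id INTEGER PRIMARY KEY REFERENCES EntityAttributeKeys(attr_id), value REAL);
    CREATE TABLE AttrTextValues (attr_id INTEGER PRIMARY KEY REFERENCES EntityAttributeKeys(attr_id), value TEXT);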

The most important thing to see is that, right from the start, there are problems we need to work around when we pick your wheel.

Now let's move on and talk a bit about performance. All values from a row are normally stored next to (or close to) each other, so the IO cost is normally low. The same can't be said about multiple rows, especially if they were not inserted one after another. You can always rebuild the database file (for example with VACUUM) to optimize access, but that won't help much if the rows are not created one after another.
Another disadvantage of your solution is that it actually needs more storage than the natural solution (not by much, but still more).

All good till now? How about this: you have the following simple SQL query:

SELECT id, attr1, attr2
  FROM Entity
 ORDER BY attr1;

Try to do the same using the attribute-per-row approach and see how easy it is. And this is just one example; many others come to mind.
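For comparison, here is a sketch of roughly the equivalent query against the all-TEXT EntityAttributes table from 2):

    SELECT e.entity_id, a1.attr_value AS attr1, a2.attr_value AS attr2
      FROM (SELECT DISTINCT entity_id FROM EntityAttributes) AS e
      LEFT JOIN EntityAttributes AS a1
             ON a1.entity_id = e.entity_id AND a1.attr_name = 'attr1'
      LEFT JOIN EntityAttributes AS a2
             ON a2.entity_id = e.entity_id AND a2.attr_name = 'attr2'
     ORDER BY CAST(a1.attr_value AS INTEGER);

Each additional attribute costs another self-join, and because everything is stored as text, every numeric sort or comparison needs a cast.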

Compared to a traditional RDBMS, the attributes-table solution looks even worse in your case (for one thing because they do a better job of keeping the column values of a row together, and for another because, unlike SQLite, they support genuinely distinct types). Using a key-value data store such as Redis is a different matter.

All in all, there is much to be said against the attribute-per-row approach and really no advantage if you can do the same with a few columns. An attributes table is useful when you actually need it (something like an unknown number of dynamic attributes comes to mind), but in your case I would stay with the natural solution.

  • There will be a relatively unknown number of attributes (my library only knows about each one once it has been set, similar to Amazon's SimpleDB, for example). I'm not hugely fussed about the ease of querying in that sense, as I intend it to be used just as a local cache, only ever accessed through my data layer, which has no domain-specific knowledge. The locality of items on the disk is a potential issue (although I suspect it would be more common for someone to read one attribute for all entities than all attributes for one entity), which is not to say your way isn't still the right way to go – ForbesLindesay Sep 03 '11 at 11:01
  • Is the cost of adding a column not a lot higher than the cost of adding a row though? – ForbesLindesay Sep 03 '11 at 11:02
  • Compare adding a column to adding as many rows as there are entities if the new column must be set to a specific value. If it is allowed to be NULL, you need not insert a row for the new attribute for every entity, but this complicates the logic for updates and reads. As for my way not being the right way to go, please elaborate. – Marius Burz Sep 03 '11 at 12:02
  • Your way may be the right way to go. The way the program works, if the value is just the default value it can be null in the database. So rows wouldn't need to be added for default values, only when an actual value is present. – ForbesLindesay Sep 06 '11 at 08:43
  • In this case, when you update an attribute, you have to check whether the `update` updated any rows at all. If not, you have to insert the value for the respective attribute. I would still go with columns inside a single table, except when there are really a lot of attributes created dynamically. As for the tables holding the attributes, I used `attr_name` just because it makes the point easy to see. It would be better to have an `attributes(id, name)` table and use an `attr_id` column instead of `attr_name` in `EntityAttributes`. – Marius Burz Sep 06 '11 at 09:07
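A minimal sketch of both suggestions from the last comment combined (hypothetical names; assuming EntityAttributes now has a primary key on (entity_id, attr_id), SQLite's INSERT OR REPLACE collapses the update-or-insert check into a single statement):

    -- Lookup table so EntityAttributes stores a compact attr_id instead of attr_name.
    CREATE TABLE attributes (
        id   INTEGER PRIMARY KEY,
        name TEXT UNIQUE
    );

    -- Updates the attribute if the row exists, inserts it otherwise.
    INSERT OR REPLACE INTO EntityAttributes (entity_id, attr_id, attr_value)
    VALUES (1, (SELECT id FROM attributes WHERE name = 'attr1'), '42');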