
I have a project that requires user-defined attributes for a particular object at runtime (let's say a Person object in this example). The project will have many different users (1,000+), each defining their own unique attributes for their own sets of Person objects.

(E.g. user #1 will have a set of defined attributes, which will apply to all Person objects 'owned' by this user. Multiply this by 1,000 users, and that's the bare minimum number of users the app will work with.) These attributes will be used to query the Person objects and return results.

I think these are the possible approaches I could use. I will be using C# (any version of .NET 3.5 or 4), and have free rein regarding the datastore. (I have MySQL and MSSQL available, although I have the freedom to use any software, as long as it fits the bill.)

Have I missed anything, or made any incorrect assumptions in my assessment?

Out of these choices - what solution would you go for?

  1. Hybrid EAV object model. (Define the database using a normal relational model, and have a 'property bag' table for the Person table.)

    Downsides: many joins per query; poor performance; can hit the limit on the number of joins/tables used in a query.

    I've knocked up a quick sample that has a SubSonic 2.x-esque interface:

    Select().From().Where  ... etc
    

    This generates the correct joins, then filters and pivots the returned data in C#, returning a DataTable configured with the correctly typed data set.

    I have yet to load test this solution. It's based on the EAV advice in this Microsoft whitepaper: SQL Server 2008 RTM Documents - Best Practices for Semantic Data Modeling for Performance and Scalability.

  2. Allow the user to dynamically create/alter the object's table at runtime. This is what I believe NHibernate does in the background when using dynamic properties, as discussed here:

    http://bartreyserhove.blogspot.com/2008/02/dynamic-domain-mode-using-nhibernate.html

    Downsides:

    As the system grows, the number of columns defined will get very large and may hit the per-table column limit. If there are 1,000 users, each with 10 distinct attributes for their Person objects, then we'd need a table holding 10,000 columns. Not scalable in this scenario.

    I guess I could allow a person attribute table per user, but if there are 1,000 users to start with, that's 1,000 tables plus the other 10-odd in the app.

    I'm unsure whether this would be scalable - it doesn't seem so. Someone please correct me if I'm wrong! (A rough sketch of the runtime DDL this option implies follows this list.)

  3. Use a NoSQL datastore, such as CouchDB or MongoDB.

    From what I have read, these aren't yet proven in large-scale apps, are string-based, and are very early in their development phase. If I am incorrect in this assessment, can someone let me know?

    http://www.eflorenzano.com/blog/post/why-couchdb-sucks/

  4. Use an XML column in the Person table to store attributes.

    Drawbacks - no index for querying, so every row would need to be retrieved and its XML queried to return a resultset, resulting in poor query performance.

  5. Serializing an object graph to the database.

    Drawbacks - same as the XML column: no index for querying, so every row would need to be retrieved and deserialized to return a resultset, resulting in poor query performance.

  6. C# bindings for Berkeley DB.

    From what I read here: http://www.dinosaurtech.com/2009/berkeley-db-c-bindings/

    Berkeley DB has definitely proven to be useful, but as Robert pointed out - there is no easy interface. Your entire OO wrapper has to be hand-coded, and all of your indices are hand-maintained. It is much more difficult than SQL/LINQ to SQL, but that's the price you pay for ridiculous speed.

    It seems a large overhead - however, if anyone can provide a link to a tutorial on how to maintain the indices in C#, it could be a goer.

  7. SQL/RDF hybrid. Odd that I didn't think of this before. Similar to option 1, but instead of a 'property bag' table, just cross-reference to an RDF store. Querying would then involve two steps: query the RDF store for people matching the correct attributes to return the Person object(s), then use the IDs of those Person objects in the SQL query to return the relational data. Extra overhead, but it could be a goer.
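As flagged in option 2, here is a rough sketch of the runtime DDL that approach implies. All names here are invented for the example, and the column name has to be validated against a whitelist first, since identifiers cannot be passed as SQL parameters:

using System.Data.SqlClient;

// Option 2 sketch: add a user-defined attribute as a real column at runtime.
// 'safeColumnName' must already have been validated/whitelisted -
// identifiers cannot be parameterised the way values can.
static void AddAttributeColumn(SqlConnection connection, string safeColumnName, string sqlType)
{
    string ddl = string.Format(
        "ALTER TABLE Person_User42 ADD [{0}] {1} NULL",
        safeColumnName, sqlType);
    using (var command = new SqlCommand(ddl, connection))
    {
        command.ExecuteNonQuery();
    }
}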

  • +1 for good research before asking a question. Look at MongoDB. Have a listen to [this](http://www.dotnetrocks.com/default.aspx?showNum=507) podcast; it seems to be what you are looking for. I am sorry, I have not used it personally. – shahkalpesh Jan 10 '10 at 13:54
  • For info on 4: you can combine XML columns with calculated+persisted+indexed columns, but then you end up back in the "too many columns" scenario. Do you need to be able to **search** on the dynamic properties? This may be the key factor that differentiates between a property bag via key/value pairs in a table, and serialization/XML. – Marc Gravell Jan 10 '10 at 14:17
  • Yes - the primary focus of the app is to search by the dynamic properties in an efficient way. I hadn't thought about combining XML columns with calculated+persisted+indexed columns - do you have any resources explaining this further? – James Jan 10 '10 at 14:27
  • +1 and favourited for the excellently written question. Will follow the question because you've made me curious. – bastijn Jan 11 '10 at 07:26

5 Answers


The ESENT database engine on Windows is used heavily for this kind of semi-structured data. One example is Microsoft Exchange, which, like your application, has thousands of users where each user can define their own set of properties (MAPI named properties). Exchange uses a slightly modified version of ESENT.

ESENT has a lot of features that enable applications with large metadata requirements: each ESENT table can have around 32,000 columns defined; tables, indexes and columns can be added at runtime; sparse columns don't take up any record space when not set; and template tables can reduce the space used by the metadata itself. It is common for large applications to have thousands of tables/indexes.

In this case you can have one table per user and create the per-user columns in the table, creating indexes on any columns that you want to query. That would be similar to the way that some versions of Exchange store their data. The downside of this approach is that ESENT doesn't have a query engine, so you will have to hand-craft your queries as MakeKey/Seek/MoveNext calls (a sketch follows the wrapper link below).

A managed wrapper for ESENT is here:

http://managedesent.codeplex.com/
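To give a feel for the hand-crafted querying, here is a minimal sketch using the managed wrapper. The table, index and column names are invented for the example, and error handling is omitted; it finds persons with Age > 18 via a secondary index:

using System;
using Microsoft.Isam.Esent.Interop;

// Sketch: scan one user's table for persons with age > 18,
// assuming a secondary index named "age" over an Int32 "age" column.
static void PrintAdults(Session session, JET_DBID dbid)
{
    using (var table = new Table(session, dbid, "user42_persons", OpenTableGrbit.ReadOnly))
    {
        JET_COLUMNID nameColumn = Api.GetTableColumnid(session, table, "name");

        Api.JetSetCurrentIndex(session, table, "age");

        // Build a key for age = 18 and seek to the first record greater than it.
        Api.MakeKey(session, table, 18, MakeKeyGrbit.NewKey);
        if (Api.TrySeek(session, table, SeekGrbit.SeekGT))
        {
            do
            {
                Console.WriteLine(Api.RetrieveColumnAsString(session, table, nameColumn));
            }
            while (Api.TryMoveNext(session, table));
        }
    }
}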

Laurion Burchall
  • Wow! Yes, I certainly missed this one in my research. It *seems* too good to be true. I wonder if there is anyone using this to run a web app (other than Exchange)... Hmnnn... – James Jan 29 '10 at 13:27
  • Sorry for the late acceptance - ESENT wins hands down, even with the slightly verbose querying API! – James Feb 01 '10 at 09:29

In an EAV model you don't have to have many joins: you only need the joins required for the query filtering. For the resultset, return the property entries as a separate rowset. That is what we are doing in our EAV implementation.

For example, a query might return persons with extended property 'Age' > 18:

Properties table:

PropertyID Name
1          Age
2          NickName

First resultset:

PersonID Name
1        John
2        Mary

Second resultset:

PersonID PropertyID Value
1        1         24
1        2         'Neo'
2        1         32
2        2         'Pocahontas'

For the first resultset, you need an inner join on the 'Age' extended property to query the basic Person entity part:

select p.ID, p.Name from Persons p
join PersonExtendedProperties pp
on p.ID = pp.PersonID
where pp.PropertyName = 'Age'
and pp.PropertyValue > 18 -- probably need to convert to integer here

For the second resultset, we make an outer join of the first resultset with the PersonExtendedProperties table to get the rest of the extended properties. It's a 'narrow' resultset: we do not pivot the properties in SQL, so we don't need multiple joins here.

We actually use separate value tables for different data types, to avoid type conversion and to keep the extended properties indexed and easily queryable.
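To illustrate the client-side step, here is a minimal sketch of pivoting the narrow second resultset into a DataTable. The (PersonID, PropertyID, Value) shape follows the example above; everything is treated as a string for brevity, whereas real code would drive the column types from per-property metadata:

using System.Collections.Generic;
using System.Data;

// Sketch: pivot narrow EAV rows (PersonID, PropertyID, Value) into one wide
// DataTable, given a PropertyID -> property-name lookup from the Properties table.
static DataTable Pivot(IEnumerable<DataRow> eavRows, IDictionary<int, string> propertyNames)
{
    var result = new DataTable();
    result.Columns.Add("PersonID", typeof(int));
    foreach (string propertyName in propertyNames.Values)
        result.Columns.Add(propertyName, typeof(string));

    var rowsByPerson = new Dictionary<int, DataRow>();
    foreach (DataRow eavRow in eavRows)
    {
        int personId = (int)eavRow["PersonID"];
        DataRow wideRow;
        if (!rowsByPerson.TryGetValue(personId, out wideRow))
        {
            wideRow = result.NewRow();
            wideRow["PersonID"] = personId;
            result.Rows.Add(wideRow);
            rowsByPerson.Add(personId, wideRow);
        }
        wideRow[propertyNames[(int)eavRow["PropertyID"]]] = eavRow["Value"];
    }
    return result;
}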

George Polevoy
  • Very interesting... What RDBMS are you using as a backend? I'm now (time permitting - i.e., on the weekend) going to modify my original test (option 1) to reflect this method (only one join per queried attribute), and pivot the table with the contents of the outer join (with some filtering etc.). I require this final pivot so I can plug any EAV query results into anything that accepts a DataTable... Hmmnnn... If only I had more time... My gut feeling is this method will outperform the original (due to fewer joins), as long as we are dealing with relatively small attribute collections/data. – James Jan 11 '10 at 14:04
  • MS SQL Server. The query itself (without feeding the resultset) actually outperforms some native wide tables in my tests (4 parameters involved in filtering from 20 available). Maybe it's due to the logistics of the 'wide' table's indexes on disk. – George Polevoy Jan 11 '10 at 15:41

For a problem similar to yours, we used the "XML column" approach (the fourth in your survey of methods). But note that many databases (DBMSs) support indexes over XML values.

I recommend using one Person table containing a single XML column alongside the other common columns. In other words, design the Person table with the columns that are common to all person records, and add a single XML column for the dynamic, differing attributes.

We are using Oracle. It supports indexes on its XML type; two kinds are supported: 1) XMLIndex, for indexing elements and attributes within an XML document, and 2) Oracle Text indexes, for full-text search over the text fields of the XML.

For example, in Oracle you can create an index such as:

CREATE INDEX index1 ON table_name (XMLCast(XMLQuery ('$p/PurchaseOrder/Reference' 
  PASSING XML_Column AS "p" RETURNING CONTENT) AS VARCHAR2(128)));

and XML queries are supported in SELECT statements:

SELECT count(*) FROM purchaseorder
  WHERE XMLCast(XMLQuery('$p/PurchaseOrder/Reference'
  PASSING OBJECT_VALUE AS "p" RETURNING CONTENT)
  AS INTEGER) = 25;

As far as I know, other databases such as PostgreSQL and MS SQL Server (but not MySQL) support similar index models for XML values.

See also: http://docs.oracle.com/cd/E11882_01/appdev.112/e23094/xdb_indexing.htm#CHDEADIH
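Since the question mentions MSSQL is available: SQL Server's rough equivalent is a primary XML index plus an exist() predicate in the WHERE clause. A sketch, assuming a hypothetical Attributes XML column shaped like <attrs><attr name="Age" value="24"/></attrs>:

using System.Data.SqlClient;

// Sketch: query a hypothetical XML column on SQL Server.
// One-off setup (run once):
//   CREATE PRIMARY XML INDEX IX_Person_Attributes ON Person(Attributes);
static SqlCommand BuildAgeQuery(SqlConnection connection)
{
    const string sql =
        @"SELECT p.ID, p.Name
          FROM Person p
          WHERE p.Attributes.exist(
              '/attrs/attr[@name=""Age""][number(@value) > 18]') = 1";
    return new SqlCommand(sql, connection);
}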

Radim Köhler

My recommendation:

Allow properties to be marked as indexable. Have a smallish hard limit on the number of indexable properties, and on columns per object. Have a large hard limit on the total number of column types across all objects.

Implement indexes as separate tables (one per index), joined to the main data table (the main table has a large unique key for the object). Index tables can then be created/dropped as required.

Serialize the data, including the index columns, and additionally put the indexed properties in first-class relational columns in their dedicated index tables. Use JSON instead of XML to save space in the table. Enforce a short-column-name policy (or a long-display-name, short-stored-name policy) to save space and increase performance.

Use quarks for field identifiers (but only in the main engine, to save RAM and speed up some read operations - don't rely on quark pointer comparison in all cases).
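A minimal sketch of the serialize-plus-index-tables combination described above. All table and property names are invented, and JavaScriptSerializer (System.Web.Extensions, .NET 3.5) stands in for whichever JSON serializer you prefer:

using System;
using System.Collections.Generic;
using System.Data.SqlClient;
using System.Web.Script.Serialization;

// Sketch: store the whole property bag as JSON in the main table, and copy
// the properties the user marked as indexable into dedicated index tables.
static void SavePerson(SqlConnection connection, Guid personId, IDictionary<string, object> properties)
{
    string json = new JavaScriptSerializer().Serialize(properties);

    using (var insertMain = new SqlCommand(
        "INSERT INTO Person (ID, Data) VALUES (@id, @data)", connection))
    {
        insertMain.Parameters.AddWithValue("@id", personId);
        insertMain.Parameters.AddWithValue("@data", json);
        insertMain.ExecuteNonQuery();
    }

    object age;
    if (properties.TryGetValue("age", out age)) // 'age' marked indexable by this user
    {
        using (var insertIndex = new SqlCommand(
            "INSERT INTO IX_Person_Age (PersonID, Age) VALUES (@id, @age)", connection))
        {
            insertIndex.Parameters.AddWithValue("@id", personId);
            insertIndex.Parameters.AddWithValue("@age", age);
            insertIndex.ExecuteNonQuery();
        }
    }
}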

My thought on your options:

1 is possible. Performance will clearly be lower than if the field ID columns were not stored.

2 is a no in general: DB engines are not all happy about dynamic schema changes. But a possible yes if your DB engine is good at this.

3 is possible.

4 is a yes, though I'd use JSON.

5 seems like 4, only less optimized?

6 sounds good; I would go with it if you're happy to try something new and happy about the reliability and performance, but usually you would want to go with more mainstream technology. I'd also like to reduce the number of engines involved in coordinating a transaction to fewer than would be the case here.

Edit: But of course, though I've recommended something, there can be no general right answer here - profile the various data models and approaches with your data to see what runs best for your application.

Edit: Changed last edit wording.

martinr
  • Hi martinr, thank you! Great points, especially re: #6 and the number of engines... I'm unsure about 2 items - could you please elaborate? Re: 1: "Performance clearly will be lower than if field ID columns not stored." I'm unclear as to what you mean - could you please expand? Re: "Implement indexes as separate tables". If I understand correctly, does this mean that for each of the 1000 'users' of the system, let's say they each have/use 10 indexable columns, there'd be 10k 'index' tables in the database? Or have I completely misunderstood? – James Jan 10 '10 at 15:18
  • RE: #1, I assume that your property bag would have KEY and VALUE columns. All I am saying in my comment on 1 is that it's probably better not to store KEY. RE: "10k index tables?" I'm not sure I believe you'd have 10k unique properties for a person!! (more like 2000 tops), but theoretically, if you do start to have large numbers of tables, you could have N standard numeric indexes and M standard string indexes available, the user maps his properties onto those, and you partition the indexes a bit among the user population. I concede constant schema changes on 10k tables sounds a bit much. – martinr Jan 10 '10 at 15:41
  • I mean by "better not to store KEY" to use another solution (not property bag) where you don't have a SQL KEY field. – martinr Jan 10 '10 at 15:42
  • Smashing - makes perfect sense. Thank you for the clarification! – James Jan 10 '10 at 15:48
  • :-) Ideally schema changes are not going on frequently. Ideally all schemas are specified once. But we are talking about a system where the user schema can change if it needs to. It may make sense to put all indexes on the one main table, with a USERID. Mapping between index fields and user fields would then be in the app code. Maybe the one main table is broken down into several tables based on USERID value. Great question James. – martinr Jan 10 '10 at 15:50

Assuming you can place a limit, N, on how many custom attributes each user can define, just add N extra columns to the Person table. Then have a separate table where you store per-user metadata describing how to interpret the contents of those columns for each user. Similar to option 1 once you've read in the data, but no joins are needed to pull in the custom attributes.
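A small sketch of the metadata-driven read this implies (all names invented): the per-user metadata says what each generic column means, and you apply that mapping after fetching the row:

using System.Collections.Generic;
using System.Data;

// Sketch: per-user metadata maps generic columns (Custom1..CustomN)
// back to that user's attribute names, e.g. "Custom1" -> "FavouriteColour".
static IDictionary<string, object> ReadCustomAttributes(
    DataRow personRow, IDictionary<string, string> userColumnMap)
{
    var attributes = new Dictionary<string, object>();
    foreach (KeyValuePair<string, string> mapping in userColumnMap)
        attributes[mapping.Value] = personRow[mapping.Key];
    return attributes;
}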

rwhit
  • Sounds good - however, won't this limit objects to 'single attributes'? If I wanted to store, say, a collection of top-10 books for a person, then in the hybrid EAV model I could set multiple "favourite_book" attributes and query like: "WHERE pp1.PropertyName = 'favourite_book' AND pp1.PropertyValue = 'catch22' AND pp2.PropertyName = 'favourite_book' AND pp2.PropertyValue = 'bible'". If I set a limit on the number of custom attributes (using table columns), I wouldn't be able to store this data. (I could, but I would run out of columns, and queries would be hard to generate dynamically.) – James Jan 11 '10 at 13:50