Need an efficient way to store/query json in a SQL database

Question

I'm implementing a service where each user must have their own JSON document database. Beyond letting users query JSON documents by example, the database must also support ACID transactions involving multiple documents, so I have ruled out Couch/Mongo and other NoSQL databases (I can't use RavenDB since the service must run on Unix systems).

With that in mind, I've been trying to figure out a way to implement this on top of a SQL database. Here's what I have come up with so far:

CREATE TABLE documents (
  id INTEGER PRIMARY KEY,
  doc TEXT
);

CREATE TABLE indexes (
  id INTEGER PRIMARY KEY,
  property TEXT,
  value TEXT,
  document_id INTEGER
);

Each user would have a database with these two tables, and would have to declare which fields they need to query so the system can populate the 'indexes' table. So if user 'A' configures their account to enable queries by 'name' and 'age', every time that user inserts a document that has a 'name' or 'age' property, the system would also insert a record into the 'indexes' table, where the 'property' column would contain name/age, 'value' would contain the property value, and 'document_id' would point to the corresponding document.

For example, let's say the user inserts the following doc:

'{"name": "Foo", "age": 43}'

This would result in an insert into the 'documents' table and two more inserts into the 'indexes' table:

INSERT INTO documents (id, doc) VALUES (1, '{"name": "Foo", "age": 43}');
INSERT INTO indexes (property, value, document_id) VALUES ('name', 'Foo', 1);
INSERT INTO indexes (property, value, document_id) VALUES ('age', '43', 1);
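
A minimal sketch of that insert path, using Python's built-in sqlite3 module purely for illustration. The helper name insert_document and its indexed_properties parameter are hypothetical, and storing every value through str() is an assumption about how non-string values would be indexed.

```python
import json
import sqlite3

def insert_document(conn, doc, indexed_properties):
    """Store a JSON document and index the declared properties (hypothetical helper)."""
    with conn:  # one transaction: the document and its index rows commit together
        cur = conn.execute(
            "INSERT INTO documents (doc) VALUES (?)", (json.dumps(doc),)
        )
        doc_id = cur.lastrowid
        for prop in indexed_properties:
            if prop in doc:
                # Assumption: values are indexed as text, matching the TEXT column.
                conn.execute(
                    "INSERT INTO indexes (property, value, document_id) VALUES (?, ?, ?)",
                    (prop, str(doc[prop]), doc_id),
                )
    return doc_id

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE documents (id INTEGER PRIMARY KEY, doc TEXT);
    CREATE TABLE indexes (
        id INTEGER PRIMARY KEY, property TEXT, value TEXT, document_id INTEGER
    );
""")

doc_id = insert_document(conn, {"name": "Foo", "age": 43}, ["name", "age"])
rows = conn.execute(
    "SELECT property, value, document_id FROM indexes ORDER BY id"
).fetchall()
```

Wrapping the inserts in a single transaction keeps the 'indexes' rows from drifting out of sync with the document they describe.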

Then let's say that user 'A' sent the service the following query:

'{"name": "Foo", "age": 43}' //(the queries are also json documents).

This query would be translated to the following SQL:

SELECT doc FROM documents
WHERE id IN (SELECT document_id FROM indexes
             WHERE document_id IN (SELECT document_id FROM indexes
                                   WHERE property = 'name' AND value = 'Foo')
             AND property = 'age' AND value = '43')
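
To make the growth in nesting concrete, here is a sketch that builds the same nested query for an arbitrary list of conditions and runs it against SQLite. The helper build_nested_query is hypothetical, and it assumes at least one condition and text-valued index entries.

```python
import sqlite3

def build_nested_query(conditions):
    """Build the nested-IN query for (property, value) pairs (hypothetical helper)."""
    if not conditions:
        raise ValueError("at least one condition is required")
    inner, params = None, []
    for prop, value in conditions:
        clause = "SELECT document_id FROM indexes WHERE property = ? AND value = ?"
        level_params = [prop, str(value)]
        if inner is not None:
            # Each extra condition wraps the previous query one level deeper.
            clause += " AND document_id IN (" + inner + ")"
            level_params += params
        inner, params = clause, level_params
    return "SELECT doc FROM documents WHERE id IN (" + inner + ")", params

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE documents (id INTEGER PRIMARY KEY, doc TEXT);
    CREATE TABLE indexes (
        id INTEGER PRIMARY KEY, property TEXT, value TEXT, document_id INTEGER
    );
    INSERT INTO documents (id, doc) VALUES (1, '{"name": "Foo", "age": 43}');
    INSERT INTO documents (id, doc) VALUES (2, '{"name": "Foo", "age": 30}');
    INSERT INTO indexes (property, value, document_id) VALUES ('name', 'Foo', 1);
    INSERT INTO indexes (property, value, document_id) VALUES ('age', '43', 1);
    INSERT INTO indexes (property, value, document_id) VALUES ('name', 'Foo', 2);
    INSERT INTO indexes (property, value, document_id) VALUES ('age', '30', 2);
""")

sql, params = build_nested_query([("name", "Foo"), ("age", 43)])
docs = [row[0] for row in conn.execute(sql, params)]
```

With n conditions the generated statement contains n + 1 SELECTs, so a 30-condition query would nest 30 subqueries.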

My questions:

1. Knowing that the user may use a large number of conditions in a query (let's say 20-30 AND conditions), which would make the subquery nesting very deep, how efficient would the above SELECT query be on most database systems (postgres, mysql...)?
2. Is the above solution viable for a database that will eventually contain millions/billions of json documents?
3. Is there a better way to meet my requirements?
4. Is there a scalable document database that can do ACID transactions involving multiple documents and runs on Unix systems?

PostgreSQL 9.2 will support a JSON data type and with some functions (e.g. written in JavaScript) the above should be possible. See here for an example: http://people.planetpostgresql.org/andrew/index.php?/archives/249-Using-PLV8-to-index-JSON.html — , Jun 25 '12 at 15:46
See if CouchDB will work for you: "CouchDB provides ACID semantics. It does this by implementing a form of Multi-Version Concurrency Control, meaning that CouchDB can handle a high volume of concurrent readers and writers without conflict." — Void Ray, Jun 25 '12 at 15:58
Interesting tip about PostgreSQL, I will check it out, thanks — Thiago Padilha, Jun 25 '12 at 17:52

score 5 · Accepted Answer · answered Jun 25 '12 at 16:00

Your indexes table is what is known as Entity-Attribute-Value (EAV).

EAV tables are fine for storing information and recalling it when you know the entity. (In your case, finding all the indexes rows when you know the document_id.)

But they are terrible the other way around: Supplying Attribute-Value combinations to search for an Entity. Which is exactly what you have in your final query. As more and more entities share the same attribute-value combinations (such as name=foo) the query performance degrades.

So, to answer your first two questions:
1. The query, as written, requires n sub-queries when searching for n properties. This will scale very poorly as n grows.
2. As the number of records grows it will degrade, especially with millions/billions of records.

In general, if you read about EAV, people strongly recommend shying away from it.

And, worse still, there isn't really a good alternative within SQL. The standard way to optimise a search is with an index, which can easily be modelled as a sorted data-set. But you would then need many indexes:
- An index on (fieldX, fieldY, fieldZ) is great if you search on all three columns.
- But it sucks if you have to search on just fieldZ.
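
The composite-index point can be checked with a query planner. The sketch below uses SQLite's EXPLAIN QUERY PLAN through Python's sqlite3 module; the table name people and index name idx_xyz are made up for illustration, and other database systems may plan these queries differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical traditional table with one composite index.
    CREATE TABLE people (
        id INTEGER PRIMARY KEY, fieldX TEXT, fieldY TEXT, fieldZ TEXT, other TEXT
    );
    CREATE INDEX idx_xyz ON people (fieldX, fieldY, fieldZ);
""")

def plan(sql):
    """Return the 'detail' column of EXPLAIN QUERY PLAN as a single string."""
    return " | ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# All three indexed columns constrained: the planner can seek into the index.
all_three = plan(
    "SELECT * FROM people WHERE fieldX = 'a' AND fieldY = 'b' AND fieldZ = 'c'"
)
# Only the trailing column constrained: the index's sort order does not help.
only_z = plan("SELECT * FROM people WHERE fieldZ = 'c'")

print(all_three)
print(only_z)
```

In SQLite the first plan is reported as a SEARCH using idx_xyz, while the second falls back to a SCAN of the table.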

If you can re-model this with a traditional table, with a fixed number of columns, and have the space to apply every index combination you would ever need, that would be your most performant model.

If you can't fix the number of columns (new properties coming along all the time) and/or you don't have space for all the different combinations of index, you seem to be stuck with EAV. Which will work, but it will not scale very well in terms of 'instantaneous' results.

NOTE: If you do stick with EAV, have you tested this query structure?

  SELECT
    document_id
  FROM
    indexes
  WHERE
       (property = 'name' AND value = 'Foo')
    OR (property = 'age'  AND value = '43' )
  GROUP BY
    document_id
  HAVING
    COUNT(*) = 2

This assumes that (document_id, property, value) is unique. Otherwise one document could have ('name', 'Foo') twice, and so satisfy the HAVING COUNT(*) = 2 check without matching on age at all.
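
As a sketch of how that flat query could be generated for any number of conditions, the hypothetical helper build_group_query below emits one OR branch per condition and compares COUNT(*) against the number of conditions. It assumes the conditions are distinct, and the UNIQUE constraint in the schema is an addition that enforces the uniqueness assumption above.

```python
import sqlite3

def build_group_query(conditions):
    """Build the GROUP BY / HAVING query for (property, value) pairs (hypothetical helper)."""
    if not conditions:
        raise ValueError("at least one condition is required")
    where = " OR ".join("(property = ? AND value = ?)" for _ in conditions)
    params = [p for prop, value in conditions for p in (prop, str(value))]
    sql = (
        "SELECT document_id FROM indexes WHERE " + where
        + " GROUP BY document_id HAVING COUNT(*) = ?"
    )
    # A document matches only if every condition contributed exactly one row.
    return sql, params + [len(conditions)]

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE indexes (
        id INTEGER PRIMARY KEY, property TEXT, value TEXT, document_id INTEGER,
        UNIQUE (document_id, property, value)  -- enforces the stated assumption
    );
    INSERT INTO indexes (property, value, document_id) VALUES ('name', 'Foo', 1);
    INSERT INTO indexes (property, value, document_id) VALUES ('age', '43', 1);
    INSERT INTO indexes (property, value, document_id) VALUES ('name', 'Foo', 2);
    INSERT INTO indexes (property, value, document_id) VALUES ('age', '30', 2);
    INSERT INTO indexes (property, value, document_id) VALUES ('name', 'Bar', 3);
    INSERT INTO indexes (property, value, document_id) VALUES ('age', '43', 3);
""")

sql, params = build_group_query([("name", "Foo"), ("age", 43)])
matches = [row[0] for row in conn.execute(sql, params)]
```

Unlike the nested form, the statement stays one level deep no matter how many conditions are supplied; only the OR list and the expected count grow.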

I don't think the 'indexes' table is modeling data using the 'Entity-Attribute-Value' method; it is just a way to 'manually' index schemaless data in the 'documents' table. I forgot to mention that the name and value columns will also be indexed. Don't you think that will make the queries run fast? — Thiago Padilha, Jun 25 '12 at 17:57
@ThiadodeArruda - Unfortunately, it's exactly EAV. Your `Documents` are the `Entities`. Your `Properties` are the `Attributes`. And your `Values` are the, well, I think you get that point. Indexing `(property, value, document_id)` will certainly improve things compared to not doing it, but that's a minimum working assumption. You still have all the difficulties of EAV. It will always be significantly slower than a 'traditional' table. And the more records that share the same value for any given property, the slower it will get. And the more properties you search on, the slower still. — MatBailie, Jun 25 '12 at 19:11
