I'm implementing a service where each user must have his own json/document database. Beyond letting the user to query json documents by example, the database must also support ACID transactions involving multiple documents, so I have discarded using Couch/Mongo or other NoSQL databases(can't use RavenDB since it must run on Unix systems).
With that in mind I've been trying to figure a way to implement that on top of a SQL database. Here's what I have came up with so far:
CREATE TABLE documents (
id INTEGER PRIMARY KEY,
doc TEXT
);
CREATE TABLE indexes (
id INTEGER PRIMARY KEY,
property TEXT,
value TEXT,
document_id INTEGER
)
Each user would have a database with these two tables, and the user would have to declare which fields he needed to query so the system could properly populate the 'Indexes' table. So if user 'A' configures his account to enable queries by 'name' and 'age', everytime that user inserts a document that has a 'name' or 'age' property the system would also insert a record to the 'indexes' table, where the 'property' column would contain name/age , 'value' would contain the property value and 'document_id' would point to the corresponding document.
For example, let's say the user inserts the following doc:
'{"name" : "Foo", "age" 43}'
This would result in a insert to the 'documents' table and two more inserts to the 'indexes' table:
INSERT INTO documents (id,doc) VALUES (1, '{"name" : "Foo", "age" 43}');
INSERT INTO indexes (property, value, document_id) VALUES ('name', 'foo', 1);
INSERT INTO indexes (property, value, document_id) VALUES ('age', '43', 1);
Then let's say that user 'A' sent the service the following query:
'{"name": "Foo", "age": 43}' //(the queries are also json documents).
This query would be translated to the following SQL:
SELECT doc FROM documents
WHERE id IN (SELECT document_id FROM indexes
WHERE document_id IN (SELECT document_id FROM indexes
WHERE property = 'name' AND value = 'Foo')
AND property = 'age' AND value = '43')
My questions:
- Knowing that the user may be able to use a high number of conditions in his queries(lets say 20-30 AND conditions), which would cause the subquery nesting be very high, how efficient would the above SELECT query be on most database systems(postgres, mysql...)?
- Is the above solution viable for a database that will eventually contain millions/billions of json documents?
- Is there a better way to meet my requirements?
- Is there scalable document database that can do ACID transactions involving multiple documents and runs on Unix systems?