
I'm developing document management software and I'm evaluating a NoSQL database for storing and searching data.

In summary, the software acts like a file system in which items are organized in directories and subdirectories.

Each item in the tree can have n properties used for filtering and sorting.

Items can also be connected to each other by other kinds of relations (besides parent-child).

The item count could be relatively large (some millions), and the killer feature of the application has to be constant performance when retrieving data (with filters and sorting by properties), independent of database growth.

I need 3 key features:

  • Get the direct children of a folder. The result must be pageable, sortable and filterable by each document property

  • Get all children of a folder (all items of the subtree). The result must be pageable, sortable and filterable by each document property

  • Get all parents of a folder

I'm a newbie in NoSQL and currently use an RDBMS (SQL Server), but I've hit performance issues and all the limits caused by a fixed schema for document properties. I'm evaluating ArangoDB or OrientDB because I think their features (document-oriented and graph-oriented) could be the best fit for my design needs.

Can you help me with a suggestion for designing the database and the queries for these 3 tasks?

N.B. I need the query result to be a dataset with one column per property:

E.g. doc1: p1: v1, p2: v2
     doc2: p1: v1, p3: v3

result:
    name | p1 | p2 | p3
    doc1   v1   v2   null
    doc2   v1   null v3
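
A minimal sketch of that reshaping done on the client side in Python, assuming each item arrives as a dictionary with a `properties` list of name/value pairs. The function name `pivot_properties` and the sample data are hypothetical, and a database could equally do this work in the query itself.

```python
def pivot_properties(items):
    """Turn items with variable property lists into rows with one column per property."""
    # Collect every property name, keeping first-seen order.
    columns = []
    for item in items:
        for prop in item["properties"]:
            if prop["name"] not in columns:
                columns.append(prop["name"])

    rows = []
    for item in items:
        values = {prop["name"]: prop["value"] for prop in item["properties"]}
        row = {"name": item["name"]}
        for col in columns:
            row[col] = values.get(col)  # None plays the role of null
        rows.append(row)
    return columns, rows


# Hypothetical sample data mirroring doc1 and doc2 above.
items = [
    {"name": "doc1", "properties": [{"name": "p1", "value": "v1"},
                                    {"name": "p2", "value": "v2"}]},
    {"name": "doc2", "properties": [{"name": "p1", "value": "v1"},
                                    {"name": "p3", "value": "v3"}]},
]
columns, rows = pivot_properties(items)
```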

I'm thinking of designing an item as:

{
  "_id": "_myItemId",
  "name": "Item1",
  "itemType": "root / folder / file",
  "parentItemId": "",
  "properties": [
    { "name": "Property1", "formatType": 0, "formatMask": "", "value": "Value1" },
    { "name": "Property2", "formatType": 0, "formatMask": "", "value": "Value2" },
    { "name": "Property3", "formatType": 0, "formatMask": "", "value": "Value3" }
  ]
}

Do you have any suggestions for a design that can support the 3 key features described above?

Thanks

Claudio
  • Did you see this post at O'Reilly Radar? http://radar.oreilly.com/2015/07/data-modeling-with-multi-model-databases.html – weinberger Oct 11 '15 at 19:01
  • I have implemented 2 collections in ArangoDB: Items (a document collection that contains information about folders and files) and ItemsParents (an edge collection that contains the parent-child relations of Items). I'm disappointed with the performance, so surely I'm doing something wrong... I have inserted about 1 million Items and the performance of a simple item count is terrible: it takes 30/40 minutes. The AQL query is: LET u = ( FOR item IN Items RETURN item._key ) RETURN LENGTH(u). What am I doing wrong? – Claudio Oct 13 '15 at 17:04
  • If you only want to count the items and don't need their actual content, it makes more sense to return **NULL** instead of **item._key**: the members don't need to be accessed and much less data is moved. – dothebart Oct 19 '15 at 15:11
  • I'm working on a similar project. Did you test the performance impact of the answer below? @Claudio – Yahia Reyhani Jul 24 '16 at 07:33

1 Answer


The approach with graph databases is very different from other kinds of DBMS. You "connect" your entities (vertices) using edges, each a direct link between one entity and another. So, first of all, you don't need to store e.g. the "parentItemId" on each object as you would in a SQL or document database. Instead, each entity holds only its own data, and relationships are handled by the edges you create between entities.

OrientDB has very good documentation and some examples to start understanding the concepts. E.g. the tutorial page http://orientdb.com/docs/2.1/Tutorial-Working-with-graphs.html explains graph concepts and has some good examples.

In your specific case, you could have two entity types (vertex classes), Folder and Document, and an edge class that you call e.g. "ChildOf" (from Document to Folder) or "Contains" (from Folder to Document). Then there are many queries you can run to find relationships, even specifying the level of nesting.

You can create a working schema in the following steps:

1 Create vertex classes and edge types:

CREATE CLASS Document Extends V
CREATE CLASS Folder Extends V
CREATE CLASS ChildOf Extends E

2 Insert some documents

INSERT INTO Document SET Title = 'Document 1', Name = '..'
INSERT INTO Document SET Title = 'Document 2', Name = '..'
INSERT INTO Document SET Title = 'Document 3', Name = '..'

3 Insert Folders

INSERT INTO Folder SET Name = 'Folder 1'
INSERT INTO Folder SET Name = 'Folder 2'

4 Create edges (relationships) between vertices

CREATE EDGE ChildOf FROM #<specify document rid here> TO #<specify folder rid here>
...

You can also make a folder a child of another folder by creating the same "ChildOf" edge between the two folders:

CREATE EDGE ChildOf FROM #<specify child folder rid here> TO #<specify parent folder rid here>
...

5 Query your graph. Get the direct children of a folder, using the expand() and in() functions:

Select expand(in('ChildOf')) From #<folder rid> Where ...
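
As a rough illustration of what this query has to do, here is a small in-memory Python sketch: follow the ChildOf edges that point at the folder (the equivalent of in('ChildOf')), then filter, sort and page the result. The data, the `Size` property and the function `direct_children` are hypothetical stand-ins, not OrientDB API.

```python
# Hypothetical in-memory stand-in for the graph.
# Each ChildOf edge is stored as (child, parent), matching step 4.
child_of = [
    ("doc1", "folderA"), ("doc2", "folderA"), ("doc3", "folderA"),
    ("folderB", "folderA"), ("doc4", "folderB"),
]
items = {
    "folderA": {"Name": "Folder A", "Size": 0},
    "folderB": {"Name": "Folder B", "Size": 0},
    "doc1": {"Name": "Doc 1", "Size": 30},
    "doc2": {"Name": "Doc 2", "Size": 10},
    "doc3": {"Name": "Doc 3", "Size": 20},
    "doc4": {"Name": "Doc 4", "Size": 40},
}


def direct_children(folder, predicate=lambda item: True,
                    sort_key="Name", skip=0, limit=10):
    """Return one page of the folder's direct children."""
    # Equivalent of in('ChildOf'): follow edges that point at the folder.
    children = [items[child] for child, parent in child_of if parent == folder]
    matching = [item for item in children if predicate(item)]
    matching.sort(key=lambda item: item[sort_key])
    return matching[skip:skip + limit]


# Second page (page size 2) of the documents in folderA, sorted by size.
page = direct_children("folderA", predicate=lambda i: i["Size"] > 0,
                       sort_key="Size", skip=2, limit=2)
```

In a real database the filter, sort and paging would be pushed into the query so that only one page travels to the client.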

Get all children of a folder (the whole subtree), using a TRAVERSE query that follows incoming ChildOf edges from the starting folder:

SELECT FROM (
     TRAVERSE in('ChildOf') FROM #<folder rid> WHILE $depth <= 3 //you can specify the maximum level of nesting
) WHERE $depth > 0 //exclude the first element (the starting folder itself)

Get all parents of a folder, using TRAVERSE and the out() graph function, since ChildOf edges point from child to parent:

SELECT FROM (
    TRAVERSE out('ChildOf') FROM #<folder rid>
) WHERE $depth > 0 //exclude the first element (the starting folder itself)
  AND @class = 'Folder' //here you could filter only the "Folders"
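
To make the edge direction concrete, here is a minimal Python sketch of both traversals over an in-memory tree. With ChildOf edges created from child to parent (step 4), descendants are reached through incoming edges and ancestors through outgoing ones. It assumes each item has at most one parent and that there are no cycles. The names `parent_of`, `descendants` and `ancestors` are hypothetical.

```python
from collections import deque

# Hypothetical tree: one ChildOf edge per entry, stored as child -> parent.
parent_of = {"sub": "root", "subsub": "sub", "doc1": "sub", "doc2": "subsub"}


def descendants(start, max_depth=None):
    """Breadth-first walk over incoming ChildOf edges, returning (item, depth) pairs."""
    children = {}
    for child, parent in parent_of.items():
        children.setdefault(parent, []).append(child)

    result = []
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth > 0:  # exclude the starting folder itself
            result.append((node, depth))
        if max_depth is None or depth < max_depth:
            for child in children.get(node, []):
                queue.append((child, depth + 1))
    return result


def ancestors(start):
    """Walk outgoing ChildOf edges up to the root, nearest parent first."""
    result = []
    node = start
    while node in parent_of:
        node = parent_of[node]
        result.append(node)
    return result


subtree = descendants("root", max_depth=2)
parents = ancestors("doc2")
```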
Stefano Castriotta
  • Thanks Stefano, I'm currently testing ArangoDB, but the next step is to evaluate OrientDB. – Claudio Oct 13 '15 at 16:55
  • The concepts are the same. ArangoDB and OrientDB are similar (ArangoDB has its own query language, while OrientDB has a SQL-like language), so you can use whichever you like more (I'm not promoting either of them :)) – Stefano Castriotta Oct 13 '15 at 20:08
  • I'm working on a similar project. Did you test the performance of this answer? @claudio – Yahia Reyhani Jul 23 '16 at 23:05