13

Say instead of documents I have small trees that I need to store in a Lucene index. How do I go about doing that?

An example node in the tree:

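import java.util.List;

class Node
{
    String data;          // space-separated words; must be full-text searchable
    String type;          // a single word
    List<Node> children;  // child nodes; empty for a leaf
}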

In the above node, the "data" member variable is a space-separated string of words, so it needs to be full-text searchable. The "type" member variable is just a single word.

The search query will be a tree itself and will search both the data and type in each node and also the structure of the tree for a match. Before matching against a child node, the query must first match the parent node data and type. Approximate matching on the data value is acceptable.
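To make the intended matching rule concrete, here is a minimal plain-Java sketch of it, with no Lucene involved. The class and method names are hypothetical, and "approximate" matching is assumed here to mean that every word of the query node's data appears in the candidate node's data; a real index would use its own scoring instead.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class TreeMatchSketch {
    // Same shape as the Node in the question, plus a convenience constructor.
    static class Node {
        final String data;
        final String type;
        final List<Node> children;

        Node(String data, String type, Node... children) {
            this.data = data;
            this.type = type;
            this.children = Arrays.asList(children);
        }
    }

    // Splits the data string into a set of lower-cased words.
    static Set<String> words(String data) {
        String trimmed = data.trim();
        if (trimmed.isEmpty()) {
            return new HashSet<>();
        }
        return new HashSet<>(Arrays.asList(trimmed.toLowerCase(Locale.ROOT).split("\\s+")));
    }

    // Assumption: a node matches when the types are equal and every query word
    // occurs in the candidate's data.
    static boolean nodeMatches(Node query, Node candidate) {
        return query.type.equals(candidate.type)
                && words(candidate.data).containsAll(words(query.data));
    }

    // The parent must match before any child is considered; each query child
    // must then match at least one child of the candidate.
    static boolean treeMatches(Node query, Node candidate) {
        if (!nodeMatches(query, candidate)) {
            return false;
        }
        for (Node queryChild : query.children) {
            boolean found = false;
            for (Node candidateChild : candidate.children) {
                if (treeMatches(queryChild, candidateChild)) {
                    found = true;
                    break;
                }
            }
            if (!found) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Node stored = new Node("bar baz", "A", new Node("foo one", "B"));
        Node query = new Node("bar", "A", new Node("foo", "B"));
        System.out.println(treeMatches(query, stored));
    }
}
```

This only pins down the semantics I am after; it says nothing about how to make the lookup fast over a very large collection, which is the actual question.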

What's the best way to index this kind of data? If Lucene does not directly support indexing these data then can this be done by Solr or Elasticsearch?

I took a quick look at neo4j, but it seems to store an entire graph in the db, not a large collection (say billions or trillions) of small tree structures. Or is my understanding wrong?

Also, is a non-Lucene-based NoSQL solution better suited for this?

Golam Kawsar
  • What are you looking to find when you search? If you have NodeB as a child of NodeA, and NodeB has text FOO, when searching for FOO, do you want to return NodeB, or NodeA? – sbridges Apr 02 '12 at 02:32
  • Queries will be matched against tree structure and tree data. So if the data in NodeA has already been matched then the occurrence of FOO in NodeB will constitute a complete match. – Golam Kawsar Apr 02 '12 at 02:40
  • Are you saying FOO must be in NodeA and NodeB? Or that type must match in NodeA, but you don't care if type matches in NodeB? – sbridges Apr 02 '12 at 02:44
  • FOO will never be searched in isolation. The query itself will be a tree! So, we might search for a tree that has NodeA.data = "BAR" and its child NodeB.data = "FOO". A successful match will be all trees whose first Node matches NodeA (both data and type) and whose child node matches NodeB (both type and data). Approximate matches on the data value are acceptable. – Golam Kawsar Apr 02 '12 at 02:48
  • something like neo4j would probably be better – sbridges Apr 02 '12 at 03:59
  • I'm going to take a guess that CouchDB or MongoDB would probably be a better fit for you. It's unclear if you are trying to represent graphs (nodes are reused for other trees) or true trees where the nodes are not reused. – Adam Gent Apr 02 '12 at 04:22
  • This may be interesting as well, http://renaud.delbru.fr/doc/pub/eswc2010-siren.pdf – sbridges Apr 02 '12 at 04:23
  • Neo4j has lucene indexing and a query language that can walk graphs "smartly". Riak has MapReduce and **links**, which can traverse graphs by following them. Riak can better support your "billions" :-) Mongo has MapReduce. – Jesvin Jose Apr 02 '12 at 05:21
  • Thanks everyone for your suggestions! – Golam Kawsar Apr 03 '12 at 03:07

4 Answers

11

Another approach is to store a representation of the current node's location in the tree. For example, the 17th leaf of the 3rd 2nd-level node of the 1st 1st-level node of the 14th tree would be represented as 014.001.003.017.

Assuming 'treepath' is the field name of the tree location, you would query on 'treepath:014*' to find all nodes and leaves in the 14th tree. Similarly, to find all of the children of the 14th tree you would query on 'treepath:014.*'.

The major problem with this approach is that moving branches around requires re-ordering every branch after the branch that was moved. If your trees are relatively static, that may only be a minor problem in practice.

(I've seen this approach called either a 'path enumeration' or a 'Dewey Decimal' representation.)
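As a rough illustration of the scheme, the following plain-Java sketch assigns a zero-padded path to every node of one tree and then filters by prefix. The class and method names are hypothetical, the three-digit padding is just the width used in the example above, and the prefix check is only a stand-in for a prefix query on the 'treepath' field; in an index, each node would be stored as its own document with 'treepath', 'data' and 'type' fields.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class TreePathSketch {
    // Node shape from the question; children start out empty.
    static class Node {
        String data;
        String type;
        List<Node> children = new ArrayList<>();

        Node(String data, String type) {
            this.data = data;
            this.type = type;
        }
    }

    // Zero-pads an index to three digits so that string prefixes line up
    // with tree levels (assumes fewer than 1000 siblings per level).
    static String segment(int index) {
        return String.format(Locale.ROOT, "%03d", index);
    }

    // Depth-first walk that records the path of every node, parent first.
    static void assignPaths(Node node, String path, List<String> out) {
        out.add(path);
        for (int i = 0; i < node.children.size(); i++) {
            assignPaths(node.children.get(i), path + "." + segment(i + 1), out);
        }
    }

    // Stand-in for a prefix query such as treepath:014.*
    static boolean matchesPrefix(String treepath, String prefix) {
        return treepath.startsWith(prefix);
    }

    public static void main(String[] args) {
        Node root = new Node("bar baz", "A");
        Node child = new Node("foo", "B");
        child.children.add(new Node("qux", "C"));
        root.children.add(child);

        List<String> paths = new ArrayList<>();
        assignPaths(root, segment(14), paths); // treat this as the 14th tree
        for (String path : paths) {
            // The prefix "014." selects descendants only, not the root itself.
            System.out.println(path + " descendant=" + matchesPrefix(path, "014."));
        }
    }
}
```

Because a parent's path is always a prefix of its children's paths, a structural query can first match the parent and then restrict the child lookup to that parent's prefix.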

Mark Leighton Fisher
3

This requirement and its solution are captured here: Proposal for nested docs

This design was subsequently implemented by both core Lucene and Elasticsearch. BlockJoinQuery is the core Lucene implementation, and Elasticsearch appears to have an implementation as outlined here: Elastic search nested docs

MarkH
2

I suggest Neo4j. A tree is, after all, just a special, constrained graph.

Check out this great discussion on whether you should store a tree in Neo4j:

http://www.mail-archive.com/user@lists.neo4j.org/msg03256.html

Marko Bonaci
  • Thanks for your answer, but your link is broken. Also, does Neo4j allow storing billions (or trillions) of small trees to be indexed? I want to be able to search for trees including their structure and the text stored in the nodes. – Golam Kawsar Apr 03 '12 at 14:01
  • Link is not broken, I've just checked. – Marko Bonaci Apr 04 '12 at 10:11
  • Here are a few more places where you can find that discussion thread: http://lists.neo4j.org/pipermail/user/2010-April/003313.html http://neo4j.org/nabble/#nabble-td700300 – Marko Bonaci Apr 04 '12 at 10:16
  • There is loads of stuff concerning tree structures in the Neo4j user group: https://groups.google.com/forum/?fromgroups#!searchin/neo4j/tree – Marko Bonaci Apr 04 '12 at 10:29
  • Thanks mbonaci. The link was not working when I first tried (I tried a few times). I will check the links you pointed to. Thanks! – Golam Kawsar Apr 04 '12 at 13:05
0

There is a project, SIREn (http://rdelbru.github.io/SIREn), which deals with addressing 'in-depth' trees. Internally it uses Dewey numbering (http://www.ipl.org/div/farq/deweyFARQ.html).

maryoush