Best way of searching files for content with neo4j graph

Question

I have an application that (among other things) stores a file system tree in a neo4j graph: each directory and each file is a node. Some of these files are Office documents, text files or PDFs, and I would like to provide some search functionality.

Search functionality should scan node properties and file content and return the most relevant nodes.

--------------------------------------------------

Update with extra information:

The graph is used to narrow the search down to a subset of files. File nodes also contain custom metadata that needs to be searched. One of many use cases is:

A user searches for a "term" > the graph is used to find the files this search applies to (depending on user groups and rights, for example) > both the node properties and the file content are searched for "term" > the most relevant results are returned.

Some files might also be linked to others for one reason or another. Those linked files should be searched too, but with lower priority (a "term" hit in a linked file should ideally count for less than a hit in the initial file).

The real-life case is about ten times more complex than this, so I cannot replace or remove the graph DB, nor remove the influence of the graph results on result relevancy.
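To make the flow above concrete, here is a minimal sketch in Python (used purely for illustration, the actual stack is PHP with Gremlin). It assumes the graph query has already produced the set of node ids the user is allowed to search. The `FileNode` class, the `rank` function and the `link_weight` value of 0.5 are hypothetical names and numbers, not part of neo4j or Rexster, and the hit counting is a naive substring count rather than real relevance scoring.

```python
from dataclasses import dataclass, field


@dataclass
class FileNode:
    """Hypothetical in-memory stand-in for a file node in the graph."""
    node_id: str
    properties: dict
    content: str = ""
    linked: list = field(default_factory=list)  # ids of linked file nodes


def count_hits(node, term):
    """Count case-insensitive occurrences of term in properties and content."""
    term = term.lower()
    prop_hits = sum(str(v).lower().count(term) for v in node.properties.values())
    return prop_hits + node.content.lower().count(term)


def rank(nodes, allowed_ids, term, link_weight=0.5):
    """Score each allowed node; hits in linked files count for less.

    allowed_ids is assumed to come from the graph query (groups, rights, ...).
    link_weight is an arbitrary illustrative discount for linked files.
    """
    scores = {}
    for node_id in allowed_ids:
        node = nodes[node_id]
        score = float(count_hits(node, term))
        for linked_id in node.linked:
            if linked_id in nodes:
                score += link_weight * count_hits(nodes[linked_id], term)
        if score > 0:
            scores[node_id] = score
    # Highest score first; ties broken by node id for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


nodes = {
    "a": FileNode("a", {"name": "budget.xlsx"}, "budget budget", linked=["b"]),
    "b": FileNode("b", {"name": "notes.txt"}, "budget"),
    "c": FileNode("c", {"name": "other.txt"}, "nothing relevant"),
}
results = rank(nodes, allowed_ids={"a", "b"}, term="budget")
```

Here node "a" gets its own hits plus half of the hits found in the linked node "b", while "c" is never scored because the graph filter excluded it.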

--------------------------------------------------

My questions are:

What is the best way of implementing this?
Should I extract the file content and place it in an indexed property on each node?
What would the drawbacks of doing this be?
Are there any better ways of going about it?

Thanks in advance guys.

Further details:

PHP web application
Using Rexster to load and access the neo4j graph
query language = gremlin (groovy)

Have you looked into document search implementations like [elasticsearch](http://www.elasticsearch.org/) or [solr](https://lucene.apache.org/solr/)? I think they are a better fit for indexing and querying document content. – Thomas Fenzl, Jun 09 '13 at 21:06
@ThomasFenzl Hi, I edited my question: "Search functionality should scan node properties and file content and return most relevant nodes." You might be able to clear this up for me, but I believe both use Lucene, as does neo4j. What advantage would I gain from adding either of those solutions? It seems a little redundant, especially considering that I would like to return nodes. It seems like there would be a lot of overhead. – Pomme.Verte, Jun 10 '13 at 00:29
I don't see where the graph structure comes into the picture in the problem as stated. Can you specify how you want to use the graph structure? Also, both solr and elasticsearch have implemented a bunch of interesting features on top of Lucene, [Faceted Search](http://www.elasticsearch.org/guide/reference/api/search/facets/) for example. – Thomas Fenzl, Jun 10 '13 at 06:33
@ThomasFenzl I updated my question again. Essentially, the second part (with files linked to others) is an optional bonus. What I'm trying to illustrate is that the graph node properties and structure are just as important as (if not more important than) the hits within the file content. Thanks again. – Pomme.Verte, Jun 10 '13 at 16:28
I have similar requirements. Have any new features been added to neo4j to support document search capabilities? – khussain, Feb 24 '22 at 00:10

score 2 · Answer 1 · answered Jun 09 '13 at 20:37

If you want to do a file content scan, you're probably better off choosing another data store for the file content. Neo4j would work great for searching things like file names and directory structures, but I believe you're talking about doing a byte array scan, and there are better systems out there for that.
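As a rough illustration of what keeping the content in a separate store could look like (this is not a specific product recommendation, and Python is used only for the example), the sketch below keeps a tiny inverted index keyed by graph node id, so that content hits can be intersected with the node ids returned by a graph query. `ContentIndex` and its methods are hypothetical names; a real setup would more likely delegate this part to a dedicated search engine.

```python
import re
from collections import defaultdict


class ContentIndex:
    """Toy inverted index kept outside the graph, keyed by graph node id."""

    def __init__(self):
        # token -> {node_id: number of occurrences in that node's file}
        self._postings = defaultdict(dict)

    def add(self, node_id, text):
        """Tokenize extracted file text and record per-node term counts."""
        for token in re.findall(r"\w+", text.lower()):
            counts = self._postings[token]
            counts[node_id] = counts.get(node_id, 0) + 1

    def search(self, term, allowed_ids=None):
        """Return (node_id, count) pairs for a single term, best first.

        allowed_ids, if given, is assumed to be the set of node ids that
        the graph query selected (for example by user rights).
        """
        hits = self._postings.get(term.lower(), {})
        if allowed_ids is not None:
            hits = {n: c for n, c in hits.items() if n in allowed_ids}
        return sorted(hits.items(), key=lambda kv: (-kv[1], kv[0]))


index = ContentIndex()
index.add("n1", "Report on budget. Budget approved.")
index.add("n2", "budget draft")
index.add("n3", "holiday plan")

all_hits = index.search("budget")
filtered_hits = index.search("Budget", allowed_ids={"n2", "n3"})
```

The graph stays responsible for structure and permissions, and only the node ids cross the boundary between the two stores.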

Nicholas

Thanks for taking the time to answer Nicholas. I have edited my question and added a comment in response to Thomas Fenzl's comment that might either shed light on my problem or show how ignorant I am ;) – Pomme.Verte Jun 10 '13 at 00:32
