I have an application that (among other things) stores a file system tree in a Neo4j graph: each directory and each file is a node. Some of these files are Office documents, text files or PDFs, and I would like to provide some search functionality.
The search should scan both node properties and file content, and return the most relevant nodes.
--------------------------------------------------
Update with extra information:
The graph is used to select the subset of files a search applies to. File nodes also contain custom metadata that needs to be searched. One of many use cases is:
A user searches for a "term" > the graph is used to find the files this search applies to (depending on user groups & rights, for example) > both the node properties and the file content of those files are searched for "term" > the most relevant results are returned.
Some files may also be linked to others for one reason or another, and those linked files should be searched too, but with lower priority (a "term" hit in a linked file should ideally count for less than a hit in the initial file).
The real-life case is roughly ten times more complex than this, so I cannot replace or remove the graph DB, nor ignore the influence of the graph results on result relevancy.
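To make the intended ranking concrete, here is a minimal Python sketch of the flow described above. It is only an illustration, not my actual code (the real application is PHP with Gremlin): the `FileNode` class, the `search` and `count_hits` functions, the in-memory list of nodes and the `LINK_WEIGHT` value of 0.5 are all hypothetical, and relevance is reduced to a plain count of term occurrences.

```python
from dataclasses import dataclass, field

# Assumption for illustration: a hit in a linked file counts half as much.
LINK_WEIGHT = 0.5


@dataclass
class FileNode:
    name: str
    properties: dict = field(default_factory=dict)  # custom metadata
    content: str = ""                               # extracted file text
    groups: set = field(default_factory=set)        # groups allowed to read
    links: list = field(default_factory=list)       # names of linked files


def count_hits(node, term):
    """Count case-insensitive occurrences of term in properties and content."""
    term = term.lower()
    texts = [str(v) for v in node.properties.values()] + [node.content]
    return sum(text.lower().count(term) for text in texts)


def search(nodes, user_groups, term):
    """Return (name, score) pairs, most relevant first."""
    by_name = {n.name: n for n in nodes}
    # Step 1: the graph decides which files the search applies to.
    visible = [n for n in nodes if n.groups & user_groups]
    results = []
    for node in visible:
        score = float(count_hits(node, term))
        # Step 2: linked files are searched too, at lower priority.
        for linked_name in node.links:
            linked = by_name.get(linked_name)
            if linked is not None:
                score += LINK_WEIGHT * count_hits(linked, term)
        if score > 0:
            results.append((node.name, score))
    return sorted(results, key=lambda r: r[1], reverse=True)
```

In this sketch a file with two direct hits and one hit in a linked file scores 2.5, while a file with a single hit in its metadata scores 1.0. What I am unsure about is where each of these steps should live (graph traversal, property index, or an external full-text index).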
--------------------------------------------------
My questions are:
- What is the best way of implementing this?
- Should I extract the content of each file and store it in an indexed property on the corresponding node?
- What would the drawbacks of doing this be?
- Are there any better ways of going about it?
Thanks in advance guys.
Further details:
- PHP web application
- Using Rexster to load and access the Neo4j graph
- Query language: Gremlin (Groovy)