
I have a project I would like to work on which I feel is a beautiful case for Neo4j. But there are aspects of implementing this that I do not understand well enough to list my questions succinctly. So instead, I'll let the scenario speak for itself:

Scenario: Put simply, I want to build an application for Users who receive files of various types - docs, Excel spreadsheets, Word documents, images, audio clips and even videos (although not so much videos) - and allow them to upload and categorize these files.

With each file they will enter in any and all associations. Examples:

  • If Joe authors a PDF, Joe is associated with the PDF.
  • If a DOC says that Sally is Mary's mother, Sally is associated with Mary.
  • If Bill sent an email to Jane, Bill is associated with Jane (and the email).
  • If company X sends an invoice (Excel grid) to company Y, X is associated with Y.

and so on...
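To make the idea concrete, here is a minimal sketch of how a couple of these associations could be written into Neo4j using the official neo4j Python driver. The labels, relationship types and property names are placeholders I've made up for illustration, not a proposed schema:

```python
# Minimal sketch of the association model using the official neo4j Python driver.
# All labels, relationship types and property names below are illustrative only.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def record_authorship(tx, author, file_name, storage_key):
    # "Joe authors a PDF": Person and File nodes plus an AUTHORED relationship.
    # The File node holds only metadata plus a pointer (storage_key) to wherever
    # the binary actually lives (filesystem path, S3 key, etc.).
    tx.run(
        """
        MERGE (p:Person {name: $author})
        MERGE (f:File {name: $file_name})
        SET f.storage_key = $storage_key
        MERGE (p)-[:AUTHORED]->(f)
        """,
        author=author, file_name=file_name, storage_key=storage_key,
    )

def record_email(tx, sender, recipient, file_name):
    # "Bill sent an email to Jane": Bill is associated with Jane and with the
    # email itself, so the file sits on the path between the two people.
    tx.run(
        """
        MERGE (s:Person {name: $sender})
        MERGE (r:Person {name: $recipient})
        MERGE (f:File {name: $file_name})
        MERGE (s)-[:SENT_TO]->(r)
        MERGE (s)-[:SENT]->(f)
        MERGE (f)-[:RECEIVED_BY]->(r)
        """,
        sender=sender, recipient=recipient, file_name=file_name,
    )

with driver.session() as session:
    session.execute_write(record_authorship, "Joe", "report.pdf", "files/2015/report.pdf")
    session.execute_write(record_email, "Bill", "Jane", "email-123.eml")
```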

So the basic goal at this point would be to:

  • Have users load in files as they receive them.
  • Enter the associations that each file contains.
  • Review associations holistically, in order to predict or take some action.
  • Generate a report of the associations of interest, including the files that the associations are based on.

The value for this project is in the associations, which in reality would grow much more complex than the above examples and should produce interesting conclusions. However, if the User is asked "How did you come to that conclusion?", they need to be able to produce a summary of the associations as well as any files that these associations are based on - i.e. the PDF or Excel file or whatever.
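That "show me the evidence" requirement is where keeping the files as their own nodes pays off: tracing a conclusion back to its source documents becomes an ordinary graph query. A hedged sketch, assuming the illustrative model above:

```python
# Sketch of a provenance query: given two people, list every file that lies on
# a path between them, i.e. the documents their association is based on.
# Assumes the illustrative model sketched earlier, where files are their own nodes.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def evidence_between(tx, person_a, person_b):
    result = tx.run(
        """
        MATCH path = (a:Person {name: $a})-[*..4]-(b:Person {name: $b})
        UNWIND [n IN nodes(path) WHERE n:File] AS file
        RETURN DISTINCT file.name AS file_name, file.storage_key AS storage_key
        """,
        a=person_a, b=person_b,
    )
    return [record.data() for record in result]

with driver.session() as session:
    for row in session.execute_read(evidence_between, "Bill", "Jane"):
        print(row["file_name"], "->", row["storage_key"])
```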

Initial thoughts...

I should also add that this application would be hosted internally and used by approximately 50 Users, so I probably don't need the super-duper fastest, most scalable, most highly available solution possible. The data being loaded could get rather large though - maybe up to a terabyte in a year? (Not the associations, but the actual files.)

Wouldn't it be great if Neo4J just did all of this! Obviously it should handle the graph aspects of this very nicely, but I figure that file storage and text search are going to need another player added to the mix.

Some combinations of solutions I know of would be:

  • Store EVERYTHING including files as binary in Neo4J.

    I would be wrestling Neo4J into something it's not built for. How would I search the text?

  • Store only associations and metadata in Neo4J, and uploaded files on the file system.

    How would I do text searches on files that are stored on a file server?

  • Store only associations and metadata in Neo4J, and uploaded files in Postgres.

    I'm not so confident about having all my files inside a DB. I feel more comfortable having all my files accessible in folders.

    Everyone says it's great to put your files in a DB. Everyone says it's not great to put your files in a DB.

Get to the bloody questions...

  1. Can anyone suggest a good "stack" that would suit the above?
  2. Please give a basic outline of how you would implement your suggestion, i.e.:

    • Have the application store the data into Neo4J, then use triggers to update Postgres.
    • Or have the files loaded into Postgres and triggers update Neo4J.
    • Or have the application load data into Neo4J, and then the application loads data into Postgres.
    • etc

How you would tie these together is probably what I am really trying to grasp.

Thank you very much for any input on this.

Cheers.

p.s. What a ramble! If you feel the need to edit my question or title to simplify, go for it! :)

Oscar
  • I'm not sure this question is a good fit for the site since there's no "right" answer. My 2c: use something like [Apache Tika](https://tika.apache.org/) with your own extensions to extract indexable data from binaries **once** at ingest time, upload files to AWS S3 (or similar), Solr or ElasticSearch for free-text, Neo4j for relational search. – Mikesname Dec 29 '15 at 11:02
  • @Mikesname, Sorry, I'm still struggling to articulate exactly what I need to ask due to my lack of familiarity. Tks for the Tika suggestion. I probably need to start with understanding how to store my files and use Neo4J for relations - before I get into parsing the actual file itself. I have edited my question to reflect this. – Oscar Dec 29 '15 at 13:08

1 Answer


Here are my recommendations:

  • Never store binary files in the database. Store them in the filesystem or a service like AWS S3 instead, and reference the file in your data model.
  • I would store the file first in S3, and a reference to it in your primary database (Neo4j?).
  • If you want to be able to search for any word in a document, I would recommend using a full-text search engine like Elasticsearch. Elasticsearch can scan multiple document formats, like PDF, using Tika.
  • You can probably also use Elastic/Tika to search for relationships in the document and surface them in order to update your graph.
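As a rough sketch of how the application could tie those three pieces together at ingest time (bucket, index, label and property names below are placeholders, not a prescribed schema):

```python
# Rough sketch of an application-orchestrated ingest flow; bucket, index and
# label names are placeholders.
import uuid

import boto3
from elasticsearch import Elasticsearch
from neo4j import GraphDatabase

s3 = boto3.client("s3")
es = Elasticsearch("http://localhost:9200")
neo = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def ingest_file(local_path, file_name, extracted_text):
    file_id = str(uuid.uuid4())
    s3_key = f"uploads/{file_id}/{file_name}"

    # 1. Store the original binary in S3 (or any redundant file store).
    s3.upload_file(local_path, "my-document-bucket", s3_key)

    # 2. Index the extracted text (e.g. from Tika) for full-text search.
    es.index(index="documents", id=file_id,
             document={"file_name": file_name, "s3_key": s3_key, "text": extracted_text})

    # 3. Record the file in Neo4j; association edges get attached to this node
    #    later, as Users enter them.
    with neo.session() as session:
        session.run(
            "MERGE (f:File {id: $id}) SET f.name = $name, f.s3_key = $key",
            id=file_id, name=file_name, key=s3_key,
        )
    return file_id
```

The key point is that every Neo4j node and every Elasticsearch document carries the same S3 key, so a report can always hand back the original file behind an association.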

Suggested Stack:

  • Neo4j
  • Elasticsearch
  • AWS S3 or some other redundant filesystem to avoid data loss

Bonus: See this SO question/answer for best practices on indexing files in multiple formats using ES.
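For completeness, the Tika-based extraction mentioned above is exposed in Elasticsearch through the ingest-attachment plugin. A minimal sketch, assuming the plugin is installed and using placeholder index/pipeline names:

```python
# Minimal sketch of indexing a binary document through Elasticsearch's
# ingest-attachment pipeline (requires the ingest-attachment plugin).
# Index and pipeline names are placeholders.
import base64
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# One-time setup: a pipeline whose attachment processor runs Tika over the
# base64-encoded bytes in the "data" field and stores the extracted text.
es.ingest.put_pipeline(
    id="attachments",
    processors=[
        {"attachment": {"field": "data"}},
        {"remove": {"field": "data"}},  # don't keep the raw base64 in the index
    ],
)

with open("invoice.pdf", "rb") as fh:
    encoded = base64.b64encode(fh.read()).decode("ascii")

# Index the document through the pipeline; the extracted text ends up under
# "attachment.content" and is searchable like any other field.
es.index(index="documents", pipeline="attachments",
         document={"file_name": "invoice.pdf", "data": encoded})

es.indices.refresh(index="documents")
hits = es.search(index="documents",
                 query={"match": {"attachment.content": "invoice"}})
print(hits["hits"]["total"]["value"])
```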

albertoperdomo
  • Thanks @albertoperdomo. Checked out those resources and they look like a good fit for what I need. – Oscar Jan 05 '16 at 01:27