We have a large collection of PDF, Word, and PPT files larger than 16 MB stored in MongoDB GridFS. We need to search the content of those documents and list the ones that contain the search terms. We tried searching the documents after retrieving them and extracting their text, but that is very slow and not feasible, since the number of documents will keep growing over time.

Is there any other way to achieve this? We have already searched SO for similar topics, including the one below, but nothing has helped so far:

Full-text search on MongoDB GridFS?

We have also looked at alternatives like Elasticsearch, but couldn't find any up-to-date references or examples; most of the available information is out of date. Any pointers would really help.

Jeet
  • GridFS stores files as binary blobs and MongoDB has no idea what format the binary represents. There's no way to search the contents of files in GridFS. If you store the extracted text too, then you could query it using $text or Atlas Search, but there's no way to query binary blobs in GridFS directly. – tfogo May 10 '21 at 21:13
  • Do you mean there is no solution for this use case, and that we should avoid GridFS for it? Isn't the purpose of GridFS to store large binary files? I am curious because searching seems like a common requirement to me. – Jeet May 12 '21 at 04:03
  • GridFS cannot search the contents of a file. You can search files by metadata like name, type, or any user-specified metadata. This is very similar to regular filesystems that map file paths to blocks of data. Generally it is not expected that a filesystem will be able to search the contents of a file without some extra indexing layer. – tfogo May 12 '21 at 17:03
  • In general, using GridFS is pretty niche. Usually people will store files on a regular file server or an object store like S3, and store the metadata in MongoDB. Those won't search file contents without some extra indexing either. You can read the GridFS docs for more info on exactly when GridFS is a good choice: https://docs.mongodb.com/manual/core/gridfs/#when-to-use-gridfs I think Elasticsearch has an "attachment" processor that might help you do what you want - but I'm not very familiar with Elasticsearch. – tfogo May 12 '21 at 17:04
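
The suggestion in tfogo's first comment (store the extracted text next to the file and query it with $text) could look roughly like the sketch below. It is only an illustration under assumptions, not a tested setup: the sidecar collection, the field names (`file_id`, `filename`, `text`) and all function names are made up for this example, and the text extraction itself is assumed to happen once at upload time with whatever extractor is already in use. The `collection` argument is expected to be a PyMongo collection; only `create_index`, `replace_one` and `find` are called on it.

```python
from typing import Any


def build_text_record(file_id: Any, filename: str, extracted_text: str) -> dict:
    """Sidecar document that points back at the GridFS file by its _id.

    Field names are hypothetical; use whatever fits your schema.
    """
    return {"file_id": file_id, "filename": filename, "text": extracted_text}


def build_search_query(terms: str) -> dict:
    """Filter for MongoDB's $text operator (requires a text index)."""
    return {"$text": {"$search": terms}}


def index_extracted_text(collection, file_id: Any, filename: str,
                         extracted_text: str) -> None:
    """Store the extracted text once, at upload time, instead of per search."""
    # "text" is the index type that PyMongo also exposes as pymongo.TEXT.
    # Creating an index that already exists is a no-op on the server.
    collection.create_index([("text", "text")])
    # Upsert so that re-uploading a file replaces its old text record.
    collection.replace_one(
        {"file_id": file_id},
        build_text_record(file_id, filename, extracted_text),
        upsert=True,
    )


def search_file_ids(collection, terms: str) -> list:
    """Return the GridFS file ids whose extracted text matches the terms."""
    cursor = collection.find(
        build_search_query(terms),
        {"file_id": 1, "filename": 1},  # projection: skip the large text field
    )
    return [doc["file_id"] for doc in cursor]
```

With PyMongo this would be wired up along the lines of `fs = gridfs.GridFS(db)`, `file_id = fs.put(data, filename=name)`, then `index_extracted_text(db["file_text"], file_id, name, text)`, where `file_text` is again a made-up collection name. Searches then hit only the text index and never read the binary chunks. Note that each sidecar document is still subject to MongoDB's 16 MB document limit, so very long extracted text may need to be split across several records.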