I want to store every tag along with the documents in which it appears, and make this searchable by other services/clients. Scale:
- 10 billion search queries per day
- 10 million tag create/update/delete operations per day (a tag removed from or appended to a document)
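For a sense of what these numbers mean per second, here is a rough back-of-the-envelope conversion. It assumes traffic is spread evenly across the day, which real traffic will not be, so peak rates would be higher.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

searches_per_day = 10_000_000_000   # 10 billion search queries
tag_writes_per_day = 10_000_000     # 10 million tag changes

# Average rates, assuming a perfectly even spread over the day
avg_search_qps = searches_per_day / SECONDS_PER_DAY
avg_write_qps = tag_writes_per_day / SECONDS_PER_DAY

print(f"average search QPS: {avg_search_qps:,.0f}")  # about 115,741
print(f"average write QPS:  {avg_write_qps:,.0f}")   # about 116
```

So the workload is read-heavy by roughly three orders of magnitude.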
Suppose "hello" appears in 10 million documents. When a user queries for "hello", I want to return the list of document_ids in which it occurs.
How should I model the data for this?
option 1: use a key-value NoSQL store like DynamoDB
key: "hello"
value: [doc_id1, doc_id2, .......]
Issue: whenever a document adds or removes this tag, we have to read the entire value, modify it, and write it back.
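To make the concern concrete, here is a minimal sketch of the read-modify-write cycle. The dict `store` and the function `remove_tag_from_doc` are hypothetical in-memory stand-ins, not any real database API.

```python
from typing import Dict, List

# Hypothetical stand-in for the key-value store: tag -> list of document ids
store: Dict[str, List[str]] = {"hello": ["doc_id1", "doc_id2", "doc_id122"]}


def remove_tag_from_doc(tag: str, doc_id: str) -> int:
    """Remove doc_id from the tag's list; return how many ids had to be read."""
    doc_ids = store.get(tag, [])                    # read the whole value
    updated = [d for d in doc_ids if d != doc_id]   # modify it in memory
    store[tag] = updated                            # write the whole value back
    return len(doc_ids)


ids_read = remove_tag_from_doc("hello", "doc_id122")
print(store["hello"], ids_read)
```

Every single change touches the full list, so for a tag with 10 million document ids each write would move all 10 million.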
option 2: store each (tag, document) pair as an individual row, using something like MongoDB
"hello": doc_id1
"hello": doc_id2
Issue: suppose doc_id122 removes the "hello" tag; we would then have to fetch all entries for "hello" to delete this one, as our database will be sharded on tag_name.
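A minimal sketch of this layout and of the delete I am worried about. Everything here is hypothetical: `NUM_SHARDS`, `shard_for`, and the list-backed shards are illustration only, and the sketch assumes there is no index on the (tag, doc_id) pair inside a shard.

```python
from typing import List, Tuple

NUM_SHARDS = 4  # hypothetical shard count

# Each shard holds (tag, doc_id) rows; the shard is chosen from the tag alone
shards: List[List[Tuple[str, str]]] = [[] for _ in range(NUM_SHARDS)]


def shard_for(tag: str) -> int:
    # Simple deterministic hash, for illustration only
    return sum(tag.encode("utf-8")) % NUM_SHARDS


def add_tag(tag: str, doc_id: str) -> None:
    shards[shard_for(tag)].append((tag, doc_id))


def remove_tag(tag: str, doc_id: str) -> int:
    """Delete one row by scanning the tag's shard; return rows examined."""
    rows = shards[shard_for(tag)]
    for i, row in enumerate(rows):
        if row == (tag, doc_id):
            del rows[i]
            return i + 1
    return len(rows)


for n in range(1, 201):
    add_tag("hello", f"doc_id{n}")

examined = remove_tag("hello", "doc_id122")
print(examined)  # 122 rows examined to delete a single one
```

If the store keeps an index on the (tag, doc_id) pair, the delete might avoid this scan; the sketch only shows the unindexed case.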
option 3: a column-based store (e.g. Cassandra)
option 4: Elasticsearch
Additional requirements:
- we want to support autosuggest on tags in our tag service.
- results should be ranked (we can't return 1 million documents in one go), so return the 50 most popular documents first (popularity can be most viewed or most clapped). I think Elasticsearch has a built-in option to rank documents using the TF-IDF algorithm.
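To pin down what I mean by these two requirements, here is a small plain-Python sketch. It is not how Elasticsearch or any of the stores above implement it; `suggest`, `top_documents`, and the `popularity` map are hypothetical names, and popularity is assumed to be a single precomputed number per document (e.g. view count).

```python
import bisect
import heapq
from typing import Dict, List


def suggest(sorted_tags: List[str], prefix: str, limit: int = 10) -> List[str]:
    """Autosuggest: return up to `limit` tags starting with `prefix`."""
    i = bisect.bisect_left(sorted_tags, prefix)  # first tag >= prefix
    matches: List[str] = []
    while (i < len(sorted_tags) and len(matches) < limit
           and sorted_tags[i].startswith(prefix)):
        matches.append(sorted_tags[i])
        i += 1
    return matches


def top_documents(doc_ids: List[str], popularity: Dict[str, int],
                  k: int = 50) -> List[str]:
    """Ranking: return the k most popular documents, most popular first."""
    return heapq.nlargest(k, doc_ids, key=lambda d: popularity.get(d, 0))


tags = sorted(["hello", "help", "helm", "world", "hero"])
print(suggest(tags, "hel"))  # ['hello', 'helm', 'help']

popularity = {"doc_id1": 120, "doc_id2": 900, "doc_id3": 45}
print(top_documents(["doc_id1", "doc_id2", "doc_id3"], popularity, k=2))
# ['doc_id2', 'doc_id1']
```

Note that ranking by popularity like this is a different signal from text relevance scoring such as TF-IDF, so the chosen store would need to support sorting or boosting by a numeric field.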