I want to store all tags against the documents in which they appear and make them searchable by other services/clients. Scale:

  • 10 billion search queries per day
  • 10 million tag CRUD operations per day (a tag deleted from a doc or appended to a doc)

Suppose "hello" appears in 10 million documents. When a user queries for "hello", I want to return the list of document_ids in which it occurs.

How should I model the data for this?

Option 1: use a key-value NoSQL store like DynamoDB

key: "hello"
value: [doc_id1, doc_id2, .......]

Issue: whenever a document adds or removes this tag, we have to read the whole value, modify it, and write it back.
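
To make the read-modify-write cost concrete, here is a minimal sketch. It uses a plain in-memory dict as a stand-in for the key-value store; `kv_store`, `add_doc_to_tag`, and `remove_doc_from_tag` are hypothetical names for illustration, not the DynamoDB API.

```python
# In-memory stand-in for a key-value store (assumption: not a real DynamoDB client).
kv_store: dict[str, list[str]] = {}


def add_doc_to_tag(tag: str, doc_id: str) -> None:
    """Append doc_id to the tag's list: read the whole value, change it, write it back."""
    doc_ids = list(kv_store.get(tag, []))  # read the full list
    if doc_id not in doc_ids:              # linear scan to avoid duplicates
        doc_ids.append(doc_id)
    kv_store[tag] = doc_ids                # write the full list back


def remove_doc_from_tag(tag: str, doc_id: str) -> None:
    """Remove doc_id from the tag's list; again the whole value is read and rewritten."""
    doc_ids = kv_store.get(tag, [])
    kv_store[tag] = [d for d in doc_ids if d != doc_id]


add_doc_to_tag("hello", "doc_id1")
add_doc_to_tag("hello", "doc_id2")
remove_doc_from_tag("hello", "doc_id1")
print(kv_store["hello"])
```

With 10 million doc ids under one key, every single add or remove touches the entire value, and two concurrent writers can overwrite each other unless the store offers some form of conditional or atomic update.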

Option 2: store individual rows, using something like MongoDB

"hello": doc_id1
"hello": doc_id2

Issue: when doc_id122 removes the "hello" tag, we would have to fetch all entries for the tag to delete this one, since the database will be sharded on tag_name.
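
Whether the full fetch is needed depends on how rows are keyed. As a rough sketch, assuming each row is addressed by the pair (tag_name, doc_id) with tag_name as the shard key, a removal can target one row directly. The in-memory structure and the names `tag_rows`, `insert_row`, `delete_row`, and `docs_for_tag` are hypothetical and do not correspond to any particular database API.

```python
from collections import defaultdict

# shard key (tag_name) -> set of doc_ids; each (tag_name, doc_id) pair models one row.
tag_rows: defaultdict[str, set[str]] = defaultdict(set)


def insert_row(tag: str, doc_id: str) -> None:
    """Insert the single row (tag, doc_id); no other rows are read."""
    tag_rows[tag].add(doc_id)


def delete_row(tag: str, doc_id: str) -> None:
    """Delete the single row (tag, doc_id) by its full key; no scan of the tag's rows."""
    tag_rows[tag].discard(doc_id)


def docs_for_tag(tag: str) -> list[str]:
    """Return all doc_ids stored under the tag (sorted here only for stable output)."""
    return sorted(tag_rows.get(tag, set()))


insert_row("hello", "doc_id1")
insert_row("hello", "doc_id2")
insert_row("hello", "doc_id122")
delete_row("hello", "doc_id122")
print(docs_for_tag("hello"))
```

This only addresses the delete path; a very popular tag still concentrates all of its rows and reads on one shard.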

Option 3: column-based (e.g. Cassandra)

Option 4: Elasticsearch

Additional requirements:

  1. We want to support autosuggest on tags in our tag service.
  2. Results should be ranked (we can't return 1 million documents in one go), so return the first 50 most popular documents (e.g. most viewed or most clapped). I think Elasticsearch has built-in relevance ranking based on the TF-IDF algorithm.
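
The two requirements above can be sketched independently of any particular store. This is an illustration only, assuming the tag vocabulary fits in a sorted in-memory list and that a popularity score per document (views, claps) is already available; `suggest_tags`, `top_documents`, and the sample data are hypothetical.

```python
import bisect
import heapq


def suggest_tags(sorted_tags: list[str], prefix: str, limit: int = 10) -> list[str]:
    """Prefix autosuggest: binary-search the sorted tag list, then walk forward."""
    start = bisect.bisect_left(sorted_tags, prefix)
    suggestions = []
    for tag in sorted_tags[start:start + limit]:
        if not tag.startswith(prefix):  # tags sharing a prefix are contiguous
            break
        suggestions.append(tag)
    return suggestions


def top_documents(doc_ids: list[str], popularity: dict[str, int], k: int = 50) -> list[str]:
    """Return the k most popular doc_ids without sorting the whole list."""
    return heapq.nlargest(k, doc_ids, key=lambda d: popularity.get(d, 0))


tags = sorted(["hello", "help", "held", "world"])
print(suggest_tags(tags, "hel"))

views = {"doc_id1": 5, "doc_id2": 9, "doc_id3": 1}
print(top_documents(["doc_id1", "doc_id2", "doc_id3"], views, k=2))
```

Note that this ranks by a popularity signal, as the requirement asks, which is a different thing from text-relevance scoring such as TF-IDF.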
  • So what is your question exactly? – Erick Ramirez Jun 25 '22 at 23:18
  • Using any data as fields/keys will only lead to future query pain. Something like `{"token": "hello", "inDocs": [ doc_id1, doc_id2, ...]}` would be better. – rickhg12hs Jun 26 '22 at 18:19
  • But there are a few problems with that. Suppose you shard the data by the token key. Then: 1. How would you handle hot tokens? 2. When a doc adds or removes a popular token, there will be a lot of queries for this token, and we have to parse the list first and then add/remove accordingly. 3. What if the list becomes very large for one token? – Hema Pushpa Jun 27 '22 at 11:07

0 Answers