On the surface, both look like we generate a low-dimensional representation of texts by hashing or vectorizing them, where similar vectors lie close together in the vector space (in the embedding case) and similar hashes land in the same bucket (in the LSH case). How are these different? What am I missing?
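To make the comparison concrete, here is a minimal sketch (not from the original question) assuming we already have toy dense vectors for three texts; the random-hyperplane (SimHash) scheme stands in for LSH in general, and cosine similarity stands in for "closeness" in the embedding space:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "embeddings": in practice these would come from word2vec, a sentence
# encoder, etc. Hand-picked here so texts 0 and 1 are similar, text 2 is not.
vecs = np.array([
    [0.90, 0.10, 0.00],   # text 0
    [0.85, 0.15, 0.05],   # text 1 (close to text 0)
    [0.00, 0.20, 0.95],   # text 2 (different)
])

# (1) Embedding view: similarity = closeness in the vector space (cosine here).
def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print("cos(0,1) =", cosine(vecs[0], vecs[1]))  # high -> similar
print("cos(0,2) =", cosine(vecs[0], vecs[2]))  # low  -> dissimilar

# (2) LSH view (random-hyperplane hashing for cosine similarity):
# each hyperplane contributes one bit; similar vectors tend to get the
# same bit pattern, i.e. land in the same bucket.
n_planes = 8
planes = rng.normal(size=(n_planes, vecs.shape[1]))

def lsh_bucket(v):
    bits = (planes @ v) > 0
    return "".join("1" if b else "0" for b in bits)

for i, v in enumerate(vecs):
    print(f"text {i} -> bucket {lsh_bucket(v)}")
```

In this sketch the embedding itself is the representation you search over directly (e.g. with nearest-neighbour search), while LSH is a second step applied to those vectors that trades exactness for fast bucket lookups.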
- Your question is hard to parse: do you mean "similarity search" where a "text" (=document) is used as a query? Also, what do you mean by "similarity"? Word similarity? Semantic similarity? Relevance to a query? Please add some context, e.g. are we indexing documents or searching and ranking them? – Miro Lehtonen Sep 11 '19 at 04:52
- In the NLP literature there is a general consensus on the usage of "text": it is a catch-all term for a set of characters, words, a sentence, a phrase, a paragraph, or a long-form document, depending on the task at hand. There is also a general consensus on the term "similarity search": it is shorthand for semantic textual similarity (STS), so similarity here means semantic similarity. Also note that word embeddings do not only work at the word level; they can be used at the sentence or document level with more complex architectures such as RNNs (and their cousins) or transformers. But I see your point, rephrasing... – Prithiviraj Damodaran Sep 12 '19 at 05:14