
I want to create a vocabulary graph with word vectors. The aim is to query for the nearest word in the vocabulary graph based on word similarity. How can we achieve this in Neo4j?

The following is an example:

Suppose vocabulary consists of the following:

Product Quality
Wrong Product
Product Price
Product Replacement

And query word is: Affordable Product

In a single query I should be able to figure out that "Affordable Product" is more closely related to "Product Price" than any others.

Please note that I am storing word embeddings in the graph, so a cosine-similarity check against each word in the vocabulary, one by one, does let me achieve this. However, when the vocabulary becomes large, querying one by one hinders speed and performance.

If there is any way to store the word embeddings for a domain vocabulary as a graph that can be queried for the nearest node based on cosine similarity, that could be a possible solution. However, I have not been able to find anything like this so far.

Looking forward to any pointers as well. Thanks
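(For concreteness, the one-by-one comparison described above can at least be done in a single vectorized pass rather than a loop. This is a minimal sketch, assuming phrase vectors already exist; the vectors here are random placeholders, not real embeddings:)

```python
import numpy as np

# Hypothetical setup: one embedding per vocabulary phrase (values are made up).
vocab = ["Product Quality", "Wrong Product", "Product Price", "Product Replacement"]
rng = np.random.default_rng(0)
vocab_vecs = rng.normal(size=(len(vocab), 50))

def nearest(query_vec, vocab_vecs, vocab):
    # Cosine similarity of the query against ALL vocabulary vectors at once,
    # instead of looping over them one by one.
    a = vocab_vecs / np.linalg.norm(vocab_vecs, axis=1, keepdims=True)
    b = query_vec / np.linalg.norm(query_vec)
    sims = a @ b
    best = int(np.argmax(sims))
    return vocab[best], float(sims[best])

label, score = nearest(rng.normal(size=50), vocab_vecs, vocab)
print(label, score)
```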

  • Your query term, `'Affordable Product'`, is not in your vocabulary. (Technically, not even `'Product'` is, as you've only mentioned tokens that include 2 words.) Setting aside any speed concerns or possible specifics (like a graph database) for a moment, how were you expecting the code to rate the unknown term `'Affordable Product'` closest to `'Product Price'`? (Do you have an existing solution for this, which is just too slow, prompting your query about constructing a graph?) – gojomo May 16 '20 at 18:05
  • I will store cosine distance to all other nodes in the graph. This was the roughest implementation I could think of. – buddy May 18 '20 at 04:05
  • What nodes in what graph? Setting aside any speed concerns or possible specifics (like a graph structure) for a moment, how were you expecting the code to rate the **unknown** term 'Affordable Product' closest to 'Product Price'? (Do you have an existing solution for this, which is just too slow, prompting your query about constructing a graph? Having a way to do this unknown-to-known comparison, even without any graph structure optimizations, is a prerequisite for considering other approaches like a precalculated nearest-neighbor(s) graph.) – gojomo May 18 '20 at 18:34
  • I use cosine distance to find nearness. I can do that against all nodes in the graph space to find the nearest. – buddy May 19 '20 at 18:00
  • If the term `'Affordable Product'` (that exact two-word string) is not in your vocabulary, there's vector for it to calculate a cosine-distance. – gojomo May 19 '20 at 23:00
  • I did not understand. Was that a question? – buddy May 20 '20 at 16:30
  • Er, that should have said, "there's **no** vector for it to calculate a cosine-distance." Yes, in order to understand your question & goals, I still have the questions: "Setting aside any speed concerns or possible specifics (like a graph database) for a moment, how were you expecting the code to rate the unknown term 'Affordable Product' closest to 'Product Price'?" And: "Do you have an existing solution for this, which is just too slow, prompting your query about constructing a graph?" – gojomo May 20 '20 at 16:39
  • I'm assuming you have a starting vocabulary (which you've mentioned), for which there are vectors (which you've mentioned). But you have no graph yet (which you're asking about), & it's not yet clear how having a graph would solve your example problem (because having a graph, alone, doesn't help make an unknown term with no vector like `'Affordable Product'` comparable to a known-term like `'Product Price'`). If that's the issue, graphs aren't relevant - but it'd be good to know what you've tried. But if there's some other real reason for graph, that'd be good to know, too. – gojomo May 20 '20 at 16:40
  • In fact, the real reason I'm looking for a graph is to see if I can query the nearest label quickly for a new label. Otherwise I have to compare one by one, which takes a lot of time. – buddy May 20 '20 at 23:17
  • But if it's a `new label`, it's in neither the full vocabulary of vectors, nor the graph. Even if somehow, in some unspecified way, you bootstrap a vector for it, it's still not in the graph. If you want to find the one existing word that's closest, that's a full check against all known words. Once you've done that, there's no need for a cache graph of each words' nearest neighbor(s) - you've already done the hard part. (But also: if bulk "one-by-one" comparison seems too time-consuming, you might be doing it poorly - hence my Q about what you've tried and why you've concluded it's too slow.) – gojomo May 21 '20 at 00:45
  • There's ways to do the bulk comparison better! There's other optimizations for other purposes. But to make a good suggestion, I need to know where you're really at - what's the data you've got, what have you tried & found too-slow, what's the real ultimate goal. Not more details of one hypothesized solution, "a vocabulary graph". That might have some uses - but doesn't seem to fit the hints about your need. It's not commonly used for fast "nearest... by cosine-similarity" lookups, except in some extreme situations trading away precise results, because of challenges in high-dimensional spaces. – gojomo May 21 '20 at 00:50
  • I am yet to fully delve into trying out different options for this approach. Quite a newbie with neo4j. I am open to other options as well, if any. – buddy May 22 '20 at 17:04
  • Have multiple options I could suggest if you could further explain the details of needs & things-tried I've requested. – gojomo May 22 '20 at 18:10
  • Let me tell you what I am trying to achieve, which might help our discussion. I am trying to label short text with the most appropriate topic from a vocabulary of topics. I am trying to come up with a scalable model which I can create and use for any domain. So far what I have been doing is maintaining a list of topics and comparing them one by one to the sentences I have, to find which topic matches best. However, as the vocabulary becomes big, this is a very slow process. – buddy May 22 '20 at 23:51
  • The reason for exploring neo4j was to see if I can create a curated vocabulary of any domain, and if I can find the nearest tag to a sentence with a single query. As I am quite new to this, I am not sure if the approach is right or the direction worth pursuing fully. Do let me know if any other info would help. – buddy May 22 '20 at 23:58
  • What you've described seems to be a text-classification process: you have texts, you want them labeled from some fixed set of categories. (Maybe each text gets one label, maybe it gets a few with varying confidence weights.) Dense word/doc vectors, like from Word2Vec or Doc2Vec, *might* wind up as a helpful part of a solution, but pre-loading a graph database likely won't. I suggest finding online examples of text-classification that seem similar to your need (whether they use word-vectors or not), and adapting those. – gojomo May 23 '20 at 02:39
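(One common way to give an unseen phrase like "Affordable Product" a comparable vector, a point the thread leaves open, is to average the vectors of its individual words. A toy sketch; the word vectors below are made up for illustration, not real embeddings:)

```python
import numpy as np

# Toy word vectors (made up); in practice these would come from e.g. Word2Vec.
word_vecs = {
    "product": np.array([1.0, 0.0, 0.5]),
    "price": np.array([0.2, 1.0, 0.0]),
    "affordable": np.array([0.1, 0.9, 0.1]),
    "quality": np.array([0.0, 0.2, 1.0]),
}

def phrase_vector(phrase):
    # Average the vectors of the known words; skip unknown words.
    vecs = [word_vecs[w] for w in phrase.lower().split() if w in word_vecs]
    if not vecs:
        raise ValueError("no known words in phrase")
    return np.mean(vecs, axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

q = phrase_vector("Affordable Product")
for label in ["Product Price", "Product Quality"]:
    print(label, round(cosine(q, phrase_vector(label)), 3))
```

With these toy vectors, "Affordable Product" scores higher against "Product Price" than against "Product Quality", which is the behavior the question asks for.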

2 Answers


What you want to do is store your embedding results in the graph. The next step is to use the Neo4j Graph Data Science library and run, specifically, the cosine similarity algorithm. It should look something along the lines of:

MATCH (p:Word)
WITH {item: id(p), weights: p.embedding} AS wordData
WITH collect(wordData) AS data
CALL gds.alpha.similarity.cosine.write({
  nodeProjection: '*',
  relationshipProjection: '*',
  data: data,
  // here is where you define how many nearest neighbours should be stored
  topK: 1,
  // here you define the minimal similarity between a
  // given pair of nodes for the pair to still be relevant
  similarityCutoff: 0.1
})
YIELD nodes, similarityPairs, writeRelationshipType, writeProperty, min, max, mean, stdDev, p25, p50, p75, p90, p95, p99, p999, p100
RETURN nodes, similarityPairs, writeRelationshipType, writeProperty, min, max, mean, p95

You have now precomputed the nearest neighbors and can easily query them like:

MATCH (w:Word)-[:SIMILAR]-(other)
RETURN other

Hope this helps, let me know if you have any other questions.

Tomaž Bratanič

After trying out and reading over several options, I found that https://github.com/facebookresearch/faiss is the best option for this use case.

buddy