Traversing through all nodes and comparing each one with every other one

Question

I am working on a little project and I have a dataset of about 60k nodes and 500k relationships between those nodes. The nodes are of two types. First type are are recipes and the second type are ingredients. Recipes are composed of ingredients like:

    (ingredient)-[:IS_PART_OF]->(recipe)

My objective is to find how many common ingredients two recipes share. I have managed to obtain this information with the following query that compares one recipe to all others (the first one with all others):

   MATCH (recipe:RECIPE{ ID: 1000000 }),(other)
   WHERE (other.ID >= 1000001 AND other.ID <= 1057690)
   OPTIONAL MATCH (recipe:RECIPE)<-[:IS_PART_OF]-(ingredient:INGREDIENT)-                 [:IS_PART_OF]->(other)
   WITH ingredient, other
   RETURN other.ID, count(distinct ingredient.name)
   ORDER BY other.ID DESC

My first question: How can I obtain the number of all ingredients of two recipes in a way that the mutual ones are counted only once (union of R1 and R2 --> R1 U R2)

My second question: is it possible to write a loop that would iterate through all the recipes and check for common ingredients? The objective is to compare each recipe with all others. I think this should return (n-1)*(n/2) rows.

I have tried the above and the problem remains. Even with LIMIT and SKIP I can not run the code on the whole set. I have changed my query so it allows me to partition my set accordingly:

MATCH (recipe1)<-[:IS_PART_OF]-(ingredient:INGREDIENT)-[:IS_PART_OF]->(recipe2)
WHERE (recipe2.ID >= 1000000 AND recipe2.ID <= 1000009) AND (recipe1.ID >=   1000000 AND recipe1.ID <= 1000009) AND (recipe1.ID < recipe2.ID)
RETURN recipe1.ID, count(distinct ingredient.name) AS MutualIngredients, recipe2.ID
ORDER BY recipe1.ID

Until I get my hands on a better machine this will suffice.

I still haven't solved my first question: how can I obtain the number of all ingredients of two recipes in a way that the mutual ones are counted only once (union of R1 and R2 --> R1 U R2)

FrobberOfBits · Accepted Answer · 2015-07-05T17:30:30.640

2

You'll need to play with this, but it's going to be something similar to this:

MATCH (recipe1:RECIPE)<-[:IS_PART_OF]-(ingred:INGREDIENT)-[:IS_PART_OF]->(recipe2:RECIPE)
WHERE ID(recipe1) < ID(recipe2)
RETURN recipe1, collect(ingred.name), recipe2
ORDER BY recipe1.ID

The match pattern gets you all of the common ingredients between two recipes. The WHERE clause ensures that you're not comparing a recipe to itself (because it would share all ingredients with itself). The return clause just gives you the two recipes you're comparing, and what they have in common.

This will be O(n^2) though, and will be very slow.

UPDATE took Nicole's suggestion, which is a good one. That should guarantee each pair is only considered once.

edited Jul 05 '15 at 17:30

answered Jul 05 '15 at 14:47

FrobberOfBits

17,634
4
52
86

1

You should add `WHERE ID(recipe1) < ID(recipe2)` – Nicole White Jul 05 '15 at 15:57
Thank you for your answers and suggestions. I have tried running the query and in runs for a while and then terminates sometimes because of java heap error, GC error or the DB gets disconnected (I have checked on-line and it is also a GC error). I think the operation is just too intense for my computer. I have tried increasing the heap space and some other tweaks but without success. Is it possible to divide the operation into several smaller ones without altering the data itself and still obtain the desired result? – matjaz.moser Jul 05 '15 at 17:38
Yes; use `LIMIT` and `SKIP` to break this query up into "pages" of results. If you have a lot of recipes and ingredients, it's not suprising that it would take a lot of memory.https://stackoverflow.com/questions/16338670/how-to-paginate-query-results-with-cypher – FrobberOfBits Jul 05 '15 at 17:39
1

For example if your query is `MATCH (n) RETURN n` you can instead do `MATCH (n) RETURN n LIMIT 10 SKIP 10` and that will return the second set of 10 results, so you can break them into chunks of whatever size. – FrobberOfBits Jul 05 '15 at 17:40

score 0 · Answer 2 · edited Sep 07 '15 at 17:02

SOLVED: Just to share it if someone else will need it:

    MATCH (recipe1)<-[:IS_PART_OF]-(ingredient:INGREDIENT)-[:IS_PART_OF]->(recipe2)
    MATCH (recipe1)<-[:IS_PART_OF]-(ingredient1:INGREDIENT)
    MATCH (recipe2)<-[:IS_PART_OF]-(ingredient2:INGREDIENT)
    WHERE (recipe2.ID >= 1000000 AND recipe2.ID <= 1000009) AND (recipe1.ID >=   1000000 AND recipe1.ID <= 1000009) AND (recipe1.ID < recipe2.ID)
    RETURN recipe1.ID, count(distinct ingredient1.name) + count(distinct ingredient2.name) - count(distinct ingredient.name)  AS RecipesUnion, recipe2.ID
    ORDER BY recipe1.ID

Traversing through all nodes and comparing each one with every other one

2 Answers2