How can I impose restrictions on a variable-length path?

I have a query that returns all possible paths from some start node:

CREATE INDEX ON :Person(id);
MATCH all_paths_from_Start = (start:Person)-[:FRIENDSHIP*1..20]->(person:Person)
WHERE start.id = 128 AND start.country <> "Uganda"
RETURN all_paths_from_Start;

Now I want to filter out all paths that contain at least two persons from the same country. How can I do that?

VB_

2 Answers

1) Build an array of the countries along the path, possibly with duplicates: REDUCE

2) Remove the duplicates and compare the sizes of the two arrays: UNWIND + COLLECT(DISTINCT ...)

MATCH path = (start:Person)-[:FRIENDSHIP*1..20]->(person:Person)
      WHERE start.id = 128 AND start.country <> "Uganda"
WITH path, 
     REDUCE(acc=[], n IN NODES(path) | acc + n.country) AS countries
     UNWIND countries AS country
WITH path, 
     countries, COLLECT(DISTINCT country) AS distinctCountries
     WHERE SIZE(countries) = SIZE(distinctCountries)
RETURN path
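
To see what steps 1 and 2 do in isolation, the same UNWIND + COLLECT(DISTINCT ...) comparison can be run on a literal list instead of a path. This is only a small sketch: the country values are made up for illustration, and `allUnique` is a hypothetical name that does not appear in the query above.

```cypher
// Hypothetical list of countries along one path; 'Kenya' appears twice
WITH ['Kenya', 'Uganda', 'Kenya'] AS countries
// Expand the list into one row per country
UNWIND countries AS country
// Re-collect without duplicates, grouped by the original list
WITH countries, COLLECT(DISTINCT country) AS distinctCountries
// The sizes differ (3 vs 2), so this returns false
RETURN SIZE(countries) = SIZE(distinctCountries) AS allUnique
```

A path whose countries are all different would give equal sizes and therefore pass the WHERE filter in the full query.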

P.S. REDUCE can be replaced by EXTRACT (thanks to Gabor Szarnyas):

MATCH path = (start:Person)-[:FRIENDSHIP*1..20]->(person:Person)
      WHERE start.id = 128 AND start.country <> "Uganda"
WITH path, 
     EXTRACT(n IN NODES(path) | n.country) AS countries
     UNWIND countries AS country
WITH path, 
     countries, COLLECT(DISTINCT country) AS distinctCountries
     WHERE SIZE(countries) = SIZE(distinctCountries)
RETURN path

P.P.S. Thanks again to Gabor Szarnyas for another idea for simplifying the query:

MATCH path = (start:Person)-[:FRIENDSHIP*1..20]->(person:Person)
      WHERE start.id = 128 AND start.country <> "Uganda"
WITH path
     UNWIND NODES(path) AS person
WITH path, 
     COLLECT(DISTINCT person.country) as distinctCountries
     WHERE LENGTH(path) + 1 = SIZE(distinctCountries)
RETURN path
stdob--
  • Thank you for your answer! Is this the best choice from a performance perspective? I will have a very big dataset, with up to a million nodes. Is there any way to make the query faster? – VB_ Jan 29 '17 at 20:49
  • @VolodymyrBakhmatiuk The heaviest part of the query is the first `MATCH`: `MATCH path = (start:Person)-[:FRIENDSHIP*1..20]->(person:Person) WHERE start.id = 128 AND start.country <> "Uganda"`. At the moment I don't quite see how to improve it ... – stdob-- Jan 29 '17 at 21:01
  • At least I can index `id` and `country`. Please let me know if you have any ideas. – VB_ Jan 29 '17 at 21:04
  • @VolodymyrBakhmatiuk Yes, `id` and `country` should definitely be indexed. – stdob-- Jan 29 '17 at 21:06
  • @stdob-- I think we might get by without `extract`: `UNWIND NODES(path) AS node WITH path, COLLECT(node.country) AS countries, COLLECT(DISTINCT node.country) AS distinctCountries` I'm not sure if it will have a measurable effect on performance, but it may be easier to read. – Gabor Szarnyas Jan 29 '17 at 22:55
  • @GaborSzarnyas Yes, you are right. Added an option based on your ideas. – stdob-- Jan 29 '17 at 23:17

One solution I can think of is to take the nodes of the path and, for each person on the path, count the persons on the path from the same country (determined by filtering on the country property). A path consists of persons from unique countries if that count is exactly one for every person, i.e. the only person from that country is the person himself/herself.

MATCH p = (start:Person {id: 128})-[:FRIENDSHIP*1..20]->(person:Person)
WHERE start.country <> "Uganda"
WITH p, nodes(p) AS persons
WITH p, extract(p1 IN persons | size(filter(p2 IN persons WHERE p1.country = p2.country))) AS personsFromSameCountry
WHERE size(filter(p3 IN personsFromSameCountry WHERE p3 > 1)) = 0
RETURN p

The query is syntactically correct but I didn't test it on any data.

Note that I moved the id = 128 condition into the pattern and shortened the all_paths_from_Start variable to p.
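
On newer Neo4j versions the extract and filter functions are, as far as I know, no longer available (they were removed in Neo4j 4.0). List comprehensions together with the none predicate should express the same check. An untested sketch under the same assumptions as the query above (Person nodes with id and country properties); the variable `c` is introduced here only for illustration:

```cypher
MATCH p = (start:Person {id: 128})-[:FRIENDSHIP*1..20]->(person:Person)
WHERE start.country <> "Uganda"
WITH p, nodes(p) AS persons
// For each person, count the persons on the path with the same country
WITH p, [p1 IN persons | size([p2 IN persons WHERE p1.country = p2.country])] AS personsFromSameCountry
// Keep the path only if no country is shared by more than one person
WHERE none(c IN personsFromSameCountry WHERE c > 1)
RETURN p
```

Persons without a country property compare as null here, so they never count as sharing a country; whether that is the desired behaviour depends on the data.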

Gabor Szarnyas