4

I am using neo4j to solve a realtime normalization problem. Lets say I have 3 places from 2 different sources. 1 source 45 gives me 2 places that are in-fact duplicates of each other, and 1 source 55 gives me 1 correct identifier. However, for any place identifier (duplicate or not), I want to find the closest set of places that are unique by a feed identifier. My data looks like so:

CREATE (a: Place {feedId:45, placeId: 123, name:"Empire State", address: "350 5th Ave", city: "New York", state: "NY", zip: "10118" })
CREATE (b: Place {feedId:45, placeId: 456, name:"Empire State Building", address: "350 5th Ave", city: "New York", state: "NY"})
CREATE (c: Place {feedId:55, placeId: 789, name:"Empire State", address: "350 5th Ave", city: "New York", state: "NY", zip: "10118"})

I have connected these nodes by Matching nodes so I can do some normalization on the data. For instance:

MERGE (m1: Matching:NameAndCity { attr: "EmpireStateBuildingNewYork", cost: 5.0 })
MERGE (a)-[:MATCHES]-(m1)
MERGE (b)-[:MATCHES]-(m1)
MERGE (c)-[:MATCHES]-(m1)
MERGE (m2: Matching:CityAndZip { attr: "NewYork10118", cost: 7.0 })
MERGE (a)-[:MATCHES]-(m2)
MERGE (c)-[:MATCHES]-(m2)

When I want to find what are the closest matches from a start place id, I can run a match on all paths from the start node, ranked by cost, ie:

MATCH p=(a:Place {placeId:789, feedId:55})-[*..4]-(d:Place)
WHERE NONE (n IN nodes(p)
        WHERE size(filter(x IN nodes(p)
                          WHERE n = x))> 1)
WITH    p,
        reduce(costAccum = 0, n in filter(n in nodes(p) where has(n.cost)) | costAccum+n.cost) AS costAccum
        order by costAccum
RETURN p, costAccum

However, as there are multiple paths to the same places, I get the same node replicated multiple times when querying like this. Is it possible to collect the nodes and their costs, and then only return a distinct subset (for e.g., give me the best result from feed 45 and 55?

How could I return a distinct set of paths, ranked by cost, and unique by the feed identifier? Am I structuring this type of problem wrong?

Please help!

eurobrew
  • 77
  • 4
  • could you provide a more complete sample graph using http://console.neo4j.org? – Stefan Armbruster Apr 25 '15 at 20:11
  • Not complitly sure on what you want, but you do know the "Order by" and "Distinct? (http://neo4j.com/docs/stable/query-order.html#order-by-order-nodes-in-descending-order and http://neo4j.com/docs/stable/query-aggregation.html#aggregation-distinct)? – Thomas Apr 26 '15 at 21:46

1 Answers1

0

You can collect all paths for each place d, and then just take the best path in each collection (since they will be sorted then collected)

MATCH p=(a:Place {placeId:789, feedId:55})-[*..4]-(d:Place)
WITH d, collect(p) as paths,
        reduce(costAccum = 0, n in filter(n in nodes(p) where has(n.cost)) | costAccum+n.cost) AS costAccum
        order by costAccum
RETURN head(paths) as p, costAccum
Tezra
  • 8,463
  • 3
  • 31
  • 68