Cypher query to reveal same entity through relationships

Question

I'm pretty new to graph databases and neo4j in general but trying to solve a problem using this approach.

I have two datasets, one with addresses and DOC_ID where many addresses are related to document through is_in relationship. Many addresses can be in each DOC_ID.

Second dataset has home owners and DOC_ID where again owners have is_in relationship with each doc id.

Dataset 1 looks like below.

      address               DOC_ID        
1    123 Rock Rd            0134
2.   456 John Drive         2157
3.   789 York St.           9871
4.   927 Farm Ct.           9871
...

Similarly for Dataset 2:

      owner               DOC_ID        
1    John D.                0134
2.   Sarah Cote             2157
3.   Jack Ma                9871
...

Query I used to load these data in:

# Dataset 1 with addresses
LOAD CSV WITH HEADERS FROM 'file:///addresses.csv' AS row
WITH row WHERE row.address IS NOT NULL
MERGE (l:location {address: row.address, doc_id:row.DOC_ID})
MERGE (d:doc {doc_id:row.DOC_ID})
MERGE (l)-[r:is_in]->(d)

# Dataset 2 with home owners
LOAD CSV WITH HEADERS FROM 'file:///owners.csv' AS row
WITH row WHERE row.owner IS NOT NULL
MERGE (n:home_owner {name: row.owner, doc_id:row.DOC_ID})
MERGE (d:doc {doc_id:row.DOC_ID})
MERGE (n)-[r:is_in]->(d)

The addresses dataset contains address of the owner as well as other irrelevant addresses. Also the same owners may appear under different names but I know they should have the same address. What I'm ultimately trying to do is uncover owner entities operating under different names.

I can reason certain things, like an address will appear relatively few times in the dataset compared with the address of the owner, which should appear in most of the documents associated with that owner. So I wrote this query below that would help sort out some owner addresses. But I am getting 0 results from this query.

MATCH(n:home_owner-[f:is_in]->(d:doc)<-[r:is_in]-(l:location)
WITH l, n, count(r) as rct
WHERE rct > 20
RETURN l, n

I am wondering if this is a problem with my queries, or if my logic is incorrect. Or is there a better way to solve this problem in neo4j? Any help/insights would be greatly appreciated.

Based on your sample data in dataset1 and dataset2 above, what is your expected result? Thanks. — jose_bacoy, Sep 02 '22 at 12:07

score 1 · Answer 1 · answered Sep 02 '22 at 12:34

1

The 2nd line: WITH l, n, count(r) as rct is the problem. When you do an aggregate, it will include l and n in the grouping (similar to SQL group by). Thus, you are not getting the correct rct count because each of location AND owner are counted one in the grouping. This is the correct query for your question.

MATCH (n:home_owner)-[r:is_in]->(:doc)<-[:is_in]-(:location)
//find the homeowners with more than 20 address
WITH n, count(r) as rct WHERE rct > 20
//find the location of those homeowners
MATCH (n)-[:is_in]->(:doc)<-[:is_in]-(l:location)
RETURN n, l

Secondly, there are typo errors on your loading scripts. I corrected them now.

Dataset 1 with addresses

old: MERGE (d:doc {DOC_ID})
correct: MERGE (d:doc {doc_id:row.DOC_ID})

Dataset 2 with home owners

old: MERGE (n)-[r:is_in]->(b)
correct: MERGE (n)-[r:is_in]->(d)

answered Sep 02 '22 at 12:34

jose_bacoy

12,227
1
20
38

Sorry about the typos. So I've tried your query and I get a graph of location and owner nodes spread about in a random way with no relationships. Not sure now how to make sense of it. – amnesic Sep 02 '22 at 14:08
give us examples of such bad nodes. Also check your csv files to see if it valid or not. – jose_bacoy Sep 02 '22 at 14:15
can I DM you with some more thoughts on queries? – amnesic Sep 12 '22 at 06:26
how to DM you? LinkedIn or email? – amnesic Sep 12 '22 at 14:43

Cypher query to reveal same entity through relationships

1 Answers1

Dataset 1 with addresses

Dataset 2 with home owners