0

I'm pretty new to graph databases and neo4j in general but trying to solve a problem using this approach.

I have two datasets, one with addresses and DOC_ID where many addresses are related to document through is_in relationship. Many addresses can be in each DOC_ID.

Second dataset has home owners and DOC_ID where again owners have is_in relationship with each doc id.

Dataset 1 looks like below.

      address               DOC_ID        
1    123 Rock Rd            0134
2.   456 John Drive         2157
3.   789 York St.           9871
4.   927 Farm Ct.           9871
...

Similarly for Dataset 2:

      owner               DOC_ID        
1    John D.                0134
2.   Sarah Cote             2157
3.   Jack Ma                9871
...

Query I used to load these data in:

# Dataset 1 with addresses
LOAD CSV WITH HEADERS FROM 'file:///addresses.csv' AS row
WITH row WHERE row.address IS NOT NULL
MERGE (l:location {address: row.address, doc_id:row.DOC_ID})
MERGE (d:doc {doc_id:row.DOC_ID})
MERGE (l)-[r:is_in]->(d)

# Dataset 2 with home owners
LOAD CSV WITH HEADERS FROM 'file:///owners.csv' AS row
WITH row WHERE row.owner IS NOT NULL
MERGE (n:home_owner {name: row.owner, doc_id:row.DOC_ID})
MERGE (d:doc {doc_id:row.DOC_ID})
MERGE (n)-[r:is_in]->(d)

The addresses dataset contains address of the owner as well as other irrelevant addresses. Also the same owners may appear under different names but I know they should have the same address. What I'm ultimately trying to do is uncover owner entities operating under different names.

I can reason certain things, like an address will appear relatively few times in the dataset compared with the address of the owner, which should appear in most of the documents associated with that owner. So I wrote this query below that would help sort out some owner addresses. But I am getting 0 results from this query.

MATCH(n:home_owner-[f:is_in]->(d:doc)<-[r:is_in]-(l:location)
WITH l, n, count(r) as rct
WHERE rct > 20
RETURN l, n

I am wondering if this is a problem with my queries, or if my logic is incorrect. Or is there a better way to solve this problem in neo4j? Any help/insights would be greatly appreciated.

amnesic
  • 259
  • 1
  • 7

1 Answers1

1

The 2nd line: WITH l, n, count(r) as rct is the problem. When you do an aggregate, it will include l and n in the grouping (similar to SQL group by). Thus, you are not getting the correct rct count because each of location AND owner are counted one in the grouping. This is the correct query for your question.

MATCH (n:home_owner)-[r:is_in]->(:doc)<-[:is_in]-(:location)
//find the homeowners with more than 20 address
WITH n, count(r) as rct WHERE rct > 20
//find the location of those homeowners
MATCH (n)-[:is_in]->(:doc)<-[:is_in]-(l:location)
RETURN n, l

Secondly, there are typo errors on your loading scripts. I corrected them now.

Dataset 1 with addresses
old: MERGE (d:doc {DOC_ID})
correct: MERGE (d:doc {doc_id:row.DOC_ID})
Dataset 2 with home owners
old: MERGE (n)-[r:is_in]->(b)
correct: MERGE (n)-[r:is_in]->(d)
jose_bacoy
  • 12,227
  • 1
  • 20
  • 38