I'm pretty new to graph databases and neo4j in general but trying to solve a problem using this approach.
I have two datasets, one with addresses
and DOC_ID
where many addresses are related to document through is_in
relationship. Many addresses can be in each DOC_ID
.
Second dataset has home owners
and DOC_ID
where again owners
have is_in
relationship with each doc id.
Dataset 1 looks like below.
address DOC_ID
1 123 Rock Rd 0134
2. 456 John Drive 2157
3. 789 York St. 9871
4. 927 Farm Ct. 9871
...
Similarly for Dataset 2:
owner DOC_ID
1 John D. 0134
2. Sarah Cote 2157
3. Jack Ma 9871
...
Query I used to load these data in:
# Dataset 1 with addresses
LOAD CSV WITH HEADERS FROM 'file:///addresses.csv' AS row
WITH row WHERE row.address IS NOT NULL
MERGE (l:location {address: row.address, doc_id:row.DOC_ID})
MERGE (d:doc {doc_id:row.DOC_ID})
MERGE (l)-[r:is_in]->(d)
# Dataset 2 with home owners
LOAD CSV WITH HEADERS FROM 'file:///owners.csv' AS row
WITH row WHERE row.owner IS NOT NULL
MERGE (n:home_owner {name: row.owner, doc_id:row.DOC_ID})
MERGE (d:doc {doc_id:row.DOC_ID})
MERGE (n)-[r:is_in]->(d)
The addresses dataset contains address of the owner as well as other irrelevant addresses. Also the same owners may appear under different names but I know they should have the same address. What I'm ultimately trying to do is uncover owner
entities operating under different names.
I can reason certain things, like an address will appear relatively few times in the dataset compared with the address of the owner, which should appear in most of the documents associated with that owner. So I wrote this query below that would help sort out some owner addresses. But I am getting 0 results from this query.
MATCH(n:home_owner-[f:is_in]->(d:doc)<-[r:is_in]-(l:location)
WITH l, n, count(r) as rct
WHERE rct > 20
RETURN l, n
I am wondering if this is a problem with my queries, or if my logic is incorrect. Or is there a better way to solve this problem in neo4j? Any help/insights would be greatly appreciated.