I have a schema which looks like below:
A
customer
is linked to anothercustomer
with a relationshipSIMILAR
havingsimilarity score
.Example:
(c1:Customer)-->(c2:Customer)
An
Email
node is connected to eachcustomer
with relationshipMAIL_AT
with the following node properties:
{
"active_email_address": "a@mail.com",
"cibil_email_addresses": [
"b@mail.com", "c@mail.com"
]
}
Example: (e1:Email)<-[:MAIL_AT]-(c1:Customer)-[:SIMILAR]->(c2:Customer)-[:MAIL_AT]->(e2:Email)
- A
Risk
node with some risk-related properties (below) and is related tocustomer
with relationshipHAS_RISK
:
{
"f0_score": 870.0,
"pta_score": 430.0
}
- A
Fraud
node with some fraud-related properties (below) and is related tocustomer
with relationshipIS_FRAUD
:
{
"has_commited_fraud": true
}
My Objectives:
- To find the customers with common email addresses (irrespective of active and secondary)?
My tentative solution:
MATCH (email:Email)
WITH email.cibil_email_addresses + email.active_email_address AS emailAddress, email
UNWIND emailAddress AS eaddr
WITH DISTINCT eaddr AS deaddr, email
UNWIND deaddr AS eaddress
MATCH (customer:Customer)-[]->(someEmail:Email)
WHERE eaddress IN someEmail.cibil_email_addresses + someEmail.active_email_address
WITH eaddress, COLLECT(customer.customer_id) AS customers
RETURN eaddress, customers
Problem: It is taking forever to execute this. Working with lists will take time I understand, however, I'm flexible to change the schema (If suggested). Should I break the email address into separate nodes? If yes, then how do I break cibil_email_addresses
into different nodes as they can vary - Should I create two nodes with different cibil email addresses
and connect both of them to customer with relationship HAS_CIBIL_EMAIL
? (Is this a valid schema design). Also, it is possible, a customer's active_email_address
is present in other customer's cibil_email_address
. I'm trying to find a synthetic identity attack. PS: If there exists some APOC
that can help achieve this and below, do suggest with example.
In production, for a given customer with
email addresses
,risk values
,similarity score
, and also given other customers may or may not be tagged withfraud_status
, I want to check whether this new person will fall in a fraud ring or not. PS: If I need to use anygds
to solve this, please suggest with examples.If I were to do this same exercise with some other node such as
Address
which may be partially matching and will be having same list of historical addresses in a list, what should be my ideal approach?
I know, I'm tagging someone in my question, but that person only seems to be active with respect to Cypher on StackOverflow. @cybersam any help? Thanks.