Group together individuals based on commonalities(direct or indirect) in two other columns for 7.5 million dataset in python

Question

I have a dataframe which contains

cust_id|phone|email 
1   678   a
2   NaN   c
3   987   b
4   456   NaN
5   NaN   d
7   456   c

All the cust_ids with either matching phone or email are directly related.eg.cust_id 1 is directly related to 2 and 2 is directly related to 3.

Cust_id 1 is indirectly related to 3- they don't have same phone or email but are related through 2.

I want to club and give a unique number to a group which is directly related or is indirectly related

Desired output:

Cust_id|phone|email | group_no
1     678   a   1
2     NaN   c   2
3     987   b   3
4     456   NaN 2
5     NaN   d   4
7     456   c   2

Obtained output:

Cust_id|phone|email | group_no

1     678   a    1
2     NaN   c    2
3     987   b    3
4     456   NaN  2
5     NaN   d    2
7     456   c    2

how do i do this for a dataset that has 7.5 million rows without compromising on speed.

I used the following code in the picture. enter image description here

score 0 · Answer 1 · answered Feb 08 '22 at 16:20

0

This looks like the perfect case for using a graph database. If you are interested in that, download Neo4j desktop and we will take it from there. You could google

neo4j-admin bulk import tool

With your database size, I expect it will take about 1 min.

answered Feb 08 '22 at 16:20

Sanjay Singh

367
1
11

Hi sanjay, I am restricted to use only python. Hope you can help. – chaitanya Feb 10 '22 at 09:44
No issues. Use the Networkx library. It will be slower but at least you will be working in pure Python. – Sanjay Singh Feb 10 '22 at 10:23
hi, i used the code in the picture but am getting an error. Can you guide me through this – chaitanya Feb 10 '22 at 12:30
Your graph initialisation seems to be incorrect. The way I see the solution, you need to create three types of nodes (customer, phone, email) and two types of relationships (has_phone, has_email). Then you are on the right track with what you have done. You will call the strongly connected algorithm of networkx and assign component ID to each customer (who will now be linked to a subgraph of multiple phones and emails). I am travelling and if you still need help, I’ll give you the detailed code in a couple of hours. – Sanjay Singh Feb 10 '22 at 12:50
Yeah..that would help a lot – chaitanya Feb 11 '22 at 06:16

score 0 · Answer 2 · answered Feb 11 '22 at 07:29

Here you go Chaitanya

# custom class sub-classed from Graph for low-memory requirements as you don't need edge weights
    class ThinGraph(nx.Graph):
        all_edge_dict = {"weight": 1}
        def single_edge_dict(self):
            return self.all_edge_dict
            edge_attr_dict_factory = single_edge_dict
    # create a blank Graph object from the Networkx library
    G = ThinGraph()
    # add nodes to G
    nodes_array =[]
    customer_id_phone_edges_array = []
    customer_id_email_edges_array = []
    for _,row in df.iterrows():
        nodes_array.append((row["cust_id"],{"label":"Customer ID"}))
        nodes_array.append((row["phone"],{"label":"Phone"}))
        nodes_array.append((row["email"],{"label":"Email"}))
        # Create your customer_ID, phone graph edges
        customer_id_phone_edges_array.append((row["cust_id"],row["phone"]))
        # Create your customer_ID, email graph edges
        customer_id_email_edges_array.append((row["cust_id"],row["email"]))
    #Add the nodes and edges
    G.add_nodes_from(nodes_array)
    G.add_edges_from(customer_id_phone_edges_array)
    G.add_edges_from(customer_id_email_edges_array)
    # delete objects to free up memory
    del nodes_array
    del customer_id_phone_edges_array
    del customer_id_email_edges_array
    # run the connected components algorithm
    components = nx.connected_components(G)
    comp_dict = {idx: comp for idx, comp in enumerate(components)}
    attr = {n: comp_id for comp_id, nodes in comp_dict.items() for n in nodes}
    nx.set_node_attributes(G,  attr,"component")
    # Examine the result
    G.nodes(data=True)

For your given data, you will get the output

NodeDataView({1: {'label': 'Customer ID', 'component': 0}, 678: {'label': 'Phone', 'component': 0}, 'a': {'label': 'Email', 'component': 0}, 2: {'label': 'Customer ID', 'component': 0}, 'b': {'label': 'Email', 'component': 0}, 3: {'label': 'Customer ID', 'component': 0}, 987: {'label': 'Phone', 'component': 0}, 4: {'label': 'Customer ID', 'component': 1}, 456: {'label': 'Phone', 'component': 1}, 'd': {'label': 'Email', 'component': 1}, 5: {'label': 'Customer ID', 'component': 1}, 654: {'label': 'Phone', 'component': 1}, 7: {'label': 'Customer ID', 'component': 1}, 'f': {'label': 'Email', 'component': 1}})

In fact you should optimise memory usage by doing the df iteration thrice, adding the resulting node/edge array and deleting it. The Graph object is stored in-memory. — Sanjay Singh, Feb 11 '22 at 07:43

Group together individuals based on commonalities(direct or indirect) in two other columns for 7.5 million dataset in python

2 Answers2