1

I am a beginner in using Python. I have a MS Access database containing a family tree. I wish to use Python to query the family tree to list all the descendants of identified ancestors. The family tree is held in a table I have exported to excel. It has three columns: Relationship ID, Node1_ID, Node2_ID. Relationships listed in each row are uni-directional from Node1_ID to Node2_ID. The tree is non-binary, whereby the tree can have several Node2_IDs descending from a Node1_ID.

Each partner of a marriage in the position of Node1_ID relate to the same partnership node in Node2_ID (in the example below parent Nodes 1 and 2 relate to partnership Node 3). The partnership node then relates to children Nodes (e.g., partnership Node 3 connects to children Nodes 4, 5 and 6).

RID Node1_ID Node2_ID
1 1 3
2 2 3
3 3 4
4 3 5
5 3 6

I wish to use a code to call the excel spreadsheet and allow me to query a list of all descendants from a particular node through recursion. I then wish to write that list to a new excel file that I can import back into my MS Access database or to visualise in other software like GraphViz.

In searching online, I have come across the following code, which seems to do the trick:

import pandas as pd
import sys
sys.setrecursionlimit(100000000)

df = pd.read_excel(r'excelfilepath')
columns = df.columns.ravel()

def get_ancestry_dataframe_flat(df):

    def get_child_list(parent_id):

        list_of_children = list()
        list_of_children.append(df[df['P1ID'] == parent_id]['P2ID'].values)

        for i, r in df[df['P1ID'] == parent_id].iterrows():
            if r['P2ID'] != parent_id:
                list_of_children.append(get_child_list(r['P2ID']))

        # flatten list
        list_of_children = [item for sublist in list_of_children for item in sublist]
        return list_of_children

    new_df = pd.DataFrame(columns=['descendant', 'ancestor']).astype(int)
    for index, row in df.iterrows():
        temp_df = pd.DataFrame(columns=['descendant', 'ancestor'])
        temp_df['descendant'] = pd.Series(get_child_list(row['P1ID']))
        temp_df['ancestor'] = row['P1ID']
        new_df = new_df.append(temp_df)

    new_df = new_df\
        .drop_duplicates()\
        .sort_values(['ancestor', 'descendant'])\
        .reset_index(drop=True)

    return new_df

writer = pd.ExcelWriter('20210408_descendant_relationships.xlsx',engine='xlsxwriter')
get_ancestry_dataframe_flat(df).to_excel(writer,sheet_name='Sheet1')
writer.save()

The problem with this code, is that I seem to hit the recursion limit when creating a list of descendants from all ancestors (The family tree is quite big). I wish to therefore query only descendants from specific ancestors. What edits can I make to do this please? Aside from the recursion limit issue, I will still want to identify descendants from specific ancestors.

I am a beginner in Python, so I would appreciate lamens terms if possible. Thank you for your time.

EDIT: code with advice from @Ajax1234


import pandas as pd
from collections import deque

df = pd.read_excel('input.xlsx')
tree = df.values.tolist()

def get_ancestors(node):
   d, seen = deque([node]), set()
   while d:
      yield (n:=d.popleft())
      seen.add(n)
      d.extend([int(c) for _, b, c in tree if int(b) == int(n)])

r = dict([(next(n:=get_ancestors(i)), list(n)) for i in set([t[1] for t in tree])])
vals = [[[a, i] for i in b] for a, b in r.items()]
df1 = pd.DataFrame([dict(zip(['node', 'descendent'], i)) for i in vals])

print(r)

df1.to_excel('output.xlsx')

The intended output excel table should be something like:

Node Descendant
1 3
1 4
1 5
1 6
2 3
2 4
2 5
2 6
3 4
3 5
3 6
Ihaka
  • 25
  • 4

1 Answers1

0

You can use a breadth-first search for a non-recursive solution:

from collections import deque
tree = [[1, 1, 3], [2, 2, 3], [3, 3, 4], [4, 3, 5], [5, 3, 6]]
def get_ancestors(node):
   d, seen = deque([node]), set()
   while d:
      yield (n:=d.popleft())
      seen.add(n)
      d.extend([int(c) for _, b, c in tree if int(b) == int(n)])

get_ancestors takes in a node and produces all descendants of the node via a breadth-first search. As an example, to get all the descendants of the node1_ids:

r = dict([(next(n:=get_ancestors(i)), list(n)) for i in set([t[1] for t in tree])])

Output:

{1: [3, 4, 5, 6], 2: [3, 4, 5, 6], 3: [4, 5, 6]}

Edit: writing to Excel:

writer = pd.ExcelWriter('20210408_descendant_relationships.xlsx',engine='xlsxwriter')
df = pd.read_excel(writer, 'sheet1')
tree = df.values.tolist()
r = dict([(next(n:=get_ancestors(i)), list(n)) for i in set([t[1] for t in tree])])
vals = [[a, i] for a, b in r.items() for i in b]
df1 = pd.DataFrame([dict(zip(['node', 'descendent'], i) for i in vals])
df1.to_excel(writer)
Ajax1234
  • 69,937
  • 8
  • 61
  • 102
  • Thank you. That looks like a useful output. How do I read the existing excel file to replace the 'tree' array? And then print the output into a new excel file? – Ihaka Apr 08 '21 at 22:44
  • Thank you @Ajax1234. How do I first read data from an existing excel sheet as a list instead of manually typing out the list as you have done for 'tree'? In using the code you have provided so far, I receive the following error:ValueError: File is not a recognized excel file – Ihaka Apr 08 '21 at 23:46
  • @Ihaka Oh yes, please see my recent edit. – Ajax1234 Apr 08 '21 at 23:51
  • Thank you again. I have updated my code at the bottom of my initial post. I still receive the error message of "ValueError: File is not a recognized excel file". Could you please check if I have the code structure correct based on your advice and if you have advice on the error? – Ihaka Apr 09 '21 at 00:08
  • @Ihaka Make sure you are providing the full path to the excel file as in `dir1/dir2/20210408PythonRecursionTest.xlsx`; you can also save the file and run it in the same directory as your excel file. – Ajax1234 Apr 09 '21 at 00:11
  • I couldn't get passed that error, so I used pd.read_excel to read straight from the original excel file and df.to_excel to write to a new one (please see code edits above). @Ajax1234 The problem now is that the output excel file has not ordered the nodes and descendants appropriately. I wish to have the node ID repeated for each of its descendant IDs. Each cell should only contain one ID (see intended output table in edits above). Thank you. – Ihaka Apr 09 '21 at 01:52
  • @Ihaka Thank you for the edit, I made a change under my "Edit" section, namely, a slight update to the list comprehension for `vals` – Ajax1234 Apr 09 '21 at 01:55
  • It worked! I almost cried with joy. Thank you @Alex1234. You have made a stranger very happy! – Ihaka Apr 09 '21 at 02:32
  • @Ihaka Glad to help! – Ajax1234 Apr 09 '21 at 02:33