1

I have the following df:

name   id   line_number    add_el    add_ver    del_el    del_ver                                                                  
name1  1    1              elem2     1.3        elem1     1.2
name1  1    2              elem3     1.4        elem3     1.1
name1  1    3              elem1     1.2        elem5     1.3
name2  2    1              elem10    2.0        elem11    2.1
name2  2    3              elem11    2.1        elem13    2.3

The above df represents the lines in a csv file where the del_el is an add_el on another line. I want to add a column action in which the value would be "replace" if for the same (name, id), the del_el is equal to the add_el column on another line_number.

Desired output

name   id   line_number    add_el    add_ver    del_el  del_ver  action                                                                
name1  1    1              elem2     1.3        elem1     1.2    none
name1  1    2              elem3     1.4        elem3     1.1    update
name1  1    3              elem1     1.2        elem5     1.3    replace
name2  2    1              elem10    2.0        elem11    2.1    none
name2  2    3              elem11    2.1        elem13    2.3    replace

Sample code to recreate the input df

df = pd.DataFrame({'name':['name1', 'name1', 'name1', 'name2', 'name2'], 
                   'id': [1, 1, 1, 2, 2], 
                   'line_number': [1, 2, 3, 1, 3], 
                   'add_el': ['elem2', 'elem3', 'elem1', 'elem10', 'elem11'], 
                   'add_ver': ['1.3', '1.4', '1.2', '2.0', '2.1'],
                   'del_el': ['elem1', 'elem3', 'elem5', 'elem11', 'elem13'],
                   'del_ver': ['1.2', '1.1', '1.3', '2.1', '2.3']})

In my current solution, I define the actions as follow: Tuple format: ((add_el_name, added_el_ver), (del_el_name, del_el_ver))

((NaN, NaN), (X1, V1)) - delete
((X1, V2), (X1, V1)) - update
((X1, V1), (X1, V2)) - downgrade
((X1, V1), (X2, V2)) - replace 

Code of my current solution:

def get_action_type(row):
    if str(row['add_el']) == 'nan' and str(row['del_el']) != 'nan':
        return 'delete'
    else:
        if (version.parse(str(row['add_ver'])) > version.parse(str(row['del_ver']))) and str(row['add_el']) == str(row['del_el']):
            return 'update'
        elif (version.parse(str(row['add_ver'])) < version.parse(str(row['del_ver']))) and str(row['add_el']) == str(row['del_el']):
            return 'downgrade'
        elif (version.parse(str(row['add_ver'])) == version.parse(str(row['del_ver']))) and str(row['add_el']) == str(row['del_el']):
            return 'none'
        else:
            if str(row['add_el']) != str(row['del_el']) and str(row['del_el']) != 'nan':
                return 'replace'
            else:
                return 'unknown'

My current solution only verifies if the name and version of the deleted element are different from the ones of the added element. I need the "replace" action to verify if the del_el, del_ver is added to another line_number of the same (name, id).

  • " I created a function that checks if added_element is different from deleted_element" it would be helpful if you would [edit] to include a [mcve] showing the code of your function so that we can help work with your existing code rather than having to rewrite from scratch. It would also be helpful to see what you have tried so far to fill out the 'action' column and what went wrong with your atempts – G. Anderson Feb 14 '22 at 17:24
  • For example, [How to determine whether a dataframe column contains a particular value](https://stackoverflow.com/questions/21319929/how-to-determine-whether-a-pandas-column-contains-a-particular-value) could be helpful here, but we don;t know what you've tried – G. Anderson Feb 14 '22 at 17:25
  • 1
    Interesting question but the title *"Verify if elements of pandas columns have been shuffled"* has no connection to what is going on, nor does it explain what this is about. The code will still work fine regardless of whatever order the dataframe is in. It's all completely order-agnostic. A more accurate title would simply be *"Add column C to dataframe based on whether column A appears in column B"*. – smci Feb 14 '22 at 17:38
  • @G.Anderson see my edit. – Jasmine Latendresse Feb 14 '22 at 17:44
  • I had tried to oversimplify my problem for the purpose of the question, but it might have lacked details. – Jasmine Latendresse Feb 14 '22 at 17:44

2 Answers2

1

One approach is to merge the DataFrame with itself on name, id, and added_element on the left, and deleted_element on the right:

# Create a copy of the original DataFrame and prefill an "action" column
right = df[['name', 'id', 'line_number', 'del_el', 'del_ver']].copy()
right['action'] = 'replace'

print(right)

    name  id  line_number  del_el del_ver   action
0  name1   1            1   elem1     1.2  replace
1  name1   1            2   elem3     1.1  replace
2  name1   1            3   elem5     1.3  replace
3  name2   2            1  elem11     2.1  replace
4  name2   2            3  elem13     2.3  replace

# Left-merge the original DataFrame with this modified version
res = pd.merge(df, right,
               how='left',
               left_on=['name', 'id', 'add_el', 'add_ver'],
               right_on=['name', 'id', 'del_el', 'del_ver'],
               suffixes=['', '_right']).drop(columns='del_el_right')

# This has no effect given your example data, but we want
# to ensure that a "replace" action is only assigned to rows
# where the deleted and added elements had different line numbers
res = res[res['line_number'] != res['line_number_right']]
res = res.drop(columns=['line_number_right', 'del_ver_right'])    

print(res)

    name  id  line_number  add_el add_ver  del_el del_ver   action
0  name1   1            1   elem2     1.3   elem1     1.2      NaN
1  name1   1            2   elem3     1.4   elem3     1.1      NaN
2  name1   1            3   elem1     1.2   elem5     1.3  replace
3  name2   2            1  elem10     2.0  elem11     2.1      NaN
4  name2   2            3  elem11     2.1  elem13     2.3  replace
Peter Leimbigler
  • 10,775
  • 1
  • 23
  • 37
  • This seems to work (regardless of my edit). However, It seems that when I use this solution on my real data, I am losing rows with the merge. – Jasmine Latendresse Feb 14 '22 at 18:09
  • I noticed the line `res = res[res['line_number'] != res['line_number_right']]` is the one making me lose rows. I understand the purpose of it, but it drops rows that I want to keep. Is there another way to ensure that the "replace" action is assigned to the desired rows? – Jasmine Latendresse Feb 14 '22 at 21:02
  • Weird that you're losing rows... I've updated my answer to account for your edited question, so this merges on the version numbers as well. Maybe that was the issue? – Peter Leimbigler Feb 14 '22 at 22:47
0

The solution I came up with consists in grouping the rows by name and id and aggregating the columns added and deleted into a list(removed version for simplicity purpose). More info here.

res = df.groupby(['name', 'id']).agg(tuple).applymap(list).reset_index()

I then create a column replaced with list comprehension that returns the set intersection between added and deleted elements. More info here.

res['replaced'] = [(set(a) & set(b)) if len((set(a) & set(b))) != 0 else 'NaN' for a, b in zip(res.added, res.deleted)]
res = res[['name', 'id', 'replaced']] #selecting necessary columns

I merge the result with the original dataframe so I have the set intersection in each row.

res_final = pd.merge(df, res, on=['name', 'id']) #merging with original df

I finally create a function that checks if the deleted element appears in the set intersection column replaced. If yes, then the label "replace" is added. Else, I just return the action that was previously there. To ensure that we are not looking at elements on the same row, I verify if the action isn't none (based on the code in my question post).

def is_it_replaced(row):
    if str(row['deleted']) in str(row['replaced']) and str(row['action']) != 'none':
        return 'replace'
    else:
        return str(row['action'])

res_final['action_type'] = res_final.apply(lambda x: is_it_replaced(x), axis=1)
res_final = res_final.drop(columns=['action', 'replaced']) #final cleanup

Good: it works Bad: it's slow, especially if you dataframe is big. It is preferable to avoid list comprehension.