
I am dealing with some financial data containing reversals. A reversal is basically a correction that offsets another value in the table by adding an entry of the same amount with the opposite sign. My job is to clear those values. Take for example this dataframe:

df = pd.DataFrame({"a":["a","b","c","a","a"],
                  "b":[-2,5,2,2,7],
                 "xtra_col":["X","X","X","X","X"]})

    a   b   xtra_col
0   a   -2  X
1   b   5   X
2   c   2   X
3   a   2   X
4   a   7   X

In this case row 3 is a reversal of row 0 and both have to be dropped. At the same time, row 2 is not a reversal of row 0, even though the values are opposite, because they do not match on column a. The result has to look like this:

    a   b   xtra_col
0   b   5   X
1   c   2   X
2   a   7   X

The question is, how can I delete such reversals from my table? I have looked at drop_duplicates() with subset set to a and b, but that does not work, because it only matches equal values, not opposite ones. I get the feeling that I might be able to achieve something with groupby, but I am not sure how to organize it.
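For illustration, this is the drop_duplicates attempt I mean; since there are no identical (a, b) pairs in the data, nothing gets dropped:

# only exact duplicates of (a, b) are removed, so the reversal pair survives
print(df.drop_duplicates(subset=["a", "b"]))

    a   b   xtra_col
0   a   -2  X
1   b   5   X
2   c   2   X
3   a   2   X
4   a   7   X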

Additional note: it should also work in cases where there is an odd number of negative values. Consider the case below:

df = pd.DataFrame({"a":["a","b","c","a","a"],
                  "b":[-2,5,2,2.0,-2],
                 "xtra_col":["X","X","X","X","X"]})


   a    b xtra_col
0  a -2.0        X
1  b  5.0        X
2  c  2.0        X
3  a  2.0        X
4  a -2.0        X

The output should be:

   a    b xtra_col
1  b  5.0        X
2  c  2.0        X
4  a -2.0        X
callmeGuy

3 Answers


If b is the only possible numeric column, create a filtered DataFrame of the negative rows, invert b by multiplying by -1, match the rows with DataFrame.merge, and finally filter out the matched index values with Series.isin and boolean indexing:

# keep the negative rows and flip the sign of b so they can be matched against positives
df1 = df[df['b'].lt(0)].copy()
df1['b'] *= -1

# pair each flipped negative row with a matching positive row and collect both indices
df2 = df1.reset_index().merge(df.reset_index(), on=['a','b']).filter(like='index_')
print (df2)
   index_x  index_y
0        0        3

# drop every index that appears in a matched pair
df = df[~df.index.isin(df2.values.ravel())]
print (df)
   a  b xtra_col
1  b  5        X
2  c  2        X
4  a  7        X

If another a 2 row is possible and you need to avoid removing it (because it does not pair with another a -2), add a counter column with GroupBy.cumcount to both the filtered and the original DataFrame:

df = pd.DataFrame({"a":["a","b","c","a","a",'a'],
                  "b":[-2,5,2,2,7,2],
                 "xtra_col":["X","X","X","X","X",'X']})


df1 = df[df['b'].lt(0)].copy()
c = df1.select_dtypes(np.number).columns
df1[c] *= -1

df1['g'] = df1.groupby(['a','b']).cumcount()
df['g'] = df.groupby(['a','b']).cumcount()
df2 = df1.reset_index().merge(df.reset_index(), on=['a','b','g']).filter(like='index_')
print (df2)


df = df[~df.index.isin(df2.values.ravel())]
print (df)
   a  b xtra_col  g
1  b  5        X  0
2  c  2        X  0
4  a  7        X  0
5  a  2        X  1
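
A quick check of the same cumcount approach against the second sample data from the question (odd number of negative values) shows the unmatched -2 row is kept:

df = pd.DataFrame({"a": ["a", "b", "c", "a", "a"],
                   "b": [-2, 5, 2, 2.0, -2],
                   "xtra_col": ["X", "X", "X", "X", "X"]})

df1 = df[df['b'].lt(0)].copy()
c = df1.select_dtypes(np.number).columns
df1[c] *= -1

df1['g'] = df1.groupby(['a','b']).cumcount()
df['g'] = df.groupby(['a','b']).cumcount()
df2 = df1.reset_index().merge(df.reset_index(), on=['a','b','g']).filter(like='index_')

# drop the paired indices and the helper column g
print (df[~df.index.isin(df2.values.ravel())].drop(columns='g'))
   a    b xtra_col
1  b  5.0        X
2  c  2.0        X
4  a -2.0        X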
jezrael

Using the power of SQL in Python: here you join the table (dataframe) to itself, checking for cases where column a is the same and column b is reversed. Using the WHERE clause you can then filter out the matched rows.

See the mockup below:

import sqlite3
import pandas as pd

df = pd.DataFrame({"a": ["a", "b", "c", "a", "a"],
                   "b": [-2, 5, 2, 2, 7],
                   "xtra_col": ["X", "X", "X", "X", "X"]})

# make the db in memory
conn = sqlite3.connect(':memory:')
df.to_sql('tab', conn, index=False)

# anti-join: keep only rows that have no opposite-sign partner with the same a
qry = '''
    select
        tab1.a, tab1.b, tab1.xtra_col
    from
        tab as tab1
        left join tab as tab2 on
            tab1.a = tab2.a
            and
            tab1.b = -tab2.b
    where tab2.a is null
    '''
dfres = pd.read_sql_query(qry, conn)
dfres

and results here:

a   b   xtra_col
0   b   5   X
1   c   2   X
2   a   7   X
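
Note that with the second sample from the question (odd number of negative values), this anti-join removes every row that has an opposite-sign partner, so the leftover -2 goes too. A quick sketch of that run, reusing the qry string from above (df_odd and conn_odd are just illustrative names):

df_odd = pd.DataFrame({"a": ["a", "b", "c", "a", "a"],
                       "b": [-2, 5, 2, 2.0, -2],
                       "xtra_col": ["X", "X", "X", "X", "X"]})

conn_odd = sqlite3.connect(':memory:')
df_odd.to_sql('tab', conn_odd, index=False)

# same query as above, run against the new in-memory table
print(pd.read_sql_query(qry, conn_odd))

a   b   xtra_col
0   b   5.0 X
1   c   2.0 X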
MEdwin

Here is another way to do it, using apply to find the invalid rows and remove them:

# Import module
import pandas as pd

# Your data
df = pd.DataFrame({"a": ["a", "b", "c", "a", "a"],
                   "b": [-2, 5, 2, 2, 7],
                   "xtra_col": ["X", "X", "X", "X", "X"]})

# Filtering function
def filter_row(row):
    # Your condition comparing the current row with the whole dataframe
    if sum((df.a == row.a) & (df.b == -row.b)) == 1:
        return row

# Apply the filter method
row_to_remove = df.apply(filter_row, axis=1)
print(row_to_remove)  # You can use dropna to remove the NA rows
#       a    b xtra_col
# 0     a -2.0        X
# 1  None  NaN     None
# 2  None  NaN     None
# 3     a  2.0        X
# 4  None  NaN     None

# Drop invalid rows
result = df[(df != row_to_remove).any(axis=1)]
print(result)
#    a  b xtra_col
# 1  b  5        X
# 2  c  2        X
# 4  a  7        X
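
As the comment above hints, the dropna route can be sketched like this (bad is just an illustrative name, reusing row_to_remove from above):

# rows that are all NaN were never flagged; the remaining indices are the ones to drop
bad = row_to_remove.dropna(how='all')
result = df.drop(bad.index)
print(result)
#    a  b xtra_col
# 1  b  5        X
# 2  c  2        X
# 4  a  7        X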
Alexandre B.