Remove rows with same order in a column value obtained from Cartesian product / Cross join.If a row value is AB and other value is BA. I need only AB

Question

how to remove rows with same order in a column value obtained from Cartesian product / Cross join. For e.g., if a row with column value is AB and other row is BA. I need to have only 1 row AB.? Please note that I am doing this for a huge dataset obtained from cartseian product. Hence time of processing & Memory usage is key (I have only 8GB Ram) -- the no of rows is 10 Millions rows by 2000 columns ? For e.g., see below image.

Desired output:

score 0 · Accepted Answer · answered Nov 06 '19 at 05:09

Two ways I can think of.

df = pd.DataFrame({"UserName_x":["A","B","A","C"],
                   "UserName_y":["B","A","C","A"],
                   "edge":["AB","BA","AC","CA"]})

Method 1: sort your edge string alphabetically and drop_duplicates:

df["sorted"] = df["edge"].apply(lambda x: ''.join(sorted(x)))

print (df.drop_duplicates(subset="sorted"))

#
  UserName_x UserName_y edge sorted
0          A          B   AB     AB
2          A          C   AC     AC

Method 2: Create a product of users without duplicates first, and check with isin:

from itertools import combinations_with_replacement

unique = df["UserName_x"].unique().tolist()
combos = ["".join(i) for i in combinations_with_replacement(unique,2)]

print (df[df["edge"].isin(combos)])

#
  UserName_x UserName_y edge sorted
0          A          B   AB     AB
2          A          C   AC     AC

Thanks it's working. my program crashes with "memory error" as I have million of records to process. Any solution to avoid this ?? — shravan, Nov 06 '19 at 05:43
Which method you use that has the memory error? both? It will also help to tell which particular line caused it. — Henry Yik, Nov 06 '19 at 06:31

Remove rows with same order in a column value obtained from Cartesian product / Cross join.If a row value is AB and other value is BA. I need only AB

1 Answers1