0

how to remove rows with same order in a column value obtained from Cartesian product / Cross join. For e.g., if a row with column value is AB and other row is BA. I need to have only 1 row AB.? Please note that I am doing this for a huge dataset obtained from cartseian product. Hence time of processing & Memory usage is key (I have only 8GB Ram) -- the no of rows is 10 Millions rows by 2000 columns ? For e.g., see below image.

enter image description here

Desired output:

enter image description here

Henry Yik
  • 22,275
  • 4
  • 18
  • 40
shravan
  • 15
  • 1
  • 3

1 Answers1

0

Two ways I can think of.

df = pd.DataFrame({"UserName_x":["A","B","A","C"],
                   "UserName_y":["B","A","C","A"],
                   "edge":["AB","BA","AC","CA"]})

Method 1: sort your edge string alphabetically and drop_duplicates:

df["sorted"] = df["edge"].apply(lambda x: ''.join(sorted(x)))

print (df.drop_duplicates(subset="sorted"))

#
  UserName_x UserName_y edge sorted
0          A          B   AB     AB
2          A          C   AC     AC

Method 2: Create a product of users without duplicates first, and check with isin:

from itertools import combinations_with_replacement

unique = df["UserName_x"].unique().tolist()
combos = ["".join(i) for i in combinations_with_replacement(unique,2)]

print (df[df["edge"].isin(combos)])

#
  UserName_x UserName_y edge sorted
0          A          B   AB     AB
2          A          C   AC     AC
Henry Yik
  • 22,275
  • 4
  • 18
  • 40
  • Thanks it's working. my program crashes with "memory error" as I have million of records to process. Any solution to avoid this ?? – shravan Nov 06 '19 at 05:43
  • Which method you use that has the memory error? both? It will also help to tell which particular line caused it. – Henry Yik Nov 06 '19 at 06:31
  • 1
    I Changed my code. There is not memory crash - Thanks – shravan Nov 06 '19 at 11:09