I have a df:

df = pd.DataFrame({'src':['LV','LA','NC','NY','ABC','XYZ'], 'dest':['NC','NY','LV','LA','XYZ','ABC'], 'dummy':[1,3,6,7,8,10]})
src   dest   dummy
LV      NC       1
LA      NY       3
NC      LV       6
NY      LA       7
ABC     XYZ      8
XYZ     ABC     10

I run it through:

df['pair'] = df[['src', 'dest']].apply(lambda x : tuple(set(x)), 1).factorize()[0] + 1

to try to give both directions of a pair (a->b and b->a) the same key.

I correctly end up with this:

src   dest   dummy  pair
LV      NC       1     1
LA      NY       3     2
NC      LV       6     1
NY      LA       7     2
ABC     XYZ      8     3
XYZ     ABC     10     3

However, sometimes when I run it I end up incorrectly with this:

src   dest   dummy  pair
LV      NC       1     1
LA      NY       3     2
NC      LV       6     1
NY      LA       7     2
ABC     XYZ      8     3
XYZ     ABC     10     4

As you can see, the last element is not being properly keyed off to pair '3' for some reason. This happens randomly. I am able to reproduce this by commenting out the 'pairing off' code, running the script to make and print the df, then uncommenting and trying again. You may be able to reproduce this in other ways by running with other modifications.

How can I fix this non-deterministic behavior?


1 Answer

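The problem is the set: the order in which a set's elements come out when you convert it to a tuple is not guaranteed, so ('ABC', 'XYZ') and ('XYZ', 'ABC') can end up as two different keys. You can change it to frozenset, or sort each pair before building the tuple: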

df['pair'] = pd.DataFrame(np.sort(df[['src','dest']].values,1)).agg(tuple,1).factorize()[0]+1
Out[108]: array([1, 2, 1, 2, 3, 3], dtype=int64)
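
For comparison, here is a minimal sketch of both order-independent keys mentioned above, rebuilt from the question's data. The names `fs_keys`, `tup_keys` and the `pair_sorted` column are illustrative only and do not appear in the original post. It assumes that a frozenset key (hashable, compared by content) and a sorted tuple both remove the dependence on set iteration order.

```python
import pandas as pd

df = pd.DataFrame({
    'src':   ['LV', 'LA', 'NC', 'NY', 'ABC', 'XYZ'],
    'dest':  ['NC', 'NY', 'LV', 'LA', 'XYZ', 'ABC'],
    'dummy': [1, 3, 6, 7, 8, 10],
})

# Variant 1: frozenset keys. Two frozensets with the same members are
# equal and hash the same, regardless of the order they were built in.
fs_keys = pd.Series(
    [frozenset(p) for p in zip(df['src'], df['dest'])],
    index=df.index,
)
df['pair'] = fs_keys.factorize()[0] + 1

# Variant 2: sorted tuples. Sorting fixes the element order explicitly,
# so (a, b) and (b, a) always produce the same tuple.
tup_keys = pd.Series(
    [tuple(sorted(p)) for p in zip(df['src'], df['dest'])],
    index=df.index,
)
df['pair_sorted'] = tup_keys.factorize()[0] + 1

print(df)
```

Both columns label the rows in order of first appearance, so reversed pairs share a number. One difference to keep in mind: a frozenset collapses a row where src equals dest into a single-element key, while the sorted tuple keeps both elements.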