0

I have a data frame with x variables and an id_number 1:n (n is large). I want to create a new data frame that horizontally merges each pair based on id_number from the data frame. Original data looks like this:

id_number   var_x1   var_x2   
1           sth      stuff   
2           other    things  
3           more     info  

I want to get this for every possible pair:

id_numberA  var_x1A var_x2A id_numberB var_x1B var_x2B
1           sth     stuff   1          sth     stuff
1           sth     stuff   2          other   things
1           sth     stuff   3          more    info
2           other   things  3          more    info

What is the most efficient way to do this for a large dataset?

pandini
  • 69
  • 7
  • Why in the requested output is there a combination of `id = 1` with itself, but there isn't a combination of `id = 2 (or 3)` with itself? – Aryerez Oct 22 '19 at 08:54
  • Because combination of id=1 and id=2 is already done in the second row, so it's not necessary to combine id=2 and id=1 a second time. – pandini Oct 22 '19 at 10:10
  • I asked why in the first row of the requested output you combined `id = 1` with `id = 1`, but you do not combine `id = 2` with `id = 2`. Anyhow, in the last line of the code in my answer, it filters existing combinations. – Aryerez Oct 22 '19 at 10:13

1 Answers1

0

You can create a merging index with:

df['temp'] = 1

And then merge the dataframe to itself with:

merged_df = df.merge(df, on='temp', suffixes=('A', 'B')).drop('temp', axis=1)

If you don't want the combinations of the same id_number, do finally:

merged_df = merged_df[merged_df['id_numberA'] != merged_df['id_numberB']]

And if you don't want duplicated mixes of id_numberA and id_numberB, do finally instead:

merged_df = merged_df[merged_df['id_numberA'] < merged_df['id_numberB']]
Aryerez
  • 3,417
  • 2
  • 9
  • 17