-1

I have a distance matrix involving 500 influenza sequences. I want to convert this into a columnar format, with 250,000 pairwise comparisons. Is there a function that will allow me to do this quickly?

Below is the data set I am working with. The index is the "Accession" column, and I am representing the data as a Pandas DataFrame.

    CY135678    CY142013    CY130339    CY130379    CY130460    CY135850    CY135930    CY143958    CY142889    CY141341    CY143073    CY142145    CY142817    CY142417    CY142985    CY136196    CY130412    CY135744    CY135326    CY135502
Accession                                                                               
CY135678     1.000000    0.959670    0.937148    0.932813    0.972692    0.951452    0.996966    0.998266    0.953619    0.993498    0.920628    0.956635    0.921936    0.956030    0.902904    0.968791    0.998700    0.952319    0.917642    0.922440
CY142013     0.959670    1.000000    0.939289    0.936253    0.963573    0.973981    0.956635    0.957936    0.974848    0.954033    0.923245    0.976149    0.924117    0.975620    0.913270    0.960104    0.958369    0.974848    0.923244    0.925926
CY130339     0.937148    0.939289    1.000000    0.975389    0.942783    0.938256    0.934114    0.935847    0.940415    0.935233    0.930222    0.939722    0.930659    0.939051    0.917098    0.938882    0.935847    0.939119    0.927612    0.927233
CY130379     0.932813    0.936253    0.975389    1.000000    0.935847    0.936960    0.929779    0.931946    0.939119    0.931347    0.923681    0.935820    0.924553    0.935133    0.915371    0.932813    0.931513    0.938687    0.925444    0.920697
CY130460     0.972692    0.963573    0.942783    0.935847    1.000000    0.955787    0.969658    0.970958    0.957087    0.966623    0.921936    0.961839    0.922809    0.961254    0.907239    0.991764    0.971391    0.957087    0.917642    0.920697
CY135850     0.951452    0.973981    0.938256    0.936960    0.955787    1.000000    0.947984    0.949718    0.993092    0.946891    0.922372    0.973114    0.923245    0.972573    0.909758    0.953619    0.950152    0.996546    0.916775    0.925054
CY135930     0.996966    0.956635    0.934114    0.929779    0.969658    0.947984    1.000000    0.996099    0.950152    0.991331    0.919320    0.953599    0.920628    0.952982    0.900737    0.965756    0.996532    0.948851    0.914608    0.919390
CY143958     0.998266    0.957936    0.935847    0.931946    0.970958    0.949718    0.996099    1.000000    0.951886    0.992631    0.919756    0.954900    0.921064    0.954288    0.901170    0.967057    0.997833    0.950585    0.916775    0.921569
CY142889     0.953619    0.974848    0.940415    0.939119    0.957087    0.993092    0.950152    0.951886    1.000000    0.949050    0.922372    0.973981    0.923245    0.973444    0.912349    0.954920    0.952319    0.993092    0.918075    0.925490
CY141341     0.993498    0.954033    0.935233    0.931347    0.966623    0.946891    0.991331    0.992631    0.949050    1.000000    0.919756    0.951431    0.921064    0.950805    0.896805    0.963589    0.993065    0.947755    0.915908    0.925054
CY143073     0.920628    0.923245    0.930222    0.923681    0.921936    0.922372    0.919320    0.919756    0.922372    0.919756    1.000000    0.921500    0.999128    0.917139    0.908853    0.917139    0.920192    0.923245    0.942433    0.938945
CY142145     0.956635    0.976149    0.939722    0.935820    0.961839    0.973114    0.953599    0.954900    0.973981    0.951431    0.921500    1.000000    0.921936    0.999565    0.911969    0.957936    0.955334    0.973981    0.918040    0.923747
CY142817     0.921936    0.924117    0.930659    0.924553    0.922809    0.923245    0.920628    0.921064    0.923245    0.921064    0.999128    0.921936    1.000000    0.917575    0.909725    0.918011    0.921500    0.924117    0.942870    0.939817
CY142417     0.956030    0.975620    0.939051    0.935133    0.961254    0.972573    0.952982    0.954288    0.973444    0.950805    0.917139    0.999565    0.917575    1.000000    0.911189    0.957336    0.954724    0.973444    0.917283    0.923312
CY142985     0.902904    0.913270    0.917098    0.915371    0.907239    0.909758    0.900737    0.901170    0.912349    0.896805    0.908853    0.911969    0.909725    0.911189    1.000000    0.902904    0.901170    0.911917    0.900737    0.905011
CY136196     0.968791    0.960104    0.938882    0.932813    0.991764    0.953619    0.965756    0.967057    0.954920    0.963589    0.917139    0.957936    0.918011    0.957336    0.902904    1.000000    0.967490    0.954920    0.913741    0.916340
CY130412     0.998700    0.958369    0.935847    0.931513    0.971391    0.950152    0.996532    0.997833    0.952319    0.993065    0.920192    0.955334    0.921500    0.954724    0.901170    0.967490    1.000000    0.951019    0.916342    0.921133
CY135744     0.952319    0.974848    0.939119    0.938687    0.957087    0.996546    0.948851    0.950585    0.993092    0.947755    0.923245    0.973981    0.924117    0.973444    0.911917    0.954920    0.951019    1.000000    0.918075    0.925926
CY135326     0.917642    0.923244    0.927612    0.925444    0.917642    0.916775    0.914608    0.916775    0.918075    0.915908    0.942433    0.918040    0.942870    0.917283    0.900737    0.913741    0.916342    0.918075    1.000000    0.949455
CY135502     0.922440    0.925926    0.927233    0.920697    0.920697    0.925054    0.919390    0.921569    0.925490    0.925054    0.938945    0.923747    0.939817    0.923312    0.905011    0.916340    0.921133    0.925926    0.949455    1.000000

The output I get after applying affmat.unstack() looks as follows:

       Accession
CY135678  CY135678     0.939085
          CY142013     0.959670
          CY130339     0.937148
          CY130379     0.932813
          CY130460     0.972692
          CY135850     0.951452
          CY135930     0.996966
          CY143958     0.998266
          CY142889     0.953619
          CY141341     0.993498
          CY143073     0.920628
          CY142145     0.956635
          CY142817     0.921936
          CY142417     0.956030
          CY142985     0.902904
...
CY135502  CY135850     0.925054
          CY135930     0.919390
          CY143958     0.921569
          CY142889     0.925490
          CY141341     0.925054
          CY143073     0.938945
          CY142145     0.923747
          CY142817     0.939817
          CY142417     0.923312
          CY142985     0.905011
          CY136196     0.916340
          CY130412     0.921133
          CY135744     0.925926
          CY135326     0.949455
          CY135502     0.939085
Length: 400, dtype: float64

As one can see from the output, CY135678 was supposed to have an identity of 1.000000 with itself, but became 0.939085 after applying affmat.unstack(). Is there an explanation for this behavior? Is there any way I can get the original values stacked properly?

Brian Tompsett - 汤莱恩
  • 5,753
  • 72
  • 57
  • 129
ericmjl
  • 13,541
  • 12
  • 51
  • 80
  • Show us a short sample of your data. – Dan Allan Jun 25 '13 at 20:25
  • Just did that, after finding out that the ".unstack()" function didn't work as I had expected. – ericmjl Jun 25 '13 at 22:39
  • 1
    Note that you get the exact same value, 0.939085, for the 'identity' entry in the second case too, for 'CY135502.' I suspect there's something wonky about your input data's alignment before using unstack. – ely Jun 25 '13 at 22:45
  • That's interesting. As I described below, I did the `affmat.stack()`, which was, I think, the correct function that was needed, but only after re-indexing the data properly first. I don't have an explanation for why this behavior happens, but I'm glad it's over now. – ericmjl Jun 26 '13 at 20:58

2 Answers2

1

Maybe you are looking for unstack:

In [29]: df
Out[29]: 
      a         b         c         d         e
a  0.453367  0.000969  0.199400  0.515258  0.610870
b  0.949461  0.002380  2.993674  1.357350  0.189058
c  0.117990  1.397985  0.093681  0.417855  0.686190
d  0.757732  4.975183  3.108177  0.019095  1.613240
e  0.022297  0.518517  0.006883  0.896779  0.485518

In [30]: df.unstack()
Out[30]: 
a  a    0.453367
   b    0.949461    
   c    0.117990
   d    0.757732
   e    0.022297
b  a    0.000969
   b    0.002380
   c    1.397985
   d    4.975183
   e    0.518517
c  a    0.199400
   b    2.993674
   c    0.093681
   d    3.108177
   e    0.006883
d  a    0.515258
   b    1.357350
   c    0.417855
   d    0.019095
   e    0.896779
e  a    0.610870
   b    0.189058
   c    0.686190
   d    1.613240
   e    0.485518

That shows each pairwise comparison twice. To pare it down, consider np.triu_indicies_from, discussed in Joe Kington's comment on this answer.

Community
  • 1
  • 1
Dan Allan
  • 34,073
  • 6
  • 70
  • 63
  • Hmm... I'm running into a bit of trouble here. The matrix includes 1.00 identities (perfect matches) across the diagonal. However, when I apply the "unpack" function, all of those turn into a value 0.939085, for which I'm not sure how it's happening. In less than a few minutes, I will update the original question with the data set I'm working with. – ericmjl Jun 25 '13 at 22:18
  • I second EMS's comment on your question. The columns are probably not what you think they are. Check that you are parsing the data correctly by looking at individual columns. Consider the delimiter. – Dan Allan Jun 26 '13 at 02:08
  • These are DataFrames in Pandas, and the file format is a CSV file that I have created using Excel, hence the delimiter is correct. The individual columns are correct as well, as viewed in my iPython Notebook. The weird thing is the value 0.939085 doesn't show up anywhere in the original matrix. – ericmjl Jun 26 '13 at 02:15
0

Well now, somehow I managed to fix the problem by not setting the index to 'Accession' while importing the data. Instead, I called on .set_index() function only when I really needed the index to be 'Accession'. Hence, when I wanted to stack the data (as opposed to unstack it), what I ended up doing was doing:

affmat_stacked = affmat.set_index('Accession').stack()

That seemed to work things out nicely. Thanks again for all the hints, everybody!

ericmjl
  • 13,541
  • 12
  • 51
  • 80