1

I have a dataframe that contains information that is linked by an ID column. The rows are sequential with the odd rows containing a "start-point" and the even rows containing an "end" point. My goal is to collapse the data from these into a single row with columns for "start" and "end" following each other. The rows do have a "packet ID" that would link them if the sequential nature of the dataframe is not consistent.

example:

    df:
        0   1           2           3       4   5
    0   hs6 106956570   106956648   ID_A1   60  -
    1   hs1 153649721   153649769   ID_A1   60  -
    2   hs1 865130744   865130819   ID_A2   0   -
    3   hs7 21882206    21882237    ID_A2   0   -
    4   hs1 74230744    74230819    ID_A3   0   +
    5   hs8 92041314    92041508    ID_A3   0   + 

The resulting dataframe that I am trying to achieve is:

    new_df
        0   1           2           3   4           5
    0   hs6 106956570   106956648   hs1 153649721   153649769
    1   hs1 865130744   865130819   hs7 21882206    21882237
    2   hs1 74230744    74230819    hs8 92041314    92041508

with each row containing the information on both the start and the end-point.

I have tried to pass the IDs in to an array and use a for loop to pull the information out of the original dataframe into a new dataframe but this has not worked. I was looking at the melt documentation which would suggest that pd.melt(df, id_vars=[3], value_vars=[0,1,2]) may work but I cannot see how to get the corresponding row in to positions new_df[3,4,5].

I think that it may be something really simple that I am missing but any suggestions would be appreciated.

1 Answers1

0

You can try this:

df_out = df.set_index([df.index%2, df.index//2])[df.columns[:3]]\
           .unstack(0).sort_index(level=1, axis=1)
df_out.columns = np.arange(len(df_out.columns))
df_out

Output:

     0          1          2    3          4          5
0  hs6  106956570  106956648  hs1  153649721  153649769
1  hs1  865130744  865130819  hs7   21882206   21882237
2  hs1   74230744   74230819  hs8   92041314   92041508
Scott Boston
  • 147,308
  • 15
  • 139
  • 187
  • Thanks a lot for that, I will accept it as the correct answer as it definitely does what I need it to do. I have been trying to understand each bit and just wanted to check with you, it may also help any others that are facing a similar problem. `.unstack(0)` splits the df based on the `df[3]` index. `%` and `//` appear to modulus and floor divisions on the index. Do these remove the duplicate values? Just trying to increase my knowledge – Mat Forster Jul 12 '18 at 14:01
  • @MatForster Best way to understand what is going on is to run the statement in parts and see what is returned. First, the set_index, yes, I am using modulus and floordiv to define rows and columns based on the integers in the current dataframe index. Next, we use unstack to pivot the first level of the index to columns. – Scott Boston Jul 12 '18 at 14:25