
I have a pandas dataframe with data in 3 columns that look like this:

Column 1    Column 2       Column 3
m26:1_8.6                  m26:1_8.6
FA8:0       329448.0313    FA18:1
FA18:1      4048785.5      FA12:0
FA10:0      2232634.75     FA10:0
FA12:0      2541154.25     FA8:0
m26:0_8.8                  m26:0_8.8
FA8:0       2084220.75     FA10:0
FA10:0      5801817        FA12:0
FA2:0       158898.5625    FA8:0
FA12:0      3270664.25     FA2:0

Each "block" of data is separated by a row containing 'm', or a blank in the second column. The values in column 2 are "tagged" to column 1. I need to rearrange columns 1 and 2 according to the order of column 3, i.e., column 1 equals column 3 and the corresponding data in column 2 is also rearranged with the same positions as column 1.

The desired output is as follows:

Column 1    Column 2       Column 3
m26:1_8.6                  m26:1_8.6
FA18:1      4048785.5      FA18:1
FA12:0      2541154.25     FA12:0
FA10:0      2232634.75     FA10:0
FA8:0       329448.0313    FA8:0
m26:0_8.8                  m26:0_8.8
FA10:0      5801817        FA10:0
FA12:0      3270664.25     FA12:0
FA8:0       2084220.75     FA8:0
FA2:0       158898.5625    FA2:0
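
For reference, a minimal sketch of the sample data as a DataFrame (assuming the blank Column 2 cells are read in as NaN):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Column 1': ['m26:1_8.6', 'FA8:0', 'FA18:1', 'FA10:0', 'FA12:0',
                 'm26:0_8.8', 'FA8:0', 'FA10:0', 'FA2:0', 'FA12:0'],
    'Column 2': [np.nan, 329448.0313, 4048785.5, 2232634.75, 2541154.25,
                 np.nan, 2084220.75, 5801817, 158898.5625, 3270664.25],
    'Column 3': ['m26:1_8.6', 'FA18:1', 'FA12:0', 'FA10:0', 'FA8:0',
                 'm26:0_8.8', 'FA10:0', 'FA12:0', 'FA8:0', 'FA2:0'],
})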

I have tried several approaches; this was my latest attempt:

# Loop through the original dataframe row by row
for i, row in df.iterrows():
    # Check if it is a header row
    if pd.isna(row['Column 2']):
        # Add the header row to the new dataframe
        new_df = new_df.append(row)
    else:
        # Extract the values from columns 1, 2, and 3
        label = row['Column 1']
        value = row['Column 2']
        position = row['Column 3']
        # Find the index of the label in column 3
        index = new_df.index[new_df['Column 1'] == position][0]
        # Insert a new row in the new dataframe with the values from the original row
        new_row = pd.DataFrame({
            'Column 3': [position],
            'Column 1': [label],
            'Column 2': [value]
        }, index=[index])
        new_df = pd.concat([new_df.iloc[:index], new_row, new_df.iloc[index:]])

print(new_df)

However, it gave me the error IndexError: index 0 is out of bounds for axis 0 with size 0

Could someone help me resolve this? Thank you!

Megan Lee

2 Answers


For each group, assuming Column 1 values are unique:

import numpy as np

def sort_group(df):
    # For each value of Column 3, find the row position of the matching
    # value in Column 1, then reorder the block accordingly
    x = df['Column 1'].values
    y = df['Column 3'].values
    xsorted = np.argsort(x)
    ypos = np.searchsorted(x[xsorted], y)
    indices = xsorted[ypos]
    return df.drop(columns='Column 3').iloc[indices]

# https://stackoverflow.com/a/8251668/15239951
# Group rows into blocks (each block starts with an 'm…' row) and reorder each one
out = (df.groupby(df['Column 1'].str.startswith('m').cumsum())
         .apply(sort_group).droplevel(0))

Output:

>>> out
    Column 1      Column 2
0  m26:1_8.6           NaN
2     FA18:1  4.048786e+06
4     FA12:0  2.541154e+06
3     FA10:0  2.232635e+06
1      FA8:0  3.294480e+05
5  m26:0_8.8           NaN
7     FA10:0  5.801817e+06
9     FA12:0  3.270664e+06
6      FA8:0  2.084221e+06
8      FA2:0  1.588986e+05
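
The argsort/searchsorted combination (from the linked answer) looks up, for each Column 3 value, the row position of the matching Column 1 value. A standalone sketch of that lookup, using hypothetical arrays standing in for a single block:

import numpy as np

x = np.array(['FA8:0', 'FA18:1', 'FA10:0', 'FA12:0'])   # current order (Column 1)
y = np.array(['FA18:1', 'FA12:0', 'FA10:0', 'FA8:0'])   # target order (Column 3)

xsorted = np.argsort(x)                # positions that sort x
ypos = np.searchsorted(x[xsorted], y)  # where each y value falls in sorted x
indices = xsorted[ypos]                # row positions of y's values within x

print(indices)     # [1 3 2 0]
print(x[indices])  # ['FA18:1' 'FA12:0' 'FA10:0' 'FA8:0']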
Corralien

Another option is to use a self-merge, matching Column 1 to Column 3 within each block:

# Label each block: a new block starts on an 'm…' row / blank Column 2
grp = ((df["Column 1"].str.startswith("m")) | (df["Column 2"].isnull())).cumsum()

# Self-merge within each block: match Column 1 (left) against Column 3 (right)
out = (df.merge(df.drop(columns=["Column 1", "Column 2"]),
                left_on=[grp, "Column 1"], right_on=[grp, "Column 3"],
                suffixes=("_", ""))[df.columns]
      )

Output:

print(out)

    Column 1   Column 2   Column 3
0  m26:1_8.6        NaN  m26:1_8.6
1      FA8:0  329448.03      FA8:0
2     FA18:1 4048785.50     FA18:1
3     FA10:0 2232634.75     FA10:0
4     FA12:0 2541154.25     FA12:0
5  m26:0_8.8        NaN  m26:0_8.8
6      FA8:0 2084220.75      FA8:0
7     FA10:0 5801817.00     FA10:0
8      FA2:0  158898.56      FA2:0
9     FA12:0 3270664.25     FA12:0
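
For reference, the grp key is just a running block counter, so the merge can only match rows that belong to the same block. On the sample data above it would come out as:

grp = ((df["Column 1"].str.startswith("m")) | (df["Column 2"].isnull())).cumsum()
print(grp.tolist())   # [1, 1, 1, 1, 1, 2, 2, 2, 2, 2] -> one label per block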
Timeless