Duplicating rows in pandas Python

Question

I hope you are doing well. I have the following output:

ClassName   Bugs   HighBugs  LowBugs  NormalBugs  WMC   LOC

 Class1      4        0        1         3        34     77 
 Class2      0        0        0         0        9      45
 Class3      3        0        1         2        10     18
 Class4      0        0        0         0        44     46
 Class5      6        2        2         2        78     94

The result I want is as follows:

ClassName   Bugs   HighBugs  LowBugs  NormalBugs  WMC   LOC

 Class1      1        0        0         1        34     77
 Class1      1        0        0         1        34     77
 Class1      1        0        0         1        34     77
 Class1      1        0        1         0        34     77
 Class2      0        0        0         0        9      45
 Class3      1        0        0         1        10     18
 Class3      1        0        0         1        10     18
 Class3      1        0        1         0        10     18
 Class4      0        0        0         0        44     46
 Class5      1        0        0         1        78     94
 Class5      1        0        0         1        78     94
 Class5      1        0        1         0        78     94
 Class5      1        0        1         0        78     94
 Class5      1        1        0         0        78     94
 Class5      1        1        0         0        78     94

A brief explanation: I want to duplicate each class according to its Bugs column, where Bugs = HighBugs + LowBugs + NormalBugs. As the desired result shows, the duplicated rows should contain only ones and zeros: one row per bug, with a 1 in Bugs and in the matching severity column. Classes with no bugs keep a single all-zero row.

Thank you in advance, and have a good day.

Henry Ecker · Answer 1 · 2021-09-05T15:18:53.103

We can find the maximum value in each row using DataFrame.max with axis=1, then use Index.repeat to repeat each row that many times (the count is clipped to at least 1 so that classes with no bugs keep one row). Lastly, we can number the rows within each group using groupby cumcount and use DataFrame.gt to flag the cells whose value is greater than that row number:

cols = df.columns[df.columns.str.endswith('Bugs')]
df = df.loc[
    df.index.repeat(df[cols].max(axis=1).clip(lower=1))
].reset_index(drop=True)
df[cols] = df[cols].gt(df.groupby('ClassName').cumcount(), axis=0).astype(int)

df:

   ClassName  Bugs  HighBugs  LowBugs  NormalBugs
0     Class1     1         0        1           1
1     Class1     1         0        0           1
2     Class1     1         0        0           1
3     Class1     1         0        0           0
4     Class2     0         0        0           0
5     Class3     1         0        1           1
6     Class3     1         0        0           1
7     Class3     1         0        0           0
8     Class4     0         0        0           0
9     Class5     1         1        1           1
10    Class5     1         1        1           1
11    Class5     1         0        0           0
12    Class5     1         0        0           0
13    Class5     1         0        0           0
14    Class5     1         0        0           0

Setup:

import pandas as pd

df = pd.DataFrame({
    'ClassName': {0: 'Class1', 1: 'Class2', 2: 'Class3', 3: 'Class4',
                  4: 'Class5'},
    'Bugs': {0: 4, 1: 0, 2: 3, 3: 0, 4: 6},
    'HighBugs': {0: 0, 1: 0, 2: 0, 3: 0, 4: 2},
    'LowBugs': {0: 1, 1: 0, 2: 1, 3: 0, 4: 2},
    'NormalBugs': {0: 3, 1: 0, 2: 2, 3: 0, 4: 2}
})

Column filter:

cols = df.columns[df.columns.str.endswith('Bugs')]

Index(['Bugs', 'HighBugs', 'LowBugs', 'NormalBugs'], dtype='object')

Max value per row (to repeat):

df[cols].max(axis=1).clip(lower=1)

0    4
1    1
2    3
3    1
4    6
dtype: int64
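
The clip(lower=1) matters because a repeat count of 0 drops the row entirely. A minimal sketch on a made-up two-row frame (the name `toy` and its values are illustrative assumptions, not part of the original data):

```python
import pandas as pd

# Hypothetical toy data: class "B" has no bugs.
toy = pd.DataFrame({"ClassName": ["A", "B"], "Bugs": [2, 0]})

# Without clipping, a repeat count of 0 removes class "B" from the result.
dropped = toy.loc[toy.index.repeat(toy["Bugs"])]

# Clipping the count to at least 1 keeps zero-bug classes as a single row.
kept = toy.loc[toy.index.repeat(toy["Bugs"].clip(lower=1))]

print(dropped["ClassName"].tolist())  # ['A', 'A']
print(kept["ClassName"].tolist())     # ['A', 'A', 'B']
```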

Scaled DataFrame:

df = df.loc[
    df.index.repeat(df[cols].max(axis=1).clip(lower=1))
].reset_index(drop=True)

   ClassName  Bugs  HighBugs  LowBugs  NormalBugs
0     Class1     4         0        1           3
1     Class1     4         0        1           3
2     Class1     4         0        1           3
3     Class1     4         0        1           3
4     Class2     0         0        0           0
5     Class3     3         0        1           2
6     Class3     3         0        1           2
7     Class3     3         0        1           2
8     Class4     0         0        0           0
9     Class5     6         2        2           2
10    Class5     6         2        2           2
11    Class5     6         2        2           2
12    Class5     6         2        2           2
13    Class5     6         2        2           2
14    Class5     6         2        2           2

Row number within each group:

df.groupby('ClassName').cumcount()

0     0
1     1
2     2
3     3
4     0
5     0
6     1
7     2
8     0
9     0
10    1
11    2
12    3
13    4
14    5
dtype: int64

Comparison to convert numbers to binary:

df[cols].gt(df.groupby('ClassName').cumcount(), axis=0)

     Bugs  HighBugs  LowBugs  NormalBugs
0    True     False     True        True
1    True     False    False        True
2    True     False    False        True
3    True     False    False       False
4   False     False    False       False
5    True     False     True        True
6    True     False    False        True
7    True     False    False       False
8   False     False    False       False
9    True      True     True        True
10   True      True     True        True
11   True     False    False       False
12   True     False    False       False
13   True     False    False       False
14   True     False    False       False

Thank you for your answer. What if my column names do not end with "Bugs"? — Miraiinik, Sep 05 '21 at 19:47
Just change `cols` -> `cols = ['col1', 'col2', 'col3', ... etc]` — Henry Ecker, Sep 05 '21 at 19:48
Could you please take a look at my edited question above and see how this can be done? Thank you in advance. — Miraiinik, Sep 05 '21 at 20:28
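
For the explicit column list suggested in the comments, here is a minimal sketch of the same two steps with column names that do not end in "Bugs". The names `Total`, `Critical`, and `Minor` and the sample values are hypothetical, chosen only for illustration:

```python
import pandas as pd

# Hypothetical frame whose count columns do not share a common suffix.
df = pd.DataFrame({
    "ClassName": ["Class1", "Class2"],
    "Total": [3, 0],
    "Critical": [1, 0],
    "Minor": [2, 0],
    "LOC": [77, 45],
})

# List the count columns explicitly instead of filtering by name.
cols = ["Total", "Critical", "Minor"]

# Repeat each row by its largest count (at least once).
out = df.loc[
    df.index.repeat(df[cols].max(axis=1).clip(lower=1))
].reset_index(drop=True)

# Flag cells whose count exceeds the row's position within its class.
out[cols] = out[cols].gt(out.groupby("ClassName").cumcount(), axis=0).astype(int)

print(out)
```

Columns that are not listed in `cols`, such as `LOC` here, are simply carried along with each repeated row.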

Andrej Kesely · Accepted Answer · 2021-09-05T20:28:32.610

Try:

dfs, col_names, other_cols = (
    [],
    ["NormalBugs", "LowBugs", "HighBugs"],
    ["ClassName", "WMC", "LOC"],
)
for _, row in df.iterrows():
    if row["Bugs"] == 0:
        dfs.append(
            pd.DataFrame(
                [[0, 0, 0, *[row[c] for c in other_cols]]],
                columns=col_names + other_cols,
            )
        )

    else:
        for c in col_names:
            dfs.append(pd.DataFrame([1] * row[c], columns=[c]))
            for oc in other_cols:
                dfs[-1][oc] = row[oc]


df_out = pd.concat(dfs).fillna(0)
df_out[col_names] = df_out[col_names].astype(int)
df_out["Bugs"] = df_out[col_names].any(axis=1).astype(int)
print(
    df_out[
        ["ClassName", "Bugs", "HighBugs", "LowBugs", "NormalBugs", "WMC", "LOC"]
    ]
)

Prints:

  ClassName  Bugs  HighBugs  LowBugs  NormalBugs  WMC  LOC
0    Class1     1         0        0           1   34   77
1    Class1     1         0        0           1   34   77
2    Class1     1         0        0           1   34   77
0    Class1     1         0        1           0   34   77
0    Class2     0         0        0           0    9   45
0    Class3     1         0        0           1   10   18
1    Class3     1         0        0           1   10   18
0    Class3     1         0        1           0   10   18
0    Class4     0         0        0           0   44   46
0    Class5     1         0        0           1   78   94
1    Class5     1         0        0           1   78   94
0    Class5     1         0        1           0   78   94
1    Class5     1         0        1           0   78   94
0    Class5     1         1        0           0   78   94
1    Class5     1         1        0           0   78   94

EDIT: Added more columns.
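
If the row-by-row loop becomes slow on a large frame, the same result can probably be built without iterrows. The following is a NumPy-based sketch, not the method used in this answer; it assumes the question's column names and adds a placeholder slot (a made-up helper, not a real column) so that classes with no bugs still get one all-zero row:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ClassName": ["Class1", "Class2", "Class3", "Class4", "Class5"],
    "Bugs": [4, 0, 3, 0, 6],
    "HighBugs": [0, 0, 0, 0, 2],
    "LowBugs": [1, 0, 1, 0, 2],
    "NormalBugs": [3, 0, 2, 0, 2],
    "WMC": [34, 9, 10, 44, 78],
    "LOC": [77, 45, 18, 46, 94],
})
col_names = ["NormalBugs", "LowBugs", "HighBugs"]
other_cols = ["ClassName", "WMC", "LOC"]

counts = df[col_names].to_numpy()
# Placeholder slot: exactly one output row for classes with no bugs at all.
none = (counts.sum(axis=1) == 0).astype(int)[:, None]
counts = np.hstack([counts, none])               # shape (n_rows, n_slots)

n_rows, n_slots = counts.shape
row_idx = np.repeat(np.arange(n_rows), n_slots)  # source row of each cell
slot_idx = np.tile(np.arange(n_slots), n_rows)   # severity slot of each cell
flat = counts.ravel()

# Emit each (row, slot) cell as many times as its count.
rows = np.repeat(row_idx, flat)
slots = np.repeat(slot_idx, flat)

# One-hot encode the slot, then drop the placeholder column.
onehot = np.eye(n_slots, dtype=int)[slots][:, :-1]

out = df.iloc[rows][other_cols].reset_index(drop=True)
for j, c in enumerate(col_names):
    out[c] = onehot[:, j]
out["Bugs"] = onehot.any(axis=1).astype(int)
out = out[["ClassName", "Bugs", "HighBugs", "LowBugs", "NormalBugs", "WMC", "LOC"]]
print(out)
```

On the sample data this should give the same 15 rows as the loop above, in the same order, but with a fresh 0..14 index.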

Thank you for your answer, it worked, but I still have other columns (metrics) that I want to keep, not just the bug columns (it is not a problem if they are duplicated as well). I edited my question above so you can see what I mean. Thank you. — Miraiinik, Sep 05 '21 at 20:15
