36

I have the following df:

code . role    . persons
123 .  Janitor . 3
123 .  Analyst . 2
321 .  Vallet  . 2
321 .  Auditor . 5

The first line means that I have 3 persons with the role Janitors. My problem is that I would need to have one line for each person. My df should look like this:

df:

code . role    . persons
123 .  Janitor . 3
123 .  Janitor . 3
123 .  Janitor . 3
123 .  Analyst . 2
123 .  Analyst . 2
321 .  Vallet  . 2
321 .  Vallet  . 2
321 .  Auditor . 5
321 .  Auditor . 5
321 .  Auditor . 5
321 .  Auditor . 5
321 .  Auditor . 5

How could I do that using pandas?

cs95
  • 379,657
  • 97
  • 704
  • 746
aabujamra
  • 4,494
  • 13
  • 51
  • 101
  • Possible duplicate: [Duplicate row based on value in different column](https://stackoverflow.com/questions/32792263/duplicate-row-based-on-value-in-different-column) – Sam Firke Jun 30 '23 at 17:00

4 Answers4

64

reindex+ repeat

df.reindex(df.index.repeat(df.persons))
Out[951]: 
   code  .     role ..1  persons
0   123  .  Janitor   .        3
0   123  .  Janitor   .        3
0   123  .  Janitor   .        3
1   123  .  Analyst   .        2
1   123  .  Analyst   .        2
2   321  .   Vallet   .        2
2   321  .   Vallet   .        2
3   321  .  Auditor   .        5
3   321  .  Auditor   .        5
3   321  .  Auditor   .        5
3   321  .  Auditor   .        5
3   321  .  Auditor   .        5

PS: you can add.reset_index(drop=True) to get the new index

BENY
  • 317,841
  • 20
  • 164
  • 234
  • Wonderful, I knew there was a good solution with repeat, but this nailed it. – cs95 Nov 16 '17 at 18:29
  • 3
    Yep, this was nice. Maybe a final reset_index() too? – Anton vBR Nov 16 '17 at 18:29
  • 1
    @Wen I love learning new stuff! You would believe, that i didn't no you could reuse index values in reindex. I have alway used reindex to shuffle or add indexes but never to duplicate has you have done here. Beautiful. Nice one. +1 – Scott Boston Nov 16 '17 at 18:54
  • @ScottBoston thanks Man :-) SO is good place push us learning from each others (I learn it from coldspeed long time ago :-) ) – BENY Nov 16 '17 at 18:59
15

Wen's solution is really nice and intuitive, however it will fail for duplicate rows by throwing ValueError: cannot reindex from a duplicate axis.

Here's an alternative which avoids this by calling repeat on df.values.

df

   code     role  persons
0   123  Janitor        3
1   123  Analyst        2
2   321   Vallet        2
3   321  Auditor        5


pd.DataFrame(df.values.repeat(df.persons, axis=0), columns=df.columns)

   code     role persons
0   123  Janitor       3
1   123  Janitor       3
2   123  Janitor       3
3   123  Analyst       2
4   123  Analyst       2
5   321   Vallet       2
6   321   Vallet       2
7   321  Auditor       5
8   321  Auditor       5
9   321  Auditor       5
10  321  Auditor       5
11  321  Auditor       5
cs95
  • 379,657
  • 97
  • 704
  • 746
  • when it comes to performance, what is it better `.reindex()` or `.values.repeat()`? – lmiguelvargasf Jun 18 '18 at 15:29
  • 1
    @lmiguelvargasf This solution is faster. But Wen's solution requires fewer characters, plus I was nice enough to leave a nice comment under his answer which spurred all the extra upvotes. – cs95 Jun 18 '18 at 15:30
  • 1
    the only problem I saw with your solution is that the `dtypes` are changed to `object` for every column in the dataframe. – lmiguelvargasf Jun 18 '18 at 17:15
  • 1
    This works when there are duplicate rows as opposed to BENY's solution which throws a `ValueError` (`ValueError: cannot reindex from a duplicate axis`) – There May 24 '22 at 18:34
4

Not enough reputation to comment, but building on @cs95's answer and @lmiguelvargasf's comment, one can preserve dtypes with:

pd.DataFrame(
    df.values.repeat(df.persons, axis=0),
    columns=df.columns,
).astype(df.dtypes)
SultanOrazbayev
  • 14,900
  • 3
  • 16
  • 46
1

You can apply the Series method repeat:

df = pd.DataFrame({'col1': [2, 3],
                   'col2': ['a', 'b'],
                   'col3': [20, 30]})

df.apply(lambda x: x.repeat(df['col1']))
# df.apply(pd.Series.repeat, repeats=df['col1'])

or the numpy function repeat:

df.apply(np.repeat, repeats=df['col1'])

Output:

   col1 col2  col3
0     2    a    20
0     2    a    20
1     3    b    30
1     3    b    30
1     3    b    30
Mykola Zotko
  • 15,583
  • 3
  • 71
  • 73