
I am trying to transform a DataFrame such that some of the rows are replicated a given number of times. For example:

df = pd.DataFrame({'class': ['A', 'B', 'C'], 'count':[1,0,2]})

  class  count
0     A      1
1     B      0
2     C      2

should be transformed to:

  class 
0     A   
1     C   
2     C 

This is the reverse of aggregation with the count function. Is there an easy way to achieve this in pandas (without using for loops or list comprehensions)?

One possibility might be to allow the DataFrame.applymap function to return multiple rows (akin to the apply method of GroupBy). However, I do not think that is currently possible in pandas.

btel

  • I also have in mind a general function that would allow returning multiple, one, or zero rows depending on the values in the `count` column. – btel Oct 24 '12 at 13:29
  • If you are coming to this question in 2017+, check my answer for a more efficient and straightforward solution. – Ted Petrou Aug 29 '17 at 19:17

4 Answers


You could use groupby:

def f(group):
    # Replicate the group's single 'class' value 'count' times.
    row = group.iloc[0]
    return pd.DataFrame({'class': [row['class']] * row['count']})

df.groupby('class', group_keys=False).apply(f)

so you get

In [25]: df.groupby('class', group_keys=False).apply(f)
Out[25]: 
  class
0     A
0     C
1     C

You can fix the index of the result however you like.
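
For example, a minimal sketch (assuming you want a fresh 0-based index like the one shown in the question) chains reset_index onto the result:

result = df.groupby('class', group_keys=False).apply(f)
result = result.reset_index(drop=True)  # discard the duplicated group indices and renumber from 0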

Wes McKinney
  • Good answer! If I have dozens of other columns, is there an easy way to preserve those columns in the result of `f` other than typing them all out explicitly? – Kyle Heuton Mar 14 '13 at 07:25

There is an even simpler and significantly more efficient solution. I had to make a similar modification to a table of about 3.5M rows, and the previously suggested solutions were extremely slow.

A better way is to use numpy's repeat function to generate a new index in which each row's index is repeated according to its given count, and then use iloc to select rows of the original table with this index:

import pandas as pd
import numpy as np

df = pd.DataFrame({'class': ['A', 'B', 'C'], 'count': [1, 0, 2]})
spread_ixs = np.repeat(range(len(df)), df['count'])
spread_ixs 

array([0, 2, 2])

df.iloc[spread_ixs, :].drop(columns='count').reset_index(drop=True)

  class
0     A
1     C
2     C
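
A closely related sketch uses pandas' own Index.repeat, which is equivalent under the assumption that df still carries its default RangeIndex:

df.loc[df.index.repeat(df['count'])].drop(columns='count').reset_index(drop=True)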
Shay

I know this is an old question, but I was having trouble getting Wes' answer to work for multiple columns in the dataframe, so I made his code a bit more generic. Thought I'd share in case anyone else stumbles on this question with the same problem.

You basically just specify which column holds the counts, and you get an expanded dataframe in return.

import pandas as pd

df = pd.DataFrame({'class 1': ['A', 'B', 'C', 'A'],
                   'class 2': [ 1,  2,  3,  1],
                   'count':   [ 3,  3,  3,  1]})
print(df, "\n")

def f(group, WeightsColumnName):
    # Replicate every column of the group's first row according to
    # the value in the weights/count column.
    row = group.iloc[0]
    return pd.DataFrame({col: [value] * row[WeightsColumnName]
                         for col, value in row.items()})

def ExpandRows(df, WeightsColumnName):
    return (df.groupby(df.columns.tolist(), group_keys=False)
              .apply(f, WeightsColumnName)
              .reset_index(drop=True))

df_expanded = ExpandRows(df, 'count')
print(df_expanded)

Returns:

  class 1  class 2  count
0       A        1      3
1       B        2      3
2       C        3      3
3       A        1      1 

  class 1  class 2  count
0       A        1      1
1       A        1      3
2       A        1      3
3       A        1      3
4       B        2      3
5       B        2      3
6       B        2      3
7       C        3      3
8       C        3      3
9       C        3      3

With regard to speed, my base df is 10 columns by ~6k rows, and the expanded result is ~100,000 rows; the expansion takes ~7 seconds. I'm not sure grouping is necessary or wise in this case, since it groups on every column, but at ~7 seconds that's acceptable.

Radical Edward

This question is very old, but the existing answers do not reflect pandas' modern capabilities. You can use iterrows to loop over every row, use the DataFrame constructor to create a new DataFrame with the correct number of rows for each, and finally use pd.concat to concatenate the rows together.

pd.concat([pd.DataFrame([row] * row['count'])
           for _, row in df.iterrows()], ignore_index=True)

  class  count
0     A      1
1     C      2
2     C      2

This has the benefit of working with any size DataFrame.
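
If you want exactly the output shown in the question (the class column only), a small follow-up sketch drops the helper column afterwards:

expanded = pd.concat([pd.DataFrame([row] * row['count'])
                      for _, row in df.iterrows()], ignore_index=True)
expanded = expanded.drop(columns='count')  # keep only the replicated class labels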

Ted Petrou
  • Is this more performant than Wes' groupby & apply answer? Is there going to be excessive memory use from all the intermediate dataframes created? I suppose using a list comprehension across dataframes is not the same as doing it across rows, although the question did ask for a solution without list comprehensions. I've read some articles about the apply method being faster than iterrows, and some about apply doing deep copies and blowing out memory usage; I'd be interested in your thoughts in the context of your answer. – Davos Mar 22 '18 at 01:04
  • Isn't iterrows terribly slow? – Union find Jun 16 '21 at 15:55
  • This is terribly slow if you have hundreds of thousands of lines to iterate on. – schmat_90 May 11 '22 at 16:13