Pandas Concat increases number of rows

Question

I'm concatenating two dataframes, placing one next to the other. But first I applied a transformation to the initial dataframe:

import pandas as pd
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
real_data = pd.DataFrame(scaler.fit_transform(df[real_columns]), columns=real_columns)

And then concatenate:

categorial_data  = pd.get_dummies(df[categor_columns], prefix_sep= '__')
train = pd.concat([real_data, categorial_data], axis=1, ignore_index=True)

I don't know why, but the number of rows increased:

print(df.shape, real_data.shape, categorial_data.shape, train.shape)
(1700645, 23) (1700645, 16) (1700645, 130) (1703915, 146)

What happened, and how can I fix the problem?

As you can see, the number of columns in train equals the sum of the columns in real_data and categorial_data.

related: https://stackoverflow.com/questions/32801806/pandas-concat-ignore-index-doesnt-work and https://stackoverflow.com/questions/50250228/is-there-a-way-to-horizontally-concatenate-dataframes-of-same-length-while-ignor — EdChum, May 16 '18 at 10:15

score 33 · Accepted Answer · answered Apr 17 '19 at 10:59

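The problem is that after several operations on a dataframe (filtering rows, for example), it keeps its original index labels instead of a fresh 0..n-1 range. pd.concat with axis=1 aligns rows on the index, so rows whose labels don't match are added rather than paired, and the row count grows. Calling df.reset_index(drop=True) before concatenating will solve your problem.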
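
A minimal sketch of the mechanism, using small made-up frames (the names `left` and `right` and their data are illustrative, not taken from the question). It shows two frames of equal length but different index labels growing when concatenated along axis=1, and the same frames pairing up by position once each index is reset. Note that with axis=1, ignore_index=True only renumbers the columns; it does not affect row alignment.

```python
import pandas as pd

# Hypothetical data: `left` keeps labels left over from earlier filtering,
# `right` has a fresh 0..n-1 index (as a frame built from a NumPy array would).
left = pd.DataFrame({"a": [10, 20, 30]}, index=[0, 2, 5])
right = pd.DataFrame({"b": [1.0, 2.0, 3.0]})  # index 0, 1, 2

# Rows are aligned on index labels; the result holds the union {0, 1, 2, 5}.
misaligned = pd.concat([left, right], axis=1)
print(misaligned.shape)  # (4, 2)

# Resetting the index on each input pairs rows by position instead.
aligned = pd.concat(
    [left.reset_index(drop=True), right.reset_index(drop=True)], axis=1
)
print(aligned.shape)  # (3, 2)
```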

saket ram


Ran into the same issue. To be precise: call df.reset_index() on the dataframes you want to concatenate, not on the resulting dataframe. – Nidupb Sep 29 '20 at 22:26

score 11 · Answer 2 · answered May 17 '22 at 16:58

Operations on a dataframe can change its dimensions without changing the index labels of the remaining rows, so we need to call reset_index on the dataframe.

For concatenation you can do it like this:

result_df = pd.concat([first_df.reset_index(drop=True), second_df.reset_index(drop=True)], axis=1)

score 4 · Answer 3 · answered May 16 '18 at 13:35

I solved the problem by using hstack:

train = pd.DataFrame(np.hstack([real_data,categorial_data]))

Rocketq

This way you lose all the dataframe information (e.g. column names, index) – Tonca Sep 06 '18 at 10:32
@Tonca How can I retain all the dataframe info? – NewInPython Nov 01 '19 at 19:01
In this case the columns are easy to keep because they remain the same as in the original dataframes. The problem comes with the index. If `concat` gives back a different number of rows (as explained in the question), it means that the indices of the DFs are not identical. I think it would be better to understand why they are different instead of forcing the concatenation. – Tonca Nov 04 '19 at 13:20
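
One possible way to keep the labels with the hstack approach, sketched on made-up data (the column names and index values below are hypothetical). It assumes the rows of both frames are already in the same order, and it reuses the index of categorial_data for the result. Because np.hstack produces a single array, all columns end up with one common dtype.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the frames in the question.
real_data = pd.DataFrame({"x": [0.0, 0.5, 1.0]})  # default index 0, 1, 2
categorial_data = pd.DataFrame(
    {"c__a": [1, 0, 1], "c__b": [0, 1, 0]}, index=[3, 7, 9]
)

# Stack the raw values by position, then reattach column names and an index.
train = pd.DataFrame(
    np.hstack([real_data.to_numpy(), categorial_data.to_numpy()]),
    columns=list(real_data.columns) + list(categorial_data.columns),
    index=categorial_data.index,
)
print(train.shape)  # (3, 3)
```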

score 1 · Answer 4 · edited Aug 26 '21 at 09:40

This happens when the indices of the dataframes being concatenated differ. After preprocessing, the resulting dataframe loses the original index and gets a default one. Setting the index of each dataframe to be concatenated back to the original works, i.e. df_concatenated.index = df_original.index.
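
A minimal sketch of that idea on made-up data (the frame contents, the names `scaled` and `dummies`, and the division by 5.0 standing in for a scaler are all assumptions for illustration). Assigning the index this way is only valid when both frames have the same number of rows in the same order.

```python
import pandas as pd

# Hypothetical original frame with a non-default index.
df_original = pd.DataFrame(
    {"num": [1.0, 3.0, 5.0], "cat": ["a", "b", "a"]}, index=[4, 8, 15]
)

# Going through a NumPy array (as a scaler's fit_transform does) drops the index.
scaled = pd.DataFrame(df_original[["num"]].to_numpy() / 5.0, columns=["num"])
# get_dummies keeps the original index labels 4, 8, 15.
dummies = pd.get_dummies(df_original[["cat"]], prefix_sep="__")

# Restore the original labels before concatenating so the rows line up.
scaled.index = df_original.index
result = pd.concat([scaled, dummies], axis=1)
print(result.shape)  # (3, 3)
```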

Art

answered Aug 25 '21 at 02:51

SACHIN KUMAR
