22

I'm concatenating two dataframes side by side, so that the columns of one are placed next to the columns of the other. But first I applied a transformation to the initial dataframe:

scaler = MinMaxScaler()
real_data = pd.DataFrame(scaler.fit_transform(df[real_columns]), columns=real_columns)

And then I concatenate:

categorial_data = pd.get_dummies(df[categor_columns], prefix_sep='__')
train = pd.concat([real_data, categorial_data], axis=1, ignore_index=True)

I don't know why, but the number of rows increased:

print(df.shape, real_data.shape, categorial_data.shape, train.shape)
(1700645, 23) (1700645, 16) (1700645, 130) (1703915, 146)

What happened, and how can I fix the problem?

As you can see, the number of columns in train equals the sum of the columns in real_data and categorial_data.

Watty62
Rocketq
  • 1
    related: https://stackoverflow.com/questions/32801806/pandas-concat-ignore-index-doesnt-work and https://stackoverflow.com/questions/50250228/is-there-a-way-to-horizontally-concatenate-dataframes-of-same-length-while-ignor – EdChum May 16 '18 at 10:15

4 Answers

33

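The problem is that the index survives the operations you perform on a dataframe. If rows were dropped or filtered earlier, for example, df (and therefore categorial_data) keeps its original, non-contiguous index labels, while real_data is built from a NumPy array and gets a fresh 0..n-1 index. pd.concat with axis=1 aligns rows by index label, so labels present in only one dataframe become extra rows filled with NaN. Calling reset_index(drop=True) on the dataframes before concatenating will solve your problem.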
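A minimal sketch of how mismatched index labels inflate the row count. The data here is hypothetical and not from the question, and a plain division stands in for MinMaxScaler, since all that matters is that the step returns a bare NumPy array:

```python
import pandas as pd

# Hypothetical data: filtering leaves the index labels 1, 3, 4 (not 0, 1, 2).
df = pd.DataFrame({"x": [10.0, 20.0, 30.0, 40.0, 50.0],
                   "c": ["a", "b", "a", "b", "a"]})
df = df[~df["x"].isin([10.0, 30.0])]

# Rebuilding from a NumPy array gives a fresh RangeIndex 0, 1, 2.
# (Stand-in for scaler.fit_transform, which also returns a NumPy array.)
real_data = pd.DataFrame(df[["x"]].to_numpy() / 50.0, columns=["x"])

# get_dummies keeps the index labels of df: 1, 3, 4.
categorial_data = pd.get_dummies(df[["c"]], prefix_sep="__")

# axis=1 aligns rows on the union of labels {0, 1, 2, 3, 4}.
# ignore_index=True only renumbers the columns here; it does not
# change how rows are aligned.
bad = pd.concat([real_data, categorial_data], axis=1, ignore_index=True)

# Resetting both indices first makes the rows line up by position.
good = pd.concat([real_data.reset_index(drop=True),
                  categorial_data.reset_index(drop=True)], axis=1)

print(bad.shape, good.shape)  # (5, 3) (3, 3)
```

The extra rows in the misaligned result contain NaN in whichever columns had no matching label, which is a quick way to confirm this is what happened.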

saket ram
  • 2
    Ran into the same issue. To be precise, call df.reset_index() on the dataframes you want to concatenate, not on the resulting dataframe. – Nidupb Sep 29 '20 at 22:26
11

When you perform operations such as filtering on a dataframe, its dimensions change but its index labels do not, so you need to call reset_index on the dataframe.

For concatenation you can do this:

result_df = pd.concat([first_df.reset_index(drop=True), second_df.reset_index(drop=True)], axis=1)
Lucky Suman
4

I solved the problem by using NumPy's hstack:

train = pd.DataFrame(np.hstack([real_data, categorial_data]))
Rocketq
  • 4
    This way you lose all the dataframe information (e.g. column names, index). – Tonca Sep 06 '18 at 10:32
  • @Tonca How can I retain all the dataframe info? – NewInPython Nov 01 '19 at 19:01
  • 1
    In this case the columns are easy to keep because they remain the same as in the original dataframes. The problem comes with the index. If `concat` gives back a different number of rows (as explained in the question), it means that the indices of the dataframes are not identical. I think it would be better to understand why they differ instead of forcing the concatenation. – Tonca Nov 04 '19 at 13:20
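One possible way to keep the column names and an index when stacking with NumPy, as a sketch only. The two frames are hypothetical, and reusing the index of categorial_data is an assumption about which labels should be kept:

```python
import numpy as np
import pandas as pd

# Hypothetical frames: same number of rows, different index labels.
real_data = pd.DataFrame({"x": [0.2, 0.6, 1.0]})
categorial_data = pd.DataFrame({"c__a": [0, 0, 1], "c__b": [1, 1, 0]},
                               index=[1, 3, 4])

# Check whether the indices really differ before forcing anything.
same_index = real_data.index.equals(categorial_data.index)

# hstack is purely positional, so the row counts must agree.
assert len(real_data) == len(categorial_data)

train = pd.DataFrame(
    np.hstack([real_data.to_numpy(), categorial_data.to_numpy()]),
    # Column names are carried over from the two source frames.
    columns=list(real_data.columns) + list(categorial_data.columns),
    # Assumption: the labels of categorial_data are the ones to keep.
    index=categorial_data.index,
)
```

Note that hstack produces a single array with one common dtype, so mixed integer and float columns all end up as floats.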
1

This happens when the indices of the dataframes being concatenated differ. After preprocessing, the original index is lost and the resulting dataframe gets a default one. Setting the index of each dataframe back to the original before concatenating works, i.e. df_concatenated.index = df_original.index, where df_concatenated is a dataframe about to be concatenated.
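As a sketch of that idea, the original index can also be passed directly when the preprocessed dataframe is built. The data is hypothetical, and a hand-written min-max scaling stands in for the preprocessing step, assuming only that it returns a NumPy array:

```python
import pandas as pd

# Hypothetical dataframe whose index has gaps, e.g. after filtering.
df = pd.DataFrame({"x": [20.0, 40.0, 50.0], "c": ["b", "b", "a"]},
                  index=[1, 3, 4])
real_columns = ["x"]

# Stand-in for the preprocessing step: min-max scaling on a NumPy array.
values = df[real_columns].to_numpy()
lo, hi = values.min(axis=0), values.max(axis=0)
scaled = (values - lo) / (hi - lo)

# Passing index=df.index restores the original labels at construction time.
real_data = pd.DataFrame(scaled, columns=real_columns, index=df.index)
categorial_data = pd.get_dummies(df[["c"]], prefix_sep="__")

# Both frames now share the same labels, so the rows align one to one.
train = pd.concat([real_data, categorial_data], axis=1)
```

Unlike reset_index(drop=True), this keeps the original labels, which can be useful if the rows later need to be matched back to df.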

Art