
Converting a pandas DataFrame to an H2O frame with the h2o.H2OFrame() function produces an incorrect result.

Additional rows are being created in the H2O frame. When I looked into this, the new rows turned out to be duplicates of existing rows. The number of duplicate rows varies with the data size, but is typically around 2-10.

Code:

train_h2o = h2o.H2OFrame(python_obj=train_df_complete)

print(train_df_complete.shape[0])
print(train_h2o.nrow)

Output:

3871998
3872000

As you can see, 2 additional rows have been added. On closer inspection, 2 of the users now have 2 rows each, i.e. 2 rows have been duplicated.
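One way to pin down exactly which rows came back more than once is to tag every row with a unique ID before conversion and count the IDs afterwards. The sketch below is pandas-only: `row_id` and `find_extra_ids` are hypothetical names, and the converted frame is simulated with toy data rather than produced by H2O (in practice it would come from something like `train_h2o.as_data_frame()`).

```python
import numpy as np
import pandas as pd


def find_extra_ids(roundtrip: pd.DataFrame, id_col: str = "row_id") -> pd.Series:
    """Return the IDs that occur more than once, together with their counts."""
    counts = roundtrip[id_col].value_counts()
    return counts[counts > 1].sort_index()


# Toy stand-in for the original data, tagged with one unique ID per row.
original = pd.DataFrame({"user": ["a", "b", "c", "d"], "x": [1.0, 2.0, 3.0, 4.0]})
original["row_id"] = np.arange(len(original))

# Simulated conversion result in which rows 1 and 3 came back twice.
roundtrip = pd.concat([original, original.iloc[[1, 3]]], ignore_index=True)

extra = find_extra_ids(roundtrip)
print(extra)
```

The index of `extra` gives the positions of the affected rows in the original data, which makes it easier to check whether the same positions are hit on every run.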

This appears to be a major bug. Does anyone have experience with this problem, and is there a way to fix it?

Thanks

George
  • This issue is likely triggered by a specific dataset. Can you please provide more details about the data? Are there any string columns with multi-line values? We are aware of an issue with NA values (https://0xdata.atlassian.net/browse/PUBDEV-4723) but your problem seems different. – Michal Kurka Aug 14 '17 at 16:34
  • The Pandas data frame had the following structure: RangeIndex: 3871998 entries, 0 to 3871997 Data columns (total 34 columns) dtypes: float64(27), int64(4), object(3) memory usage: 1004.4+ MB. There were no multi-line strings and the duplicate rows happened at the same index each time. – George Aug 15 '17 at 08:10
  • Thank you. I was not able to reproduce the issue on a synthetic dataset. Would you be able to file a bug in jira.h2o.ai? It would help if the Jira issue also included the H2O logs. – Michal Kurka Aug 15 '17 at 21:57
  • I get the same with [this dataset](https://www.kaggle.com/c/porto-seguro-safe-driver-prediction/download/test.7z) (need to be logged in to Kaggle). 892,816 rows, 57 columns, mix of floats and integers. No strings or NaNs. 3 rows are duplicated at positions 90989, 197125, and 409416. H2O 3.14.0.7, win7. Just `h2o.H2OFrame(pd.read_csv('test.csv', index_col='id'))`. – bckygldstn Oct 25 '17 at 15:41
  • I get the exact same problem... I'm using H2O version `3.18.0.2` – nirvana-msu Aug 01 '18 at 13:29
  • @MichalKurka I'm getting the exact same problem as well: one extra row is being added when I convert a specific dataset I am using into an H2O frame. – Nate Thompson Sep 05 '18 at 17:00
  • Can you please export the data to a CSV file and then h2o.import_file and let us know if that will produce expected results? – Michal Kurka Sep 06 '18 at 22:49
  • @MichalKurka I am also getting exactly the same issue. Has anyone managed to solve this? – Alan Chalk Jul 23 '19 at 23:24
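The CSV round trip suggested in the comments could be sketched roughly as follows. This is a minimal sketch with toy data, not a confirmed fix: `export_for_h2o` and `import_with_h2o` are hypothetical helper names, the file path is a temporary one chosen for illustration, and the H2O step is left commented out because it needs a running cluster (for example after `h2o.init()`).

```python
import os
import tempfile

import pandas as pd


def export_for_h2o(df: pd.DataFrame, path: str) -> int:
    """Write df to CSV without the index and return the number of data rows written."""
    df.to_csv(path, index=False)
    return len(df)


def import_with_h2o(path: str):
    """Parse the CSV with H2O's own file importer (requires a running cluster)."""
    import h2o  # imported lazily so the pandas part runs without H2O installed

    return h2o.import_file(path)


# Toy data standing in for train_df_complete.
df = pd.DataFrame({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})
csv_path = os.path.join(tempfile.mkdtemp(), "train.csv")
expected_rows = export_for_h2o(df, csv_path)

# With a cluster running, compare the two row counts:
# frame = import_with_h2o(csv_path)
# print(expected_rows, frame.nrow)
```

If the counts match on this path but not with `h2o.H2OFrame(...)`, that would point at the in-memory conversion rather than at the data itself.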

3 Answers

I had the same issue. Assuming your original data does not contain genuine duplicates, just identify the indices of the duplicated rows in the data frame and remove them. Unfortunately, the H2O frame has limited functionality.

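# Pull the frame back into pandas to find the positions of the duplicated rows
temp_df = train_h2o.as_data_frame()
# Drop those rows from the H2O frame by position (axis=0)
train_h2o = train_h2o.drop(list(temp_df[temp_df.duplicated()].index), axis=0)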
Alex G

If your dataset can contain other duplicate rows that do not come from this H2O bug, the proposed solution will also drop those rows. If you want to make sure that you remove only the additional rows added by H2O, this solution might help you out:

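import numpy as np
from h2o import H2OFrame

temp_df = train_df_complete.copy()
# Unique ID per original row, so only rows duplicated by the conversion share an ID
temp_df['__temp_id__'] = np.arange(len(temp_df))
train_h2o = H2OFrame(temp_df)
# drop_duplicates returns a new frame, so the result must be assigned
train_h2o = train_h2o.drop_duplicates(columns=['__temp_id__'], keep='first')
train_h2o = train_h2o.drop('__temp_id__', axis=1)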

What I'm doing here is creating a temporary column that I then use as an ID in order to drop only the duplicates that have been generated by H2OFrame. Once the duplicates have been removed, I drop the temporary column. It might not be the most elegant way, but it works.
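To illustrate why the temporary ID matters, here is a pandas-only analogue of the approach above, using toy data and a simulated conversion artifact instead of a real H2O call. Note that pandas names the parameter `subset`, whereas the H2O code above uses `columns`.

```python
import numpy as np
import pandas as pd

# Toy data with one genuine duplicate (rows 0 and 1 are identical on purpose).
df = pd.DataFrame({"user": ["a", "a", "b"], "x": [1, 1, 2]})
df["__temp_id__"] = np.arange(len(df))

# Simulate the conversion artifact: row 2 comes back twice.
converted = pd.concat([df, df.iloc[[2]]], ignore_index=True)

# Deduplicating on content alone also removes the genuine duplicate.
by_content = converted.drop(columns="__temp_id__").drop_duplicates()

# Deduplicating on the temporary ID removes only the artifact.
by_id = (
    converted.drop_duplicates(subset=["__temp_id__"], keep="first")
    .drop(columns="__temp_id__")
)

print(len(df), len(converted), len(by_content), len(by_id))
```

Content-based deduplication ends up with fewer rows than the original, while the ID-based version restores the original row count and keeps the intentional duplicate.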

Molessia

I had the same issue with a specific dataset. Resetting the index on the base data frame worked for me.

import h2o

train_df_complete = train_df_complete.reset_index()
train_h2o = h2o.H2OFrame(train_df_complete)

I am using h2o 3.30.1.3.
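One thing to keep in mind with this approach: with default arguments, pandas `reset_index()` keeps the old index as a new column (named `index` when the index is unnamed), and that column would then be carried into the H2O frame. If it is not wanted, `drop=True` discards it. A small pandas-only sketch with toy data:

```python
import pandas as pd

# Toy frame with a non-default index, e.g. left over from filtering rows.
df = pd.DataFrame({"x": [10, 20, 30]}, index=[5, 7, 9])

kept = df.reset_index()              # old index becomes a new "index" column
dropped = df.reset_index(drop=True)  # old index is discarded

print(list(kept.columns))
print(list(dropped.columns))
```

Whether `drop=True` resolves the duplicate-row problem equally well is an assumption here; the answer above only reports success with the default call.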