When I convert a pandas DataFrame to an H2O frame with the h2o.H2OFrame() function, the resulting H2O frame contains additional rows.
When I looked into this, the new rows appear to be duplicates of existing rows. The number of duplicate rows varies with the data size, but is typically around 2-10.
Code:
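import h2o
h2o.init()  # connect to (or start) a local H2O cluster
# train_df_complete is the pandas DataFrame being converted
train_h2o = h2o.H2OFrame(python_obj=train_df_complete)
print(train_df_complete.shape[0])  # row count in pandas
print(train_h2o.nrow)  # row count in H2O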
Output:
3871998
3872000
As you can see, 2 additional rows have been added. On closer inspection, 2 of the users now have 2 rows each, i.e. 2 rows have been duplicated.
This appears to be a major bug. Has anyone experienced this problem, and is there a way to fix it?
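One way to find which users ended up with more than one row is sketched below. It is only an illustration under stated assumptions: the helper name duplicated_ids is hypothetical (it is not part of pandas or H2O), and it assumes the data has a single column that identifies each user.

```python
from collections import Counter


def duplicated_ids(ids):
    """Return {id: count} for every id that occurs more than once.

    `ids` can be any iterable of hashable values, for example a pandas
    Series holding the user identifier column.
    """
    counts = Counter(ids)
    return {key: n for key, n in counts.items() if n > 1}


# Small self-contained demonstration with made-up ids
sample = ["u1", "u2", "u2", "u3", "u4", "u4"]
print(duplicated_ids(sample))
```

Assuming the identifier column is called user_id (a placeholder name), the same helper could be run on train_df_complete["user_id"] and on train_h2o["user_id"].as_data_frame()["user_id"]. Ids that show up only in the second result are the rows that were duplicated during conversion.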
Thanks