Thank you in advance for your help. I am trying to use RandomUnderSampler() and its fit_sample() method from imblearn to balance a botnet dataset that has two missing values. The dataset contains a label column for binary classification with the values 0 and 1. I am working in Azure ML designer, where I created an Execute Python Script module and filled the missing data with the column means using mean(). There are no infinity values; the largest value is 5,747.13 and the smallest is 0.
**Dataset sample with a few entries:**
Code Snippet:

    import numpy as np
    from imblearn.under_sampling import RandomUnderSampler

    def azureml_main(dataframe1=None, dataframe2=None):
        # Handle NaN values
        dataframe1.fillna(dataframe1.mean(), inplace=False)
        # Execution logic goes here
        rus = RandomUnderSampler(random_state=0)
        X = dataframe1.drop(dataframe1[['label']], axis=1)
        y = np.squeeze(dataframe1[['label']])
        X_rus, y_rus = rus.fit_sample(X, y)  # line 32 with the ValueError

**Error:**
---------- Start of error message from Python interpreter ----------
Got exception when invoking script at line 32 in function azureml_main: 'ValueError: Input contains NaN, infinity or a value too large for dtype('float64').'.
---------- End of error message from Python interpreter ----------
I used fillna() to address the two missing values. I am not sure how to handle the large decimal values without altering the existing data.
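
One thing that may be worth checking before looking at the large values: with inplace=False, fillna() returns a new DataFrame and leaves the original unchanged, so if the result is not assigned, the two NaN values would still be present when the sampler runs. A maximum of 5,747.13 is far below the float64 limit, so magnitude alone seems unlikely to trigger this error. The sketch below uses only pandas and NumPy; the column names and values are hypothetical stand-ins, not the real botnet data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the botnet data: two missing values, a binary label.
df = pd.DataFrame({
    "duration": [0.0, 5747.13, np.nan, 12.5],
    "bytes": [10.0, np.nan, 30.0, 40.0],
    "label": [0, 1, 0, 1],
})

# With inplace=False, fillna returns a new frame; df itself is left untouched.
df.fillna(df.mean(), inplace=False)
nans_if_result_discarded = int(df.isna().sum().sum())

# Assigning the returned frame keeps the filled values.
filled = df.fillna(df.mean())
nans_after_assignment = int(filled.isna().sum().sum())

# Sanity check before resampling: every feature value should be finite.
X = filled.drop(columns=["label"])
y = filled["label"]
all_finite = bool(np.isfinite(X.to_numpy(dtype="float64")).all())

print(nans_if_result_discarded, nans_after_assignment, all_finite)
```

If the check passes, X and y can then be handed to the sampler. Note that newer imbalanced-learn releases use fit_resample() in place of fit_sample(), so the method name may need to change depending on the installed version.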