
Thank you for your help in advance. I am trying to use imblearn's RandomUnderSampler class and its fit_sample() method to balance a botnet dataset that has two missing values. The dataset contains a label column for binary classification with the values 0 and 1. I am working in Azure ML designer, where I created an Execute Python Script module and filled the missing data with the column mean via mean(). There are no infinity values; the largest decimal value is 5,747.13 and the smallest is 0.

**Dataset sample with a few entries:**

[image of the dataset sample not available]

Code Snippet:

def azureml_main(dataframe1 = None, dataframe2 = None):

    # Handle Nan values 
    dataframe1.fillna(dataframe1.mean(), inplace=False)
    
    # Execution logic goes here
    rus = RandomUnderSampler(random_state=0)

    X = dataframe1.drop(dataframe1[['label']], axis=1)
    y = np.squeeze(dataframe1[['label']]) 

    X_rus, y_rus = rus.fit_sample(X, y)  # line 32: raises the ValueError

**Error:**

---------- Start of error message from Python interpreter ----------
Got exception when invoking script at line 32 in function azureml_main: 'ValueError: Input contains NaN, infinity or a value too large for dtype('float64').'.
---------- End of error message from Python  interpreter  ----------

I used fillna to address the 2 missing values. I am not sure how to handle the large decimal values without affecting the current values.
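One thing worth checking in the snippet above: fillna(..., inplace=False) returns a new DataFrame and leaves dataframe1 unchanged unless the result is assigned back, and mean() only covers numeric columns, so missing values in string columns are not filled either way. The following is a minimal sketch on a hypothetical toy frame (the column names pkts and proto and all values are invented for illustration, and numeric_only=True is an assumption so the call works across pandas versions):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one numeric column with a gap, one string column.
df = pd.DataFrame({
    "pkts": [1.0, np.nan, 3.0],
    "proto": [" 6", "17 ", None],
    "label": [0, 1, 0],
})

# The return value is discarded here, so df itself is not modified.
df.fillna(df.mean(numeric_only=True), inplace=False)
still_missing = int(df["pkts"].isna().sum())

# Assigning the result keeps the filled values.
filled = df.fillna(df.mean(numeric_only=True))

# Columns that are not numeric, and the NaN count that remains per column.
non_numeric = filled.select_dtypes(exclude="number").columns.tolist()
nan_counts = filled.isna().sum()

print("NaN left in pkts without assignment:", still_missing)
print("non-numeric columns:", non_numeric)
print(nan_counts)
```

Running the same two checks (select_dtypes and isna().sum()) on the real dataframe1 just before the sampler call should show which columns still trigger the ValueError. Note also that newer imbalanced-learn releases name the method fit_resample rather than fit_sample.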

molbdnilo
Ghada
  • I was able to solve this issue. I replaced the missing values with 0. – Ghada Sep 09 '22 at 23:16
  • Please read the [description](https://stackoverflow.com/tags/ml/info) of the ML tag. – molbdnilo Sep 10 '22 at 05:33
  • Thank you! I thought it means machine learning. – Ghada Sep 11 '22 at 14:35
  • @Ghada could you please post the solution in answer section. It would help other community members – Madhuraj Vadde Sep 20 '22 at 02:27
  • This is how I resolved the issue: I used the to_numeric() function to convert the string to numeric after removing the spaces in the string. columns = ['flgs', 'proto', 'saddr', 'daddr', 'state', 'category', 'subcategory'] for x in columns: dataframe1[x] = pd.to_numeric(dataframe1[x].str.replace(' ', ''), downcast='float', errors ='coerce').fillna(0) – Ghada Sep 24 '22 at 12:33

1 Answer

Thank you, Ghada. Posting your solution in the answer section to help other community members.

The fix was to remove the spaces from the string columns and then convert them to numeric with the to_numeric() function. Values that cannot be parsed are coerced to NaN and then replaced with 0.

    columns = ['flgs', 'proto', 'saddr', 'daddr', 'state', 'category', 'subcategory']
    for x in columns:
        dataframe1[x] = pd.to_numeric(dataframe1[x].str.replace(' ', ''), downcast='float', errors='coerce').fillna(0)
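To see what this conversion does to individual values, here is a small self-contained sketch. The sample frame is hypothetical (the real dataset's values are not shown in the post), and only two of the listed columns are used:

```python
import pandas as pd

# Hypothetical sample data, invented for illustration only.
dataframe1 = pd.DataFrame({
    "proto": [" 6", "1 7", "tcp"],
    "state": ["1", " 2 ", None],
    "label": [0, 1, 0],
})

columns = ["proto", "state"]
for x in columns:
    # Remove every space, then parse; unparseable entries become NaN.
    stripped = dataframe1[x].str.replace(" ", "", regex=False)
    numeric = pd.to_numeric(stripped, downcast="float", errors="coerce")
    # Replace the NaN values produced by coercion (or already missing) with 0.
    dataframe1[x] = numeric.fillna(0)

print(dataframe1)
```

With errors='coerce', any entry that is not a plain number (for example a protocol name, or a dotted IP address in columns such as saddr and daddr) silently becomes 0, so it may be worth checking how many values were coerced before relying on the result.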

Madhuraj Vadde