
I'm using the IterativeImputer from the sklearn library to impute missing values in rain datasets, which contain the rain stations and the rain data (each station is a column, and the index is a DateTime). I was able to run the IterativeImputer and get an output with all my missing values filled. The problem is that the output contains negative values. It's possible to change the min_value that it imputes, but that sets a single value for all the columns. I want to set a min_value based on the minimum value of each column before the imputation. There is an answer here on Stack Overflow about that, but I have no clue how to do it.

The code I'm using:

import pandas as pd
import numpy as np
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.compose import make_column_transformer
from sklearn.compose import make_column_selector


#Babitonga's region stations
babi_ana = pd.read_csv(all_csv_files[0]).set_index("Time") # Here I read the csv data

# Transforming my index to datetime
babi_ana.index = pd.to_datetime(babi_ana.index)
mask = (babi_ana.index > ini1) & (babi_ana.index <= fim1) #Selecting the date range 
babi_ana1 = babi_ana.loc[mask]

# Applying the imputer
imputer_data = IterativeImputer(random_state = 0,skip_complete=True,sample_posterior=True, max_iter = 10, missing_values = np.nan)
data = babi_ana1 
minimum = data.iloc[:,:].min(axis=0) #No negative values from the original
imputer_data.fit(data.iloc[:,:].values)
data_imputed = imputer_data.transform(data.iloc[:,:].values)

# Here I realize the output has negative values
data_imputed = pd.DataFrame(data_imputed)
minimum_after = data_imputed.iloc[:,:].min(axis=0) # several negative values, except for 2 stations

I want to be able to set min_value and max_value based on the max and min of each station before the imputation, like this:

max_imputer = data.iloc[:,:].max(axis = 0)
min_imputer = data.iloc[:,:].min(axis = 0)
  • Please post your code, instead of a verbal description of it. – desertnaut May 21 '20 at 23:19
  • Sorry about that. – Bryan Thomas May 21 '20 at 23:34
  • Bryan, welcome to StackOverflow :). I think your question is missing some description of what you have tried to do, what your goal was and what went wrong. Those would help get more answers. I am not familiar with Imputers or the sklearn library, but I'd recommend you look into this question: https://stackoverflow.com/questions/38150330/python-sklearn-imputer-usage?rq=1 and the answers. If nothing else, it will show you well formatted questions/answers. Good luck! – Gabriel Pires May 22 '20 at 18:27
  • I've read a bit on https://scikit-learn.org/stable/modules/impute.html about the Imputer and have some questions for you: 1) Is there a reason you cannot use the `SimpleImputer` instead? It is simpler to use, and if you take the mean for missing values, they won't ever be negative (unless your input contains negative values, which is unlikely for rain data). 2) Can you show where `babi_ana1` comes from or what the data looks like? I wonder if `data.iloc[:,:].values` looks as expected. – Gabriel Pires May 22 '20 at 18:39
  • Hey, thank you, Gabriel. I'm using the `IterativeImputer` to fill the missing data from all the rain stations of a region, so the function gets information about every station and imputes a value, doing it several times (they call it a round-robin fashion) until it has a decent result. I did not use the `SimpleImputer` because I want a better result based on multivariate feature imputation. I'll edit the question to give a more detailed explanation. – Bryan Thomas May 22 '20 at 19:20

1 Answer


Great improvements on the question :).

I've read a bit more about the IterativeImputer here: https://scikit-learn.org/stable/modules/generated/sklearn.impute.IterativeImputer.html#sklearn.impute.IterativeImputer.

It seems that it can take a min_value parameter in the constructor, which expects either a float or an array. If a single minimum value applies to all features (columns) of your data, you can just use the float alternative.

For example, if you want the minimum possible value to be 0 in all features (columns), you could change your code to:

imputer_data = IterativeImputer(random_state = 0, skip_complete = True,sample_posterior = True, max_iter = 10, missing_values = np.nan, min_value = 0)

On the other hand, if you want different minimum values for different features, you need to use an array whose length equals the number of features. For example: if you have 2 features and the minimum values should be 0 and 5, respectively, you would change your code to:

imputer_data = IterativeImputer(random_state = 0, skip_complete = True,sample_posterior = True, max_iter = 10, missing_values = np.nan, min_value = [0, 5])

You can do the same for the max_value parameter.

The first change should make sure you don't get any more negative imputed values.
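As a minimal sketch of the array form with both bounds set, here is a self-contained version on toy data (not your stations). It assumes scikit-learn 0.23 or later, where the documentation lists array-like support for min_value and max_value; the array `X` and the bounds are made up for illustration.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy data (an assumption, not the asker's dataset): 2 features with gaps.
X = np.array([
    [0.0, 5.0],
    [1.0, np.nan],
    [np.nan, 7.0],
    [3.0, 8.0],
    [4.0, 9.0],
    [np.nan, 6.0],
])

# One lower and one upper bound per feature, in column order.
imputer = IterativeImputer(
    random_state=0,
    sample_posterior=True,
    max_iter=10,
    min_value=[0, 5],
    max_value=[4, 9],
)
X_imputed = imputer.fit_transform(X)
print(X_imputed)
```

The bounds only constrain the imputed entries; values that were already observed are passed through unchanged.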

If you want to use the min and max values based on the data you already have, the first step should be to write code that goes over each feature in your data and gets both the minimum and maximum values there. It should be the same as getting min and max values in an array; you can probably find lots of Python examples for that if you aren't sure how to do it.
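A rough sketch of that step, assuming the data is a pandas DataFrame with one column per station: the frame `rain` and its station names are hypothetical stand-ins for your data, and I left out skip_complete to keep the example small. pandas skips NaN by default in min() and max(), so the bounds come from the observed values only.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical stand-in for the station table (one column per station).
rain = pd.DataFrame(
    {
        "station_a": [0.0, 2.5, np.nan, 10.0, 0.0, 4.0],
        "station_b": [0.2, np.nan, 1.0, 12.0, 0.4, np.nan],
        "station_c": [np.nan, 3.0, 0.0, 8.0, 1.0, 5.0],
    },
    index=pd.date_range("2020-01-01", periods=6, freq="D"),
)

# Per-column bounds from the observed values, in column order.
col_min = rain.min(axis=0).to_numpy()
col_max = rain.max(axis=0).to_numpy()

imputer = IterativeImputer(
    random_state=0,
    sample_posterior=True,
    max_iter=10,
    min_value=col_min,
    max_value=col_max,
)

# fit_transform returns a plain array, so restore the index and column names.
imputed = pd.DataFrame(
    imputer.fit_transform(rain), index=rain.index, columns=rain.columns
)
print(imputed.min(axis=0))
```

The bounds arrays have to line up with the columns the imputer actually sees, so it is worth checking that `len(col_min)` matches the number of columns you pass to fit.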

As a final note, it's still a bit weird to me how the Imputer outputs negative data after fitting with only positive data. So I'd double check that data.iloc[:,:].values really is the data you want, in the format the Imputer is expecting.
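One way to do that double check is a few quick counts on the input before fitting. This is only a sketch on a made-up frame; `data` here is a hypothetical stand-in for the frame in the question.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `data` in the question.
data = pd.DataFrame(
    {"station_a": [0.0, np.nan, 3.2], "station_b": [1.1, 0.0, np.nan]}
)

values = data.to_numpy()

# Every column should be numeric, otherwise the imputer gets an object array.
all_numeric = all(pd.api.types.is_numeric_dtype(t) for t in data.dtypes)

# Number of gaps the imputer will have to fill.
n_missing = int(data.isna().to_numpy().sum())

# Negative observations in the input (NaN compares as False, so gaps are not counted).
n_negative = int((data < 0).to_numpy().sum())

print(values.shape, all_numeric, n_missing, n_negative)
```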

Gabriel Pires
  • Hey Gabriel, thank you. It worked using a float, but as I said I want to use the minimum value for each station (if this doesn't work I will use a float to fix it). I tried your alternative. I created an array with `np.arange` with six values (the number of my stations) and tried to use it. When I run `imputer_data.fit(data.iloc[:,:].values)` it returns an error: `operands could not be broadcast together with shapes (3,) (6,)`. So I tried `np.arange(3)` and got this: `NumPy boolean array indexing assignment cannot assign 3 input values to the 0 output values where the mask is true`. – Bryan Thomas May 22 '20 at 22:47