2

I have a time series dataset indexed by time, with a few feature variables and a Humidity reading. I have already trained an ML model to predict Humidity values based on X, Y and Z. Now, when I load the saved model using pickle, I would like to fill the missing Humidity values using X, Y and Z. However, it should take into account that X, Y and Z themselves shouldn't be missing.

Time                    X        Y        Z       Humidity
1/2/2017 13:00          31       22       21           48
1/2/2017 14:00          NaN      12       NaN          NaN
1/2/2017 15:00          25       55       33           NaN

In this example, the Humidity value in the last row will be filled using the model, whereas the 2nd row should not be predicted by the model since X and Z are also missing.

I have tried this so far:

with open('model_pickle','rb') as f:
    mp = pickle.load(f)

for i, value in enumerate(df['Humidity'].values):
    if np.isnan(value):
        df['Humidity'][i] = mp.predict(df['X'][i],df['Y'][i],df['Z'][i])

This gave me the error 'predict() takes from 2 to 5 positional arguments but 6 were given', and it also does not check whether the X, Y and Z column values are missing. Below is the code I used to train the model and save it to a file:

df = df.dropna()

dfTest = df.loc['2017-01-01':'2019-02-28']
dfTrain = df.loc['2019-03-01':'2019-03-18'] 
features = [ 'X', 'Y', 'Z'] 
train_X = dfTrain[features]
train_y = dfTrain.Humidity
test_X = dfTest[features]
test_y = dfTest.Humidity

model = xgb.XGBRegressor(max_depth=10,learning_rate=0.07)
model.fit(train_X,train_y)
predXGB = model.predict(test_X)
mae = mean_absolute_error(predXGB,test_y)
import pickle
with open('model_pickle','wb') as f:
    pickle.dump(model,f)

I had no errors during training and saving the model.

Sakib Shahriar
  • On what dataset did you train the model? Did that dataset also have missing values in the X, Y, Z features? If so, how did you handle them? – Stergios May 14 '20 at 07:54
  • Yes, on the same dataset but using another year's data. The model was trained using complete data (no missing values), so that's not the problem. – Sakib Shahriar May 14 '20 at 18:17
  • OK, if you had missing values in the training set, then you would have to use the same imputation methods in the test set as well. You have to use one of the existing imputation methods, no other way around it. Of course, you should expect lower accuracy in your model compared to, say, your cross-validation error due to this imputation thing. – Stergios May 15 '20 at 06:09
  • Please clarify: what model are you loading? Are you sure that using ```.predict``` works? You show us an error regarding a method you did not supply, so it is a bit hard to help you. Also, please explain if you used any imputation methods during training – Roim May 15 '20 at 13:14
  • @Roim I have edited the question to add more clarity. – Sakib Shahriar May 15 '20 at 19:39
  • @Sakib Shahriar Did you try the below answer? Any update/closure on the question? – dumbPy May 16 '20 at 07:26
  • You can impute the missing values, e.g. "mean", "most frequent" or "ARIMA", etc. – information_interchange May 16 '20 at 15:45
  • @information_interchange yes I'm aware but this is a project that's exploring the effectiveness of custom approaches to missing values. – Sakib Shahriar May 16 '20 at 19:49

3 Answers

0

Can you report the error?

Anyway, if you have missing values you have different options for dealing with them. You can either discard the data point entirely or try to infer the missing parts with a method of your choice: mean, interpolation, etc.

Pandas documentation has a nice guide on how to deal with them: https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html
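
As an illustration, here is a minimal sketch of those options with pandas (assuming the column names from the question):

# drop every row that has any missing value
df_clean = df.dropna()

# or fill a single column with a simple statistic
df['X'] = df['X'].fillna(df['X'].mean())

# or interpolate the feature columns along the time index
df[['X', 'Y', 'Z']] = df[['X', 'Y', 'Z']].interpolate(method='linear')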

Hexash
  • I've edited the question with the error reported. I'm aware of the various missing-value interpolation methods in pandas, but in my case the variable that's missing is closely related to the other variables, so using the model-based approach should give more accurate imputation. – Sakib Shahriar May 09 '20 at 19:15
  • You are loading a model which you do not specify, so I can't really tell why it's asking 2-5 arguments. Anyway it won't work if you don't deal with the missing values. Again either you delete the whole line or you substitute the X, Y, Z NaNs with something else (interpolation, mean, carry over, etc.) – Hexash May 09 '20 at 23:59
0

Try

df['Humidity'][i] = mp.predict(df[['X', 'Y', 'Z']].iloc[[i]])[0]

This way the data is passed as a single argument, as the function expects. The way you wrote it, you split your data into three separate arguments.
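
Combined with the requirement that X, Y and Z must all be present, the loop from the question could look like this (a sketch under that assumption, not the answer author's exact code):

import numpy as np

for i in range(len(df)):
    row = df.iloc[i]
    # only predict when Humidity is missing and all of X, Y, Z are present
    if np.isnan(row['Humidity']) and not row[['X', 'Y', 'Z']].isnull().any():
        features_2d = df[['X', 'Y', 'Z']].iloc[[i]]   # one-row DataFrame, shape (1, 3)
        df.loc[df.index[i], 'Humidity'] = mp.predict(features_2d)[0]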

igrinis
0

For prediction, since you want to make sure you have all the X, Y, Z values, you can do,

df = df.dropna(subset = ["X", "Y", "Z"])

And now you can predict the values for the remaining valid examples as,

# where features = ["X", "Y", "Z"]
df['Humidity'] = mp.predict(df[features]) 

mp.predict will return predictions for all the rows at once, so there is no need to predict iteratively.

Edit:

For inference, say you have a dataframe df, you can do,

# Get rows with missing Humidity where it can be predicted.
df_inference = df[df.Humidity.isnull()]

# remaining rows
df = df[df.Humidity.notnull()]

# This might still have rows with missing features.
# Since you cannot infer with missing features, remove them too and add them to the remaining rows
df = df.append(df_inference[df_inference[features].isnull().any(axis=1)])

# and remove them from df_inference
df_inference = df_inference[~df_inference[features].isnull().any(axis=1)]

# Now you can infer on these rows
df_inference['Humidity'] = mp.predict(df_inference[features])

# Now you can merge this back into the remaining rows to recover the original number of rows, and sort by index
df = df.append(df_inference)
df = df.sort_index()
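
A more compact alternative (a sketch, assuming the original unsplit df, the loaded model mp and the same features list) is to build a boolean mask and assign back with .loc, which avoids splitting and re-appending the frame and keeps the original row order:

# rows where Humidity is missing but all of X, Y, Z are present
mask = df['Humidity'].isnull() & df[features].notnull().all(axis=1)

# predict only for those rows and write the values back in place
df.loc[mask, 'Humidity'] = mp.predict(df.loc[mask, features])
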
dumbPy
  • Thanks @dumbPy this worked and did not give any error. However, now all the rows are predicted by the model, i.e. the original data which were not missing were also predicted. I understand this approach without a loop is quicker, but I should also be checking whether the row was missing in the first place. Any suggestions? – Sakib Shahriar May 16 '20 at 19:47
  • I was hoping you had separated train and test data from the inference data (data with missing Humidity value). Adding this to the answer – dumbPy May 16 '20 at 20:32
  • Let me clarify a bit more. Training is not relevant here because I'm loading a pre-trained model. Now let's say the humidity column has 15% missing values; I would want to run predict on those 15% of rows only. The remaining rows will remain untouched, hope it makes sense – Sakib Shahriar May 16 '20 at 22:48
  • Ok. The `df_inference` I defined above is what you need then. `df = df[df.Humidity.isnull()]` will give you a DataFrame in which Humidity is NaN for all rows. You can get the predictions as `df['Humidity'] = mp.predict(df[features])` – dumbPy May 16 '20 at 23:10
  • Any way I can retain the original dataset? Because right now I am only selecting the NaN rows, so the rest of the data (with its X, Y and Z values) is discarded as well. – Sakib Shahriar May 17 '20 at 01:06
  • Updated the answer to reflect everything you need: inferring on the inferrable rows and merging them back into the original rows to get back all the rows. – dumbPy May 17 '20 at 15:47
  • Thank you so much! – Sakib Shahriar May 17 '20 at 17:30