
I have two inputs as my independent variables and I want to predict three dependent variables based on them.

Of my three dependent variables, two are multi-class categorical and one is continuous. Below are my target variables.

typeid_encoded, reporttype_encoded, log_count

typeid_encoded and reporttype_encoded are categorical, and each has at least 5 different categories.

log_count is a continuous variable.

I have googled a lot, and all I found is the suggestion to use two different models, but I couldn't find any example of doing so. Could you please post an example?

Or is there another approach, such as neural networks, that makes it possible to do this in one model?

I need an example using scikit-learn. Thanks in advance!

Farhana Naaz Ansari

1 Answer


There's nothing in sklearn that's designed for this, but there are a few tricks you can use to build models like this.

A word of caution: these are not necessarily ideal for your problem; it's very hard to guess what would work for your data.

The two that first came to my mind were k-nearest neighbours (KNN) and random forests, but you can adapt essentially any multi-output regression algorithm to do this.

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import NearestNeighbors

# Create some data to look like yours
n_samples = 100
n_features = 5

X = np.random.random((n_samples, n_features))
y_classification = np.random.random((n_samples, 2)).round()
y_regression = np.random.random((n_samples))

# stack the targets: columns 0-1 are binary, column 2 is continuous
y = np.hstack((y_classification, y_regression[:, np.newaxis]))

Now I have a data set with two binary target variables and one continuous one.

Start with KNN. You could do this with KNeighborsRegressor as well, but I felt this illustrates the solution better:

# use an odd number of neighbours so the binary votes can't tie
nn = NearestNeighbors(n_neighbors=5)
nn.fit(X)

idxs = nn.kneighbors(X, return_distance=False)
# take the average of the nearest neighbours to get the predictions
y_pred = y[idxs].mean(axis=1)
# all raw predictions are continuous, so round the two
# classification columns (0 and 1) to get binary labels
y_pred[:, :2] = y_pred[:, :2].round()
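One caveat: querying the training set like this returns each point as its own nearest neighbour, which flatters the predictions. On genuinely unseen data the call looks the same (X_new here is just a hypothetical array of new samples):

# hypothetical new samples with the same number of features
X_new = np.random.random((10, n_features))
idxs_new = nn.kneighbors(X_new, return_distance=False)
y_pred_new = y[idxs_new].mean(axis=1)
y_pred_new[:, :2] = y_pred_new[:, :2].round()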

Our y_pred is now a matrix of predictions covering both the classification and the regression targets. Next, let's look at a random forest.

# use an odd number of trees to prevent predictions of 0.5
rf = RandomForestRegressor(n_estimators=11)
rf.fit(X, y)
y_pred = rf.predict(X)

# again, all raw predictions are continuous, so round only the
# two classification columns
y_pred[:, :2] = y_pred[:, :2].round()

I'd say these 'hacks' are pretty reasonable because they aren't too far from how the classification versions of these algorithms actually work.
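If you want to convince yourself of that, a quick sanity check is to train an actual multi-output classifier on just the two binary columns and compare it with the rounded regressor output; the two will typically agree on the vast majority of samples:

from sklearn.ensemble import RandomForestClassifier

# fit a multi-output classifier on just the two binary target columns
rfc = RandomForestClassifier(n_estimators=11)
rfc.fit(X, y[:, :2])

# fraction of (sample, target) pairs where the rounded regressor
# and the classifier agree
agreement = (rfc.predict(X) == y_pred[:, :2]).mean()
print(agreement)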

If you have a multiclass problem that you have one-hot encoded, then instead of rounding the probability to a binary class as I did above, you need to choose the class with the highest probability. You can do that pretty simply using something like this:

# suppose y_pred now holds 3 one-hot columns for the first class,
# 4 for the second, and the continuous target in the last column
n_classes_class1 = 3
n_classes_class2 = 4
y_pred_class1 = np.argmax(y_pred[:, :n_classes_class1], axis=1)
y_pred_class2 = np.argmax(y_pred[:, n_classes_class1:-1], axis=1)
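To tie this back to your label-encoded typeid_encoded and reporttype_encoded, here is a minimal end-to-end sketch; the category counts, the random targets, and reusing X from above are all just for illustration. One-hot encode each label-encoded column, stack everything, fit one regressor, then decode with argmax:

# hypothetical label-encoded targets: 5 categories each, as in the question
typeid_encoded = np.random.randint(0, 5, n_samples)
reporttype_encoded = np.random.randint(0, 5, n_samples)
log_count = np.random.random(n_samples)

# one-hot encode with a plain numpy trick: row i of np.eye(5) is the
# one-hot vector for class i
y_multi = np.hstack((np.eye(5)[typeid_encoded],      # columns 0-4
                     np.eye(5)[reporttype_encoded],  # columns 5-9
                     log_count[:, np.newaxis]))      # column 10

rf = RandomForestRegressor(n_estimators=11)
rf.fit(X, y_multi)
pred = rf.predict(X)

typeid_pred = np.argmax(pred[:, :5], axis=1)       # back to label encoding
reporttype_pred = np.argmax(pred[:, 5:10], axis=1)
log_count_pred = pred[:, 10]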
piman314
  • Actually, I converted my two categorical variables with label encoders. From your example above I understand that all predicted values are continuous; how can it give two categorical outputs and one continuous? Is there a workaround? I tried neural networks and got all 3 predictions as continuous, which I don't want – Manikant Kella Mar 27 '18 at 14:08
  • In many classification algorithms the output is originally continuous and is then converted into something categorical, similar to what I've done above. In a neural network this conversion uses a softmax function; here I'm just choosing the class with the highest probability. I'll add a little bit at the end of my answer now to help with the multiclass problem. – piman314 Mar 27 '18 at 14:37
  • Sure, thank you! Also, if possible, can you please explain your code above with some input and output values? I understood a bit, but not completely :) – Manikant Kella Mar 27 '18 at 14:41
  • There's example data in my code (randomly generated); if you copy, paste, and run it you can go over it yourself. I suggest you work through it with your own data to get a better understanding. – piman314 Mar 27 '18 at 14:46