I have a dataset like this:
Description | attributes.occasion.0 | attributes.occasion.1 | attributes.occasion.2 | attributes.occasion.3 | attributes.occasion.4
descr01     | Chanukah              | Christmas             | Housewarming          | Just Because          | Thank You
descr02     | Anniversary           | Birthday              | Christmas             | Graduation            | Mother's Day
descr03     | Chanukah              | Christmas             | Housewarming          | Just Because          | Thank You
descr04     | Baby Shower           | Birthday              | Cinco de Mayo         | Gametime              | Just Because
descr05     | Anniversary           | Birthday              | Christmas             | Graduation            | Mother's Day
descr01, descr02, ... stand for the text descriptions (I have just put short names here; in the real dataset each one is a full text description), and so on.
In the above dataset I have a single independent variable containing the text description and five dependent categorical variables.
I tried a Random Forest classifier, which accepts multiple dependent variables (multi-output) as targets.
One example from the dataset:
attributes.occasion.0 | attributes.occasion.1 | attributes.occasion.2 | attributes.occasion.3 | attributes.occasion.4
Back to School        | Birthday              | School Events         | NaN                   | NaN
Description:
Cafepress Personalized 5th Birthday Cowgirl Kids Light T-Shirt:100 percent cotton Youth T-Shirt by Hanes,Preshrunk, durable and guaranteed
Below is the code that I have tried:
## Imports (on older scikit-learn versions train_test_split lives in sklearn.cross_validation instead)
import numpy as np
import scipy.sparse
import joblib
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.ensemble import RandomForestClassifier

## Split the dataset (X is kept as a DataFrame so the helpers below can iterate over its columns)
X_train, X_test, y_train, y_test = train_test_split(
    df[['Description']],
    df[['attributes.occasion.0', 'attributes.occasion.1', 'attributes.occasion.2',
        'attributes.occasion.3', 'attributes.occasion.4']],
    test_size=0.3, random_state=0)

## Apply the model
# hashing + tf-idf pipeline for the text column
# (newer scikit-learn versions use alternate_sign=False instead of non_negative=True)
tfidf = Pipeline([('vect', HashingVectorizer(ngram_range=(1, 7), non_negative=True)),
                  ('tfidf', TfidfTransformer()),
                  ])
# text columns to turn into features (only the description in this dataset)
cols_to_retain = ['Description']

def feature_combine(dataset):
    """Fit the tf-idf pipeline on each retained text column and stack the results."""
    Xall = []
    for col in cols_to_retain:
        if col != 'item_id' and col != 'last_updated_at':
            Xall.append(tfidf.fit_transform(dataset[col].astype(str)))
    joblib.dump(tfidf, "tfidf.sav")   # save the fitted vectorizer for later use
    Xspall = scipy.sparse.hstack(Xall)
    return Xspall
def test_Data_text_transform_and_combine(dataset):
    """Transform the test data with the already-fitted tf-idf pipeline and stack the results."""
    Xall = []
    for col in cols_to_retain:
        if col != 'item_id' and col != 'last_updated_at':
            Xall.append(tfidf.transform(dataset[col].astype(str)))
    Xspall = scipy.sparse.hstack(Xall)
    return Xspall
# RandomForestClassifier natively supports multi-output targets, so y_train can have several columns
text_clf = RandomForestClassifier()
text_clf.fit(feature_combine(X_train), y_train)
RF_predicted = text_clf.predict(test_Data_text_transform_and_combine(X_test))
# comparing against the DataFrame y_test gives one accuracy value (in %) per label column
np.mean(RF_predicted == y_test) * 100
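For clarity, the per-label accuracies shown in the output below can also be read out explicitly, one column at a time (just a small sketch using RF_predicted and y_test from above):

from sklearn.metrics import accuracy_score

# one accuracy value per dependent (label) column
for i, col in enumerate(y_test.columns):
    print(col, accuracy_score(y_test[col], RF_predicted[:, i]) * 100)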
I got the output below when I calculated the accuracy measure, but I don't know how to interpret this result or how to plot the confusion matrix and other performance measures.
Output:
Accuracy for each dependent variable:
attributes.occasion.0 87.517672
attributes.occasion.1 96.050306
attributes.occasion.2 98.362394
attributes.occasion.3 99.184142
attributes.occasion.4 99.564090
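What I would like on top of this is a per-label breakdown such as a confusion matrix and precision/recall for every occasion column. Something along these lines is what I have in mind (only a sketch, again using RF_predicted and y_test from above):

from sklearn.metrics import confusion_matrix, classification_report

# one confusion matrix and report per dependent (label) column
for i, col in enumerate(y_test.columns):
    print("=== %s ===" % col)
    print(confusion_matrix(y_test[col].astype(str), RF_predicted[:, i].astype(str)))
    print(classification_report(y_test[col].astype(str), RF_predicted[:, i].astype(str)))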
Could anyone tell me how to deal with a multi-label problem like this and how to evaluate the model performance? What are the possible approaches in such a case? I am using the Python scikit-learn library.
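One direction I have come across (but am not sure applies here) is to treat the occasion columns of each row as one set of labels, binarize them, and train a one-vs-rest classifier. Roughly like the sketch below, where occasion_cols, label_sets, Y and clf are just placeholder names I made up and tfidf is the pipeline from my code above. Is that the right way to frame this, or is the multi-output Random Forest above already fine?

from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier

occasion_cols = ['attributes.occasion.0', 'attributes.occasion.1', 'attributes.occasion.2',
                 'attributes.occasion.3', 'attributes.occasion.4']
# one set of occasions per row, ignoring the NaN placeholders
label_sets = df[occasion_cols].apply(lambda row: set(row.dropna()), axis=1)

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(label_sets)                        # binary indicator matrix, one column per occasion
X = tfidf.fit_transform(df['Description'].astype(str))   # tf-idf features from the description text

clf = OneVsRestClassifier(RandomForestClassifier())
clf.fit(X, Y)
# evaluation in this framing would then use multilabel metrics such as
# hamming_loss or f1_score(average='micro') on a held-out split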
Thanks, Niranjan