
I have a dataset like this:

    Description  attributes.occasion.0  attributes.occasion.1  attributes.occasion.2  attributes.occasion.3  attributes.occasion.4
    descr01      Chanukah               Christmas              Housewarming           Just Because           Thank You
    descr02      Anniversary            Birthday               Christmas              Graduation             Mother's Day
    descr03      Chanukah               Christmas              Housewarming           Just Because           Thank You
    descr04      Baby Shower            Birthday               Cinco de Mayo          Gametime               Just Because
    descr05      Anniversary            Birthday               Christmas              Graduation             Mother's Day

Here `descr01`, `descr02`, and so on stand for a text description about the occasions (I have only put short names here; in the real data set each one is a full text description).

In the above data set I have a single independent variable, which holds the text description, and 5 dependent categorical variables (`attributes.occasion.0` to `attributes.occasion.4`).

I tried a Random Forest classifier, which accepts multiple dependent variables as the target.

One example from the data set:

    attributes.occasion.0   attributes.occasion.1   attributes.occasion.2   attributes.occasion.3   attributes.occasion.4
    Back to School                Birthday               School Events           NaN                      NaN


Description:

Cafepress Personalized 5th Birthday Cowgirl Kids Light T-Shirt:100 percent cotton Youth T-Shirt by Hanes,Preshrunk, durable and guaranteed

Below is the code that I have tried:

    import joblib
    import scipy.sparse
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline

    # Text columns used as features (only 'Description' in this data set)
    cols_to_retain = ['Description']
    target_cols = ['attributes.occasion.0', 'attributes.occasion.1',
                   'attributes.occasion.2', 'attributes.occasion.3',
                   'attributes.occasion.4']

    ## Split the dataset
    X_train, X_test, y_train, y_test = train_test_split(
        df[cols_to_retain], df[target_cols], test_size=0.3, random_state=0)

    ## Apply the model
    # alternate_sign=False replaces the non_negative=True option,
    # which newer scikit-learn versions no longer accept
    tfidf = Pipeline([
        ('vect', HashingVectorizer(ngram_range=(1, 7), alternate_sign=False)),
        ('tfidf', TfidfTransformer()),
    ])

    def feature_combine(dataset):
        # Fit the TF-IDF pipeline on each training text column and stack the results
        Xall = []
        for col in cols_to_retain:
            if col != 'item_id' and col != 'last_updated_at':
                Xall.append(tfidf.fit_transform(dataset[col].astype(str)))
        joblib.dump(tfidf, "tfidf.sav")
        return scipy.sparse.hstack(Xall)

    def test_Data_text_transform_and_combine(dataset):
        # Transform the test text columns with the already fitted pipeline
        Xall = []
        for col in cols_to_retain:
            if col != 'item_id' and col != 'last_updated_at':
                Xall.append(tfidf.transform(dataset[col].astype(str)))
        return scipy.sparse.hstack(Xall)

    text_clf = RandomForestClassifier()
    _ = text_clf.fit(feature_combine(X_train), y_train)

    RF_predicted = text_clf.predict(test_Data_text_transform_and_combine(X_test))

    # Per-column accuracy in percent
    (RF_predicted == y_test).mean() * 100

I got the output below when I calculated the accuracy measure, but I don't know how to interpret this result, or how to plot the confusion matrix and other performance measures.

Output:

Accuracy (%) for each dependent variable:

attributes.occasion.0    87.517672
attributes.occasion.1    96.050306
attributes.occasion.2    98.362394
attributes.occasion.3    99.184142
attributes.occasion.4    99.564090
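
For reference, the per-column accuracies above could be broken down further. The block below is only a minimal sketch under assumptions: `y_true` and `y_pred` are small made-up stand-ins for `y_test` and `RF_predicted`, only two output columns are used, and missing occasions are assumed to be encoded as the placeholder string `'None'`. It computes the accuracy of each column, the exact-match ratio over whole rows, and one confusion matrix plus classification report per column.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical stand-ins for y_test and RF_predicted (two output columns only)
cols = ['attributes.occasion.0', 'attributes.occasion.1']
y_true = pd.DataFrame([['Birthday', 'Christmas'],
                       ['Chanukah', 'Christmas'],
                       ['Birthday', 'None'],
                       ['Chanukah', 'None']], columns=cols)
y_pred = np.array([['Birthday', 'Christmas'],
                   ['Birthday', 'Christmas'],
                   ['Birthday', 'None'],
                   ['Chanukah', 'Christmas']], dtype=object)

# Accuracy of each output column, scored independently of the others
per_column = {c: accuracy_score(y_true[c], y_pred[:, i])
              for i, c in enumerate(cols)}

# Exact-match ratio: a row counts only if every column is predicted correctly
exact_match = float((y_true.to_numpy(dtype=object) == y_pred).all(axis=1).mean())

# One confusion matrix per output column (rows = true label, columns = predicted)
matrices = {}
for i, c in enumerate(cols):
    labels = sorted(set(y_true[c]) | set(y_pred[:, i]))
    matrices[c] = confusion_matrix(y_true[c], y_pred[:, i], labels=labels)
    print(c, labels)
    print(matrices[c])
    print(classification_report(y_true[c], y_pred[:, i], zero_division=0))

print(per_column, exact_match)
```

If the later columns are mostly empty in the real data (as the `NaN` values in the example row suggest might be the case), a high accuracy there could largely reflect the most frequent value, so the per-class precision and recall in the report are probably more informative than accuracy alone.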

Could anyone tell me how to deal with a multi-label problem and how to evaluate the model's performance? What are the possible approaches in such a case? I am using the Python scikit-learn library.

Thanks, Niranjan

niranjan
  • Can you describe, in general, what you are trying to do? I can see from the code what you are doing, but I'm not sure how that makes sense. In particular, why do you want to predict attributes 1 - 4 from the Description? Shouldn't it be the other way around? – miraculixx Sep 21 '16 at 14:19
  • OK, if I am not wrong it's a case of a multi-class, multi-label classification problem. My independent column, called description, contains a description of the occasion. The dependent columns contain different types of occasions. In my case there are 5 dependent columns and each one has multiple classes (occasions in my case). My requirement is to build a model that can predict both the class and the label. Here are the scikit-learn docs: http://scikit-learn.org/stable/modules/multiclass.html. – niranjan Sep 22 '16 at 05:31
  • The other way, I think, is to find the patterns and extract them using regular expressions, but this is not robust, as we would first need to find all possible patterns in the description and then use regex to extract them. – niranjan Sep 22 '16 at 05:44
  • Please give an example of what you want this to do. From your code and description it looks like you want to return attributes 1 - 4 by giving the algorithm inputs like `descr01`, `descr02` etc. If so, that's easier and more efficiently solved by some sort of table lookup. It would really help if you described the problem you want to solve (not the algorithm you are using). – miraculixx Sep 22 '16 at 08:36
  • My apologies. My data set has dependent columns (attributes.occasion.0,1,2,3,4) and one independent column (description). I want to build a model to classify my text description into the relevant class and label. Consider this a multi-class, multi-output classification problem. – niranjan Sep 22 '16 at 09:18
  • I'm not expressing myself clearly - I would like to understand the _business problem_, not your data, not your algorithm... In other words _why_ do you want to build a model? What is the model supposed to do for your company, client, project? – miraculixx Sep 22 '16 at 11:32
  • So for each description, you want to predict 5 values? One for each 'attributes.occasion'? – Stergios Sep 29 '16 at 13:49
  • Yes. It's a multi-label classification. I used RandomForestClassifier and it is giving me an output. I edited my question with the code. But I don't know how to interpret the result. – niranjan Oct 04 '16 at 11:44
  • @miraculixx My sincere apologies for the delay, as I got caught up in something else. OK, let me explain why I want to build a model. I have descriptions of the products in the form of a title and short and long descriptions. I have many attributes for a particular product, like occasion (the attribute in my case), that is, on what occasion we sell the particular product or give it to people. So my task is this: I only have the description of the product, and I want to build an intelligent system, model, or script which gives me the desired output by using the description. – niranjan Oct 04 '16 at 11:55
  • @Stergios Is there any better way to tackle this problem? I would really appreciate it. – niranjan Oct 04 '16 at 12:06
  • So you basically get a product (description) and now you want to know on which occasions this product is most likely to be sold? – miraculixx Oct 04 '16 at 12:50
  • Yes. And the case is that one product could be sold, used, or gifted on more than one occasion. This information about the occasion is included in the product descriptions, but in a descriptive way. For example, this is the title of the product: `Beetlejuice Adult Halloween Costume Standard`, and the product type is `Fancy-Dress Costumes`. – niranjan Oct 04 '16 at 12:59
  • So the occasion is `Halloween`. I edited the question with one more example. – niranjan Oct 04 '16 at 13:08
  • It is very hard to find any particular pattern with which I could identify the desired output. That is why I tried to train a model, so it could predict the desired output by learning from the text (description) of the product. – niranjan Oct 04 '16 at 13:15
  • From what I understand you already _know_ the full description and the occasions for each product. If so, this sounds more like a table lookup problem, i.e. build the table that you use for training your model, then when you get a description just look up the occasions from this table. That's what your code does already, kind of, but it requires a lot of overhead when a simple table lookup seems sufficient. – miraculixx Oct 04 '16 at 15:49
  • .... unless of course you want to be able to get fuzzy user input and from that get the actual product (e.g. `Betteljuic` => `Beetlejuice`). Then the Y should be the correct spelling of the product or some product id which you can then use to look up the occasions. Does that make sense? As for accuracy, what this tells you is that, of the items in your test dataset, x% were identified correctly by the model. – miraculixx Oct 04 '16 at 15:49
  • You can check the 'binary relevance' method, i.e. you build a separate True/False classifier for each possible occasion. Or read about the 'classifier chain' method (the output of each of the above binary classifiers is given as input to the next one). With the 2nd method it is somewhat possible to detect correlations between different occasions (e.g. products sold on Halloween are also sold on Parties). – Stergios Oct 05 '16 at 07:51
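
The two approaches named in the last comment could look roughly like the following in scikit-learn. This is only a sketch under assumptions: the descriptions and occasion sets are made-up toy data, the five occasion columns are assumed to have been collapsed into one set of occasions per product, and `LogisticRegression` is an arbitrary choice of base classifier, not something taken from the question.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.multioutput import ClassifierChain
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical toy data: each description carries a set of occasions
descriptions = [
    "personalized 5th birthday cowgirl kids t-shirt",
    "adult halloween costume standard",
    "birthday party balloons and banner",
    "halloween party pumpkin decoration",
    "christmas tree ornament gift set",
    "christmas and birthday gift card",
]
occasions = [
    {"Birthday"}, {"Halloween"}, {"Birthday", "Party"},
    {"Halloween", "Party"}, {"Christmas"}, {"Christmas", "Birthday"},
]

# Turn the label sets into a binary indicator matrix, one column per occasion
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(occasions)
X = TfidfVectorizer().fit_transform(descriptions)

# Binary relevance: an independent True/False classifier per occasion
binary_relevance = OneVsRestClassifier(LogisticRegression(max_iter=1000))
binary_relevance.fit(X, Y)

# Classifier chain: each classifier gets the earlier labels as extra features
# (true labels while fitting, the chain's own predictions when predicting)
chain = ClassifierChain(LogisticRegression(max_iter=1000), random_state=0)
chain.fit(X, Y)

br_pred = binary_relevance.predict(X)
chain_pred = chain.predict(X)
print(mlb.classes_)
print(br_pred)
print(chain_pred)
```

Predicted indicator rows can be mapped back to occasion names with `mlb.inverse_transform(br_pred)`. On real data the predictions would of course be made on a held-out test split rather than on the training descriptions as done here.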
