I'm trying to use LightGBM for multiclass text classification. My pandas DataFrame has two columns, 'contents' and 'category', as follows.

Dataframe:

    contents               category  
1   this is example1...    A  
2   this is example2...    B  
3   this is example3...    C  

*The actual DataFrame has approximately 600 rows and 2 columns.

I'm trying to classify the text into 3 categories as follows.

Code:

import pandas as pd
import numpy as np

from nltk.corpus import stopwords
stopwords1 = set(stopwords.words('english'))

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer 
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV

import lightgbm as lgbm
from lightgbm import LGBMClassifier, LGBMRegressor


#--main code--#  
X_train, X_test, Y_train, Y_test = train_test_split(df['contents'], df['category'], random_state = 0, test_size=0.3, shuffle=True)

count_vect = CountVectorizer(ngram_range=(1,2), stop_words=stopwords1)
X_train_counts = count_vect.fit_transform(X_train)

tfidf_transformer = TfidfTransformer(use_idf=True, smooth_idf=True, norm='l2', sublinear_tf=True)
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

lgbm_train = lgbm.Dataset(X_train_tfidf, Y_train)
lgbm_eval = lgbm.Dataset(count_vect.transform(X_test), Y_test, reference=lgbm_train)

params = {
    'boosting_type':'gbdt',
    'objective':'multiclass',
    'learning_rate': 0.02,
    'num_class': 3,
    'early_stopping': 100,
    'num_iteration': 2000, 
    'num_leaves': 31,
    'is_enable_sparse': 'true',
    'tree_learner': 'data',
    'max_depth': 4, 
    'n_estimators': 50  
    }

clf_gbm = lgbm.train(params, valid_sets=lgbm_eval)
predicted_LGBM = clf_gbm.predict(count_vect.transform(X_test))

print(accuracy_score(Y_test, predicted_LGBM))

Then I got this error:

ValueError: could not convert string to float: 'b'  

I also tried converting the 'category' column from ['a', 'b', 'c'] to integers [0, 1, 2], but then got this error:

TypeError: Expected np.float32 or np.float64, met type(int64).

What's wrong with my code?
Any advice / suggestions will be greatly appreciated.
Thanks in advance.

SY9
  • Curious: why use a classifier built for categorical data when the features are sparse and non-categorical? – Isbister Nov 27 '18 at 23:07
  • @Isbister This code classifies text using one-hot vectors extracted from thousands of sentences, so the data is sparse. The vectors produced by scikit-learn's CountVectorizer are numerical, since it counts the words in each sentence and stores the counts in the vector. I think this is a somewhat classical but typical approach to text classification. – SY9 Dec 03 '18 at 01:52

1 Answer

I managed to resolve this issue. The fix is simple, but I'm noting it here for reference.

LightGBM expects float32/float64 input, so the 'category' labels must be numbers rather than strings, and the input data should be converted to float32/float64 using .astype().

Change 1:
Added the following 4 lines after X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts). Here X_test_counts is the vectorized test set, e.g. count_vect.transform(X_test).

 X_train_tfidf = X_train_tfidf.astype('float32')
 X_test_counts = X_test_counts.astype('float32')   
 Y_train = Y_train.astype('float32')
 Y_test = Y_test.astype('float32')

Change 2:
Convert the 'category' column from [A, B, C, ...] to [0.0, 1.0, 2.0, ...].
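
The label conversion in change 2 can be sketched as follows. This is a minimal illustration rather than the original code: the `labels` list and the `proba` array are made-up stand-ins for df['category'] and for the output of Booster.predict. With objective='multiclass', that output is, to my understanding, an array of per-class probabilities rather than class labels, so an argmax step is assumed to be needed before calling accuracy_score.

```python
import numpy as np

# Hypothetical labels standing in for df['category'].
labels = ["A", "B", "C", "B", "A"]

# Map each distinct label to a numeric id (sorted for a stable order).
classes = sorted(set(labels))
class_to_id = {c: i for i, c in enumerate(classes)}

# Numeric labels as floats, matching the [0.0, 1.0, 2.0, ...] conversion.
y = np.array([class_to_id[c] for c in labels], dtype=np.float32)

# Made-up probabilities with shape (n_samples, num_class), standing in
# for what a multiclass Booster.predict call is assumed to return.
proba = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.2, 0.6],
])

# Pick the most probable class per row, then map ids back to labels.
pred_ids = proba.argmax(axis=1)
pred_labels = [classes[i] for i in pred_ids]

print(y)
print(pred_labels)
```

Keeping the `classes` list around makes it possible to report predictions in the original label names instead of numeric ids.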

Simply passing dtype=np.float32 to TfidfVectorizer may also work in this case.
Passing the vectorized data to LGBMClassifier is also much simpler.
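
A quick way to check the dtype idea is to compare what the two vectorizers produce. The sketch below uses three made-up documents in place of df['contents'] and assumes a reasonably recent scikit-learn. It may also explain the int64 message in the question, since the validation set and the prediction input there are built from count_vect.transform(X_test), which yields raw integer counts.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up documents standing in for df['contents'].
docs = [
    "this is example one",
    "this is example two",
    "another short example",
]

# CountVectorizer produces integer counts by default.
counts = CountVectorizer().fit_transform(docs)

# TfidfVectorizer accepts a float dtype and returns a sparse matrix of that type.
tfidf32 = TfidfVectorizer(dtype=np.float32, sublinear_tf=True).fit_transform(docs)

print(counts.dtype)
print(tfidf32.dtype)
```

If the count matrix has to be kept, counts.astype(np.float32) gives the same effect as the .astype() calls in change 1.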

Update
Using TfidfVectorizer is much simpler:

tfidf_vec = TfidfVectorizer(dtype=np.float32, sublinear_tf=True, use_idf=True, smooth_idf=True)
# Fit the vocabulary and IDF weights on the training split only, then reuse them for the test split
X_train_tfidf = tfidf_vec.fit_transform(X_train)
X_test_tfidf = tfidf_vec.transform(X_test)

clf_LGBM = lgbm.LGBMClassifier(objective='multiclass', verbose=-1, learning_rate=0.5, max_depth=20, num_leaves=50, n_estimators=120, max_bin=2000)
# verbose is already set in the constructor; recent LightGBM releases no longer accept it in fit()
clf_LGBM.fit(X_train_tfidf, Y_train)
predicted_LGBM = clf_LGBM.predict(X_test_tfidf)
SY9
  • I am still getting this error even after adding those 4 lines you suggested. – phaigeim Jan 18 '19 at 19:18
  • TypeError: Expected np.float32 or np.float64, met type(int64) This is happening during "train" method – phaigeim Jan 22 '19 at 08:15
  • @phaigeim Have you tried using TfidfVectorizer instead of CountVectorizer + TfidfTransformer? TfidfVectorizer supports the dtype option, so you can set the data type directly. See the Update in my answer above. If that doesn't resolve your issue, convert your dataset to float directly before vectorizing the data. – SY9 Jan 26 '19 at 01:21