Right now I am working on a text classification task (trying to predict whether a Twitter response is human- or bot-generated). The task is actually a closed Kaggle competition, and more details, as well as the datasets used, can be found here: enter link description here
My problem is that when I submit my solution on the site I cannot get more than 50% accuracy, even though I have tried several well-known techniques to improve performance. Because of this, I suspect there may be a conceptual error in my code, or that I am applying techniques that are not suitable for this case.
What I have tried so far:

- Using the built-in stop_words list of CountVectorizer.
- Getting rid of features with extremely low and extremely high frequency (I passed max_df=0.3 and min_df=0.05 to the CountVectorizer object; a minimal sketch of that configuration follows this list).
- Using bi-grams.
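For clarity, the frequency-filtering attempt combined those settings roughly like this (only the stop-word and bi-gram settings survive in the full script below):

from sklearn.feature_extraction.text import CountVectorizer

count_vector = CountVectorizer(stop_words='english',
                               ngram_range=(2, 2),
                               max_df=0.3,    # ignore bi-grams appearing in more than 30% of responses
                               min_df=0.05)   # ignore bi-grams appearing in fewer than 5% of responses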
Below you can find my entire code:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
import csv
import numpy
from csv import DictReader
target_values = []
target_values_validation = []
list_response_train = []
list_response_validation = []
predictions = []
ids = []
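# Markup tokens to strip from every response; listed once so the training
# and validation loops below can share them.
markers = ['<first_speaker>', '<second_speaker>', '<third_speaker>',
           '@@ ', '<at>', '<url>', '<number>']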
with open('train.txt') as f:
    reader = DictReader(f, delimiter='\t')
    for row in reader:
        target_values.append(int(row['human-generated']))
        for marker in markers:
            row['response'] = row['response'].replace(marker, '')
        list_response_train.append(row['response'])
# Keep the labels 1-D: scikit-learn estimators expect y of shape (n_samples,)
y_train = numpy.asarray(target_values)

count_vector = CountVectorizer(stop_words='english', ngram_range=(2, 2))
X_train_counts = count_vector.fit_transform(list_response_train)
print(X_train_counts.shape)
print(y_train.shape)

tf_transformer = TfidfTransformer().fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)
print(X_train_tf.shape)

target_names = ['chatbox text', 'human text']
clf = MultinomialNB().fit(X_train_tf, y_train)
with open('validation.txt') as f:
    reader = DictReader(f, delimiter='\t')
    for row in reader:
        target_values_validation.append(int(row['human-generated']))
        for marker in markers:
            row['response'] = row['response'].replace(marker, '')
        list_response_validation.append(row['response'])
y_validation = numpy.asarray(target_values_validation)
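# transform (not fit_transform), so the validation matrix uses the vocabulary
# and tf-idf weights learned from the training set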
X_new_counts = count_vector.transform(list_response_validation)
X_new_tfidf = tf_transformer.transform(X_new_counts)
print(X_new_tfidf.shape)
predicted = clf.predict_proba(X_new_tfidf)
print(predicted)
print(predicted.shape)
print(y_validation.shape)
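# Local sanity check (assuming the validation labels behave like the test
# set): score hard predictions against y_validation before submitting, to
# see whether the ~50% accuracy is reproducible offline.
from sklearn.metrics import accuracy_score
print(accuracy_score(y_validation, clf.predict(X_new_tfidf)))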
m, n = predicted.shape
# column 1 of predict_proba is the probability of the positive class
# (human-generated == 1), which is what the submission asks for
for j in range(m):
    predictions.append(predicted[j][1])
for k in range(len(predictions)):
    ids.append(k)
with open("submit.csv", "a") as f:
writer = csv.writer(f)
for row in zip(id, predictions):
writer.writerow(row)
Every suggestion is highly appreciated.