
Right now I am working on a text classification task (trying to predict whether a Twitter response is human- or bot-generated). The task is actually a closed Kaggle competition; more details, as well as the datasets used, can be found here: enter link description here

My problem is that when I submit my solution on the site, I cannot get more than 50% accuracy, even though I have tried several well-known techniques to improve performance. Because of this, I suspect there is a conceptual error in my code, or that I am applying techniques that are not suitable for my case.
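
Since the task is binary, 50% accuracy is chance level if the two classes are balanced, so one sanity check I can run (using the y_train array built in the code further down) is the label distribution:

    import numpy

    # Counts of class 0 (bot) and class 1 (human); if these are roughly
    # equal, 50% accuracy means the model learns nothing.
    print(numpy.bincount(y_train))
    print(y_train.mean())  # fraction of human-generated responses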

What I have tried so far (each setting is sketched in code right after this list):

  1. Using the built-in stop_words list of CountVectorizer.

  2. Trying to get rid of features with extremely low and high frequency (I passed the max_df=0.3 and min_df=0.05 arguments to the CountVectorizer object).

  3. Using bi-grams.
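
For reference, in isolation the three settings above look roughly like this (the variable names are only for illustration):

    from sklearn.feature_extraction.text import CountVectorizer

    # 1. built-in English stop-word list
    cv_stop_words = CountVectorizer(stop_words='english')

    # 2. drop terms that appear in fewer than 5% or more than 30% of documents
    cv_doc_freq = CountVectorizer(min_df=0.05, max_df=0.3)

    # 3. bi-grams instead of single words
    cv_bigrams = CountVectorizer(ngram_range=(2, 2))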

Below you can find my entire code:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_extraction.text import TfidfTransformer
    from sklearn.naive_bayes import MultinomialNB
    import csv
    import numpy
    from csv import DictReader

    # Dataset placeholder tokens that I strip out before vectorizing.
    TOKENS_TO_STRIP = ['@@ ', '<at>', '<url>', '<number>',
                       '<first_speaker>', '<second_speaker>', '<third_speaker>']

    def clean_response(text):
        for token in TOKENS_TO_STRIP:
            text = text.replace(token, '')
        return text

    def load_file(path):
        labels, responses = [], []
        with open(path) as f:
            reader = DictReader(f, delimiter='\t')
            for row in reader:
                labels.append(int(row['human-generated']))
                responses.append(clean_response(row['response']))
        # MultinomialNB expects a 1-D label array, so no [:, numpy.newaxis].
        return numpy.asarray(labels), responses

    y_train, list_response_train = load_file('train.txt')

    count_vector = CountVectorizer(stop_words='english', ngram_range=(2, 2))
    X_train_counts = count_vector.fit_transform(list_response_train)
    print(X_train_counts.shape)
    print(y_train.shape)

    tf_transformer = TfidfTransformer().fit(X_train_counts)
    X_train_tf = tf_transformer.transform(X_train_counts)
    print(X_train_tf.shape)
    target_names = ['chatbot text', 'human text']

    clf = MultinomialNB().fit(X_train_tf, y_train)

    y_validation, list_response_validation = load_file('validation.txt')

    # Reuse the fitted vectorizer/transformer; only transform, never refit.
    X_new_counts = count_vector.transform(list_response_validation)
    X_new_tfidf = tf_transformer.transform(X_new_counts)
    print(X_new_tfidf.shape)

    predicted = clf.predict_proba(X_new_tfidf)
    print(predicted)
    print(predicted.shape)
    print(y_validation.shape)

    # Column 1 of predict_proba is the probability of class 1
    # (human-generated), following the order in clf.classes_.
    predictions = predicted[:, 1]
    ids = range(len(predictions))

    # 'w' instead of 'a' so that rerunning does not append duplicate rows.
    with open('submit.csv', 'w', newline='') as f:
        writer = csv.writer(f)
        for row in zip(ids, predictions):
            writer.writerow(row)
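
For what it's worth, the validation labels (y_validation) are loaded above but never used for scoring, so a quick local sanity check before submitting could look like this (using the variables from the code above):

    from sklearn.metrics import accuracy_score

    # Hard class predictions on the validation set.
    y_pred = clf.predict(X_new_tfidf)
    print("validation accuracy:", accuracy_score(y_validation, y_pred))

    # predict_proba columns follow clf.classes_, so this confirms that
    # column 1 really is the probability of class 1 (human-generated).
    print("class order:", clf.classes_)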

Every suggestion is highly appreciated.
