
I'm trying to build a text classifier that determines whether an abstract describes an access-to-care research project. I'm importing a dataset with two fields: Abstract and Accessclass. Abstract is a 500-word description of the project, and Accessclass is 0 for not access-related and 1 for access-related. I'm still in the development stage, but when I looked at the top unigrams and bigrams for the 0 and 1 labels, they were the same, even though the two groups of abstracts are distinctly different in tone. Is there something I'm missing in my code? For example, am I accidentally adding the negative or positive class twice? Any help is appreciated.

import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn import naive_bayes

df = pd.read_excel("accessclasses.xlsx")
df.head()

from io import StringIO
col = ['accessclass', 'abstract']
df = df[col]
df = df[pd.notnull(df['abstract'])]
df.columns = ['accessclass', 'abstract']
df['category_id'] = df['accessclass'].factorize()[0]
category_id_df = df[['accessclass', 'category_id']].drop_duplicates().sort_values('category_id')
category_to_id = dict(category_id_df.values)
id_to_category = dict(category_id_df[['category_id', 'accessclass']].values)
df.head()

from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(sublinear_tf=True, min_df=4, norm='l2', encoding='latin-1', ngram_range=(1, 2), stop_words='english')
features = tfidf.fit_transform(df.abstract).toarray()
labels = df.category_id
print(features.shape)

from sklearn.feature_selection import chi2
import numpy as np
N = 2  # number of top terms to print per class
for accessclass, category_id in sorted(category_to_id.items()):
   # Chi-squared score of every feature against "belongs to this class"
   features_chi2 = chi2(features, labels == category_id)
   # Sort feature indices by ascending score, so the strongest terms come last
   indices = np.argsort(features_chi2[0])
   # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
   feature_names = np.array(tfidf.get_feature_names_out())[indices]
   unigrams = [v for v in feature_names if len(v.split(' ')) == 1]
   bigrams = [v for v in feature_names if len(v.split(' ')) == 2]
   print("# '{}':".format(accessclass))
   print("  . Most correlated unigrams:\n. {}".format('\n. '.join(unigrams[-N:])))
   print("  . Most correlated bigrams:\n. {}".format('\n. '.join(bigrams[-N:])))
  • Try setting `ngram_range` argument in the `TfidfVectorizer` to be equal to `(1, 2)`. So, your vectorizer should be `tfidf = TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True, min_df=4, norm='l2', encoding='latin-1', ngram_range=(1, 2), stop_words='english')` – Anwarvic Nov 05 '19 at 17:24
  • I'm getting a keyword argument error because ngram_range(1,2) is in there twice. However, I thought my ngram_range in my code above was equal to (1, 2) already. Maybe I'm missing something? – tenebris silentio Nov 05 '19 at 17:52
  • I'm sorry, I didn't see it – Anwarvic Nov 05 '19 at 18:01
  • could you share a sample of the data? – Anwarvic Nov 05 '19 at 19:40
  • Sure... I've added it here: https://github.com/inthetoast/pythonstuff/blob/master/accessclasses.xlsx – tenebris silentio Nov 05 '19 at 20:14

1 Answer


I think the problem in your code is that `min_df=4` is too high for such a small dataset. In the data you posted, the most common words are stopwords, which `TfidfVectorizer` removes because of `stop_words='english'`. Here are the most frequent words and their counts:

to :  19
and :  11
a :  6
the :  6
are :  6
of :  6
for :  5
is :  4
in :  4
will :  4
access :  4
I :  4
times :  4
healthcare :  3
more :  3
have :  3
with :  3
...

And those are only the unigrams; the bigram counts will be much lower.
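Counts like the ones above could be produced with a few lines of Python. This is only a minimal sketch: `abstracts` is a hypothetical stand-in list (not the data from the question), tokenization is a plain whitespace split, and the helper name `term_and_document_counts` is made up for illustration. It also tracks how many documents each word appears in, since that document frequency is what `min_df` is compared against.

```python
from collections import Counter

# Hypothetical stand-in documents, not the asker's actual abstracts.
abstracts = [
    "wait times limit access to healthcare",
    "access to care and wait times",
    "I enjoyed watching the game",
]

def term_and_document_counts(docs):
    """Return (total occurrences per word, number of documents containing each word)."""
    term_counts = Counter()
    doc_counts = Counter()
    for doc in docs:
        tokens = doc.lower().split()    # naive whitespace tokenization
        term_counts.update(tokens)      # every occurrence counts
        doc_counts.update(set(tokens))  # each document counts at most once
    return term_counts, doc_counts

term_counts, doc_counts = term_and_document_counts(abstracts)
for word, count in term_counts.most_common(5):
    print(word, ":", count, "(in", doc_counts[word], "documents)")
```

A word that occurs many times but only in a few documents can still fall below `min_df`, so the document count is the number worth checking.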

You can solve that by either one of these two options:

  • Setting the `stop_words` argument to None, like so: `stop_words=None`
  • Setting `min_df` lower than 4, for example 1 or 2.

I recommend the second option, because the first will report stopwords as the most correlated terms, which isn't helpful at all. I tried `min_df=1`, and here is the result:

  . Most correlated unigrams:
. times
. access

  . Most correlated bigrams:
. enjoyed watching
. wait times
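To illustrate how `min_df` prunes the vocabulary on a small corpus, here is a rough sketch. The four documents in `docs` and the helper `vocabulary_for` are hypothetical and only serve the illustration; the vectorizer arguments are a subset of the ones used in the question.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in corpus, not the asker's actual abstracts.
docs = [
    "long wait times limit access to healthcare",
    "patients report long wait times at clinics",
    "rural clinics improve access to healthcare",
    "I enjoyed watching the game last night",
]

def vocabulary_for(min_df):
    """Fit a vectorizer with the given min_df and return its terms as a set."""
    vectorizer = TfidfVectorizer(min_df=min_df, ngram_range=(1, 2), stop_words="english")
    vectorizer.fit(docs)
    return set(vectorizer.vocabulary_)

vocab_loose = vocabulary_for(1)   # keep every term
vocab_strict = vocabulary_for(2)  # keep only terms found in at least 2 documents

print("Terms with min_df=1:", len(vocab_loose))
print("Terms with min_df=2:", len(vocab_strict))
print("Dropped by min_df=2:", sorted(vocab_loose - vocab_strict))
```

In this toy corpus a bigram such as "enjoyed watching" occurs in a single document, so it disappears as soon as `min_df` is raised to 2, while "wait times" survives because it occurs in two documents.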
Anwarvic
  • This is helpful, I appreciate the response. Unfortunately, my problem remains that this is identifying 0 (negative) and 1 (positive) unigrams and bigrams as if they were labeled the same. For example, I made sure the word "access" wasn't labeled in any abstract marked as a 0. Yet, it's still identifying it as a correlated unigram. Am I missing something in my code that's causing it to do this? Thanks in advance. – tenebris silentio Nov 08 '19 at 17:02
  • That's because `TfidfVectorizer` has nothing to do with classification... It's just a model to extract features. Use the `features` extracted from `TfidfVectorizer` to train a classifier. A classifier like SVM or logistic regression – Anwarvic Nov 08 '19 at 17:53
  • Thank you for clarifying, I appreciate it. – tenebris silentio Nov 08 '19 at 17:54
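
Following up on the comments above, here is a minimal sketch of training a classifier on the TF-IDF features and reading class-specific terms from its coefficients. The `texts` and `labels` below are hypothetical stand-ins, not the data from the question, and logistic regression is just one of the classifiers suggested. The sketch also checks a property of `chi2` that may explain the identical lists: with only two classes, "is class 0" and "is class 1" are complements of each other, so the chi-squared scores come out the same for both.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import chi2
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-in data, not the asker's abstracts (1 = access-related).
texts = [
    "long wait times limit access to primary care",
    "rural patients lack access to nearby clinics",
    "insurance coverage improves access to care",
    "we enjoyed watching the protein folding simulation",
    "the study measured enzyme activity in mice",
    "we enjoyed modeling galaxy formation",
]
labels = np.array([1, 1, 1, 0, 0, 0])

vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
X = vectorizer.fit_transform(texts)

# For a binary target, scoring against "is class 0" or "is class 1"
# gives the same chi-squared values, hence the same ranking of terms.
scores_for_0 = chi2(X, (labels == 0).astype(int))[0]
scores_for_1 = chi2(X, (labels == 1).astype(int))[0]
print("chi2 scores identical:", np.allclose(scores_for_0, scores_for_1))

# A linear classifier gives signed weights: positive values point toward
# class 1, negative values toward class 0.
clf = LogisticRegression()
clf.fit(X, labels)

# Order the terms by column index so they line up with clf.coef_[0].
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
weights = dict(zip(terms, clf.coef_[0]))

ranked = sorted(weights, key=weights.get)
print("Terms pointing to class 0:", ranked[:3])
print("Terms pointing to class 1:", ranked[-3:])
```

Unlike the chi-squared score, which only measures how strongly a term depends on the label, the sign of a coefficient indicates which class the term points to.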