
I'm trying out the Quora Insincere Questions Classification competition (late submission), but I'm running into an error I can't figure out. Here are the relevant parts of my code:

def loss(predict, observed):
  a = predict*observed
  b = predict+observed
  return 2*(a/b)

train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')

train = train.iloc[0:5000, :]
test = test.iloc[0:1000, :]

qid = test['qid']

train = train.drop('qid', axis=1)
test = test.drop('qid', axis=1)

x_train, x_val, y_train, y_val = train_test_split(train['question_text'], train['target'])

count = CountVectorizer(stop_words='english', ngram_range=(1,1), min_df=1, #tokenizer=LemmaTokenizer()
                       )
tfidf = TfidfVectorizer(stop_words='english', ngram_range=(1,1), min_df=1, #tokenizer=LemmaTokenizer()
                       )

count.fit(list(x_train), list(x_val))
x_train_count = count.transform(x_train)
x_val_count = count.transform(x_val)

logistic = LogisticRegression()
logistic.fit(x_train_count, y_train)
predictions = logistic.predict_proba(x_val_count)
print("loss: %0.3f " %loss(predictions, y_val))

When I run it, I get this error:

ValueError: operands could not be broadcast together with shapes (1250,2) (1250,)

I know roughly why I got the error: the two arrays have incompatible shapes, so they can't be broadcast together. But some of the dimensions don't make sense to me:

x_val_count.shape - (1250, 8411) I assume this is the expanded array of questions (1250 validation examples) in numerical form. But the beginning of the printed array looks like this:

  (0, 1057) 1
  (0, 4920) 1
  (0, 5563) 1
  (1, 2894) 1
  (1, 3403) 1
  (2, 3311) 1
  (3, 1386) 1
  (3, 1646) 1
  (4, 3207) 1
  (4, 3330) 1
  (4, 6111) 1
  (5, 2346) 1
  (5, 4148) 1
  (5, 4441) 1
  (5, 5223) 1
  (5, 5316) 1
  (5, 5378) 1
  (5, 5565) 2
  (5, 7571) 1
  (6, 746)  2
  (6, 983)  1
  (6, 985)  1
  (6, 3182) 1
  (6, 3455) 1
  (6, 4636) 1

That just looks like it has two columns. Why this discrepancy?

predictions.shape - (1250, 2) I don't know why the predictions have two columns. Why not one?

I'm hoping that if I understand these shapes, I'll be able to fix the problem. Does anyone know how?

Ronan Venkat

1 Answer


There are a couple of questions there, so I'll try to answer them one by one.

x_val_count.shape - (1250, 8411) indicates that there are 1250 samples and 8411 features (8411 is the size of your vocabulary). However, scikit-learn's vectorizers store the data as a sparse matrix (only the indices and counts of the non-zero entries) for efficiency. This is because the feature matrix is mostly zeros: a single document - in your case a Quora question - typically contains only a tiny fraction of the vocabulary. If you want to convert it to a regular dense matrix, you can call x_val_count.toarray(), but you may run out of memory, because the result would be a full 1250 x 8411 array. The output

(0, 1057) 1
(0, 4920) 1
(0, 5563) 1

can be read as "document 0 has 3 distinct words, each occurring once." If you are curious which words they are, look them up in the count.vocabulary_ dictionary, where the words are the keys and the column indices (1057, 4920, ...) are the values.
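As a minimal sketch of that lookup (using a toy corpus rather than the Quora data), you can invert vocabulary_ and walk the non-zero entries of the sparse matrix:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the dog barked"]
vec = CountVectorizer()
X = vec.fit_transform(docs)  # sparse matrix, shape (2, vocabulary size)

# vocabulary_ maps word -> column index; invert it to map index -> word
inv_vocab = {idx: word for word, idx in vec.vocabulary_.items()}

# walk the non-zero entries, mirroring the "(row, col) count" printout above
coo = X.tocoo()
for row, col, count in zip(coo.row, coo.col, coo.data):
    print(f"doc {row}: {inv_vocab[col]!r} appears {count} time(s)")
```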

For your second question, regarding predictions.shape - (1250, 2): you get 2 columns because you called predict_proba() on LogisticRegression, which returns one probability per class (in your case 2 classes, so each row holds the probability of class 0 and of class 1). If you just want the predicted labels, call predict() instead; if you want the probability of the positive class, take the second column, predictions[:, 1], which has shape (1250,) and will broadcast against y_val.
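A small sketch of the difference, on a toy dataset rather than the competition data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
clf = LogisticRegression().fit(X, y)

proba = clf.predict_proba(X)   # shape (4, 2): one column per class, rows sum to 1
pos = proba[:, 1]              # probability of the positive class, shape (4,)
labels = clf.predict(X)        # hard 0/1 labels, shape (4,)
```

The column order follows clf.classes_, so for 0/1 targets the positive class is the second column.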

Sadi