
I'm using the official FastText Python library (v0.9.2) for intent classification.

import fasttext

model = fasttext.train_supervised(input='./test.txt',
  loss='softmax',
  dim=200,
  bucket=2000000,
  epoch=25,
  lr=1.0)

Here, test.txt contains just one training sample:

__label__greetings hi

When I predict two utterances, the results are:

print(model.words)
print('hi', model.predict('hi'))
print('bye', model.predict('bye'))
app_1  | ['hi']
app_1  | hi (('__label__greetings',), array([1.00001001]))
app_1  | bye ((), array([], dtype=float64))

This is my expected output. However, if I set two samples for the same label:

__label__greetings hi
__label__greetings hello

The result for the OOV utterance is incorrect:

app_1  | ['hi', '</s>', 'hello']
app_1  | hi (('__label__greetings',), array([1.00001001]))
app_1  | bye (('__label__greetings',), array([1.00001001]))

I understand that the problem is with the </s> token (maybe from the \n in the text file?): when no word of the input is in the vocabulary, the whole text is reduced to </s>. Is there any training option or other way to skip this behavior?
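
In case it's useful, the workaround I'm considering on my side is to skip prediction entirely when no token is in the vocabulary (a sketch, assuming model.words returns the training vocabulary and fasttext.tokenize matches the training tokenization):

vocab = set(model.words)

def predict_known(text):
    # Only call predict() if at least one token was seen during training;
    # otherwise report "no label" explicitly.
    tokens = fasttext.tokenize(text)
    if not any(token in vocab for token in tokens):
        return (), []
    return model.predict(text)

print('hi', predict_known('hi'))    # in-vocabulary -> real prediction
print('bye', predict_known('bye'))  # OOV-only -> no label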

Thanks!

Tzomas

2 Answers


In addition to gojomo's answer: your training dataset is far too small.

If you don't have a significant annotated dataset, you can try zero-shot classification: starting from a pretrained language model, you only define some candidate labels and let the model try to classify your sentences.
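
For instance, a minimal sketch using the Hugging Face transformers pipeline (an assumption on my part; the demo linked below is built on a similar NLI-based approach):

from transformers import pipeline

# Zero-shot: no training data, just candidate labels supplied at inference time.
classifier = pipeline('zero-shot-classification')

result = classifier('hi there, how are you?',
                    candidate_labels=['greetings', 'not greetings'])
print(result['labels'][0], result['scores'][0])  # top label and its score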

Here you can see and test an interesting demo.

Read also this good article about zero-shot classification, with theory and implementation.


FastText is a big, data-hungry algorithm that starts from random initialization. You shouldn't expect its results to be sensible, or to match any set of expectations, on toy-sized datasets, where (for example) 100%-minus-epsilon of your n-gram buckets won't have received any training.

I also wouldn't expect supervised mode to ever reliably predict no labels on realistic datasets – it expects all of its training data to have labels, and I've not seen any mention of it being used to predict an implied 'ghost' category of "not in training data" versus a single known label (as in 'one-class classification').
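
One partial mitigation – though not true one-class classification – is to request the full probability distribution and apply your own confidence cutoff. A sketch, using the model from the question:

labels, probs = model.predict('bye', k=-1)  # k=-1 returns every known label
# Reject the prediction when even the best label is low-confidence;
# the 0.9 cutoff here is arbitrary and application-specific.
if probs.size == 0 or probs[0] < 0.9:
    print('bye', 'no confident label')
else:
    print('bye', labels[0], probs[0])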

(Speculatively, I think you might have to feed FastText supervised mode explicitly __label__not-greetings labeled contrast data – perhaps just synthesized random strings if you've got nothing else – in order for it to have any hope of meaningfully predicting "not-greetings".)
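
A sketch of that idea (purely illustrative – the random strings just stand in for whatever contrast data you can get):

import random
import string

# Write the real greetings samples plus synthesized
# __label__not-greetings contrast data into the training file.
with open('test.txt', 'w') as f:
    f.write('__label__greetings hi\n')
    f.write('__label__greetings hello\n')
    for _ in range(100):
        junk = ' '.join(''.join(random.choices(string.ascii_lowercase, k=5))
                        for _ in range(3))
        f.write('__label__not-greetings ' + junk + '\n')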

Given that, I'd not consider your first result for the input bye correct, nor the second result incorrect. Both are just noise from an undertrained model being asked to make a kind of distinction it isn't known to be capable of making.

gojomo
  • Thank you for your response, I understand your point; FastText is just another text classifier after all. My confusion came from the Node.js binding implementation, which behaves the way I described: after reading your comment I checked its implementation, and prior to prediction it checks whether the tokens are in the vocabulary. Thanks again for your clarification. – Tzomas Dec 10 '20 at 10:18