
I am using Facebook's fastText for text classification. I would like to know how the fastText library handles numbers in a text string provided as input for word vectorization.

  1. Does fastText cast each number to a string before creating word vectors?

    E.g. 1124 to " 1124 "

  2. Or is some other transformation/preprocessing performed in the background before training?

    E.g. 1124 to " one one two four "

What is the best approach to handling numerical data if my input text to fastText contains numbers?

DK818

1 Answer


Fasttext doesn't do any preprocessing of numeric tokens. They are treated like other whitespace-separated "words".

Unless you already have a specific problem with fasttext and numbers in your input, I wouldn't worry about what fasttext does with the numbers. Just use it as normal.

If you have a lot of numbers and they're causing problems (which is possible, since fasttext is unlikely to learn useful vectors for most specific numbers), you can preprocess your input to replace them with <NUMBER> or another placeholder token. That way these two sentences will look identical to fasttext:

  1. I ate 1023 oranges.
  2. I ate 1024 oranges.

Whether you want to treat those as the same or not depends on your application.
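
If you do go the placeholder route, here is a minimal sketch of the preprocessing step in plain Python, run on your text before it goes into fasttext. The `mask_numbers` helper and the regex are my own illustration, not part of fasttext, and the pattern assumes "number" means a standalone run of digits with optional `.` or `,` separated groups; adjust it if you need to handle dates, negative numbers, or IDs differently.

```python
import re

# Hypothetical pattern (an assumption, not anything fasttext defines):
# a standalone run of digits, optionally with "." or "," separated groups,
# that is not glued to letters (so "abc123" and "2nd" are left alone).
NUMBER_RE = re.compile(r"(?<!\w)\d+(?:[.,]\d+)*(?!\w)")


def mask_numbers(text, placeholder="<NUMBER>"):
    """Replace every standalone number in `text` with `placeholder`."""
    return NUMBER_RE.sub(placeholder, text)


if __name__ == "__main__":
    for line in ["I ate 1023 oranges.", "I ate 1024 oranges."]:
        print(mask_numbers(line))
```

Both example sentences come out as the same string, so fasttext sees a single shared token instead of two rare ones. Apply the same function to your training file and to anything you later pass to the trained model, otherwise the inputs won't match what the model saw.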

polm23