I want to train FastText on my own corpus. However, I have a small question before continuing. Does each sentence need to be a separate item in the corpus, or can many sentences go in as one item?

For example, I have this DataFrame:

 text                                               |     summary
 ------------------------------------------------------------------
 this is sentence one this is sentence two continue | one two other
 other similar sentences some other                 | word word sent

Basically, the text column is an article, so it contains many sentences. Because of the preprocessing, I no longer have full stops (.). So the question is: can I do something like this directly, or do I need to split out each sentence?

from sklearn.feature_extraction.text import TfidfVectorizer

docs = df['text']
vectorizer = TfidfVectorizer()
vectorizer.fit_transform(docs)

From the tutorials I read, I need a list of words for each sentence, but what if I have a list of words for a whole article instead? What is the difference? Is this the right way to train FastText on my own corpus?

Thank you!

BlueMango
1 Answer


FastText requires text as its training data - not anything that's pre-vectorized, as if by TfidfVectorizer. (If that's part of your FastText process, it's misplaced.)

The Gensim FastText support requires the training corpus as a Python iterable, where each item is a list of string word-tokens.

Each list-of-tokens is typically some cohesive text, where the neighboring words have the relationship of usage together in usual natural-language. It might be a sentence, a paragraph, a post, an article/chapter, or whatever. Gensim's only limitation is that each text shouldn't be more than 10,000 tokens long. (If your texts are longer than that, they should be fragmented into separate 10,000-or-fewer parts. But don't worry too much about the loss of association around the split points - in training sets sufficiently large for an algorithm like FastText, any such loss-of-contexts is negligible.)
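
For concreteness, a minimal sketch of that setup with Gensim, assuming a pandas DataFrame shaped like the one in the question (the column name, whitespace tokenization, and hyperparameter values are illustrative assumptions, not requirements):

import pandas as pd
from gensim.models import FastText

df = pd.DataFrame({'text': [
    'this is sentence one this is sentence two continue',
    'other similar sentences some other',
]})

# Each article becomes one list of string tokens; plain whitespace
# splitting stands in for whatever tokenizer your preprocessing uses.
corpus = [text.split() for text in df['text']]

# If any article exceeds the 10,000-token-per-text limit, break it into
# consecutive chunks of at most 10,000 tokens.
MAX_TOKENS = 10_000
corpus = [tokens[i:i + MAX_TOKENS]
          for tokens in corpus
          for i in range(0, len(tokens), MAX_TOKENS)]

# Train directly on the iterable of token lists (Gensim 4.x parameter names).
model = FastText(sentences=corpus, vector_size=100, window=5, min_count=1, epochs=10)

print(model.wv['sentence'])               # vector for one in-vocabulary word
print(model.wv.most_similar('sentence'))  # nearest neighbors by cosine similarity

Note that no TfidfVectorizer appears anywhere: the model consumes the raw token lists directly.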

gojomo
  • So, if the text is less than 10k words, is it OK to use it as it is, meaning I don't need to split each sentence into its own list of tokens? Also, is this valid for both FastText and Gensim's FastText? There should not be any difference in the embedding matrix produced by these two methods, right? – BlueMango Oct 16 '21 at 12:02
  • Gensim always needs each of the texts as a list-of-tokens - so you need to split strings into word-lists, always. Gensim (at least through the latest 4.1 releases of 2021) will always ignore tokens past the 10,000th position - but if each text is already smaller than that, you'll never hit, and thus don't need to worry about, that limit. I believe the Python wrapper for Facebook's FastText still only takes a file-on-disk as its corpus – so it's doing its own splitting of each line of the file into words, by the whitespace on each line (a sketch of that file-based route follows after these comments). – gojomo Oct 17 '21 at 21:34
  • Because of both intentional randomization in the algorithm, & unavoidable ordering-randomization in efficient multithreaded implementations, the embeddings produced from any run won't be exactly the same as from any other run. It's not even expected that individual words will be at similar coordinates, from run to run, even using the same library. Rather, the *overall* set of vectors, & their relative distances/directions, should be about-as-useful even with all the jitter of exact positions & rotations/transformations from the training variances. – gojomo Oct 17 '21 at 21:41
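
For comparison, a minimal sketch of the file-based route for Facebook's fasttext Python package mentioned in the comments above (the filename and parameter values are illustrative assumptions):

import fasttext
import pandas as pd

df = pd.DataFrame({'text': [
    'this is sentence one this is sentence two continue',
    'other similar sentences some other',
]})

# Facebook's wrapper trains from a file on disk: one whitespace-separated,
# already-tokenized text per line.
with open('corpus.txt', 'w', encoding='utf-8') as f:
    for text in df['text']:
        f.write(text + '\n')

# minCount=1 only so this tiny toy corpus keeps a vocabulary at all;
# on a real corpus you would likely leave the default.
model = fasttext.train_unsupervised('corpus.txt', model='skipgram', dim=100, minCount=1)
print(model.get_word_vector('sentence'))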