
Hi, last week Facebook announced fastText, which is a way to categorize words into buckets. Latent Dirichlet Allocation (LDA) is another way to do topic modeling. My question is: has anyone done a comparison of the pros and cons of these two?

I haven't tried fastText, but here are a few pros and cons of LDA based on my experience.

Pros

  1. Iterative model, with support for Apache Spark.

  2. Takes in a corpus of documents and does topic modeling.

  3. Not only finds out what a document is talking about but also finds related documents.

  4. The Apache Spark community is continuously contributing to this. Earlier they made it work on MLlib, now on the ML library.

Cons

  1. Stopwords need to be defined well, and they have to be related to the context of the document. For example, "document" is a word with a high frequency of appearance and may top the chart of recommended topics, but it may or may not be relevant, so we need to update the stopword list accordingly.

  2. Sometimes the classification might be irrelevant. In the example below it is hard to infer what this bucket is talking about:

Topic:

  1. Term:discipline

  2. Term:disciplines

  3. Term:notestable

  4. Term:winning

  5. Term:pathways

  6. Term:chapterclosingtable

  7. Term:metaprograms

  8. Term:breakthroughs

  9. Term:distinctions

  10. Term:rescue

If anyone has done research on fastText, can you please share your learnings?

Nabs

1 Answer


fastText offers more than topic modelling: it is a tool for generating word embeddings and for text classification using a shallow neural network. The authors state that its performance is comparable with that of much more complex “deep learning” algorithms, but the training time is significantly lower.

Pros:

=> It is extremely easy to train your own fastText model:

$ ./fasttext skipgram -input data.txt -output model

Just provide your input and output files and the architecture to be used, and that's all. But if you wish to customize your model a bit, fastText provides the option to change the hyper-parameters as well.
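
For example (the flag values below are only illustrative, not recommendations), the vector dimension, learning rate, number of epochs and the character n-gram range can all be set directly on the command line:

$ ./fasttext skipgram -input data.txt -output model -dim 100 -lr 0.05 -epoch 10 -minn 3 -maxn 6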

=> While generating word vectors, fastText takes into account sub-parts of words called character n-grams so that similar words have similar vectors even if they happen to occur in different contexts. For example, “supervised”, “supervise” and “supervisor” all are assigned similar vectors.
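
You can verify this by printing the vectors of such words from the trained binary model (model.bin, produced by the skipgram command above) and comparing them, e.g. with cosine similarity:

$ echo "supervised supervise supervisor" | ./fasttext print-word-vectors model.bin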

=> A previously trained model can be used to compute word vectors for out-of-vocabulary words. This one is my favorite. Even if the vocabulary of your corpus is finite, you can get a vector for almost any word that exists in the world.
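
For instance, querying the binary model with a made-up word that never appeared in the training corpus still returns a vector, assembled from the word's character n-grams:

$ echo "supervisionless" | ./fasttext print-word-vectors model.bin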

=> fastText also provides the option to generate vectors for paragraphs or sentences. Similar documents can be found by comparing the vectors of documents.
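
A minimal sketch, assuming documents.txt holds one document per line (the filename is just an example):

$ ./fasttext print-sentence-vectors model.bin < documents.txt

Each output line is the vector of the corresponding document, which you can then compare (e.g. with cosine similarity) to find similar documents.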

=> The option to predict likely labels for a piece of text has been included too.
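
A minimal sketch of the supervised mode, assuming train.txt and test.txt contain one example per line with the label prefixed by __label__ (e.g. "__label__positive great product, works as expected"); the filenames and labels here are just examples:

$ ./fasttext supervised -input train.txt -output classifier

$ ./fasttext predict classifier.bin test.txt 3

The trailing 3 asks for the three most likely labels per line; omit it to get only the top label.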

=> Pre-trained word vectors for about 90 languages trained on Wikipedia are available in the official repo.

Cons:

=> As fastText is command-line based, I struggled while incorporating it into my project; this might not be an issue for others, though.

=> No in-built method to find similar words or paragraphs.

For those who wish to read more, here are the links to the official research papers:

1) https://arxiv.org/pdf/1607.04606.pdf

2) https://arxiv.org/pdf/1607.01759.pdf

And link to the official repo:

https://github.com/facebookresearch/fastText

Aanchal1103