How to prepare data for word2vec in gensim and fasttext?

Question

I want to train word2vec and fasttext to get vectors for a specific dataset that I have.

What should my model take as input?

My file is like this:

Customer_4: I want to book a ticket to New York.
Agent_9: Okay, when do you want the tickets for
Customer_4: hmm, wait a sec
Agent_9: Sure
Customer_4: When is the least expensive to fly

Now, how should I prepare my data for word2vec? Does the word2vec model take inter-sentence similarity into account, i.e., should I not prepare the corpus sentence by sentence?

score 1 · Answer 1 · answered Oct 28 '18 at 23:51

One way is to first split your document into lines and then split each line into tokens. You end up with a corpus that is a list of lists of tokens (one inner list per line), which you can feed into the gensim word2vec model.
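A minimal sketch of that preprocessing, using the dialogue from the question. The helper names (`line_to_tokens`, `build_corpus`, `train_models`) are made up for illustration. Stripping the `Customer_4:` / `Agent_9:` speaker labels and lowercasing are assumptions about what you want, not something word2vec requires. The training call assumes gensim 4.x, where the dimensionality parameter is `vector_size` (it was `size` in 3.x). gensim treats each inner list as one training text, so context windows do not cross from one line into the next.

```python
import re

RAW_DIALOGUE = """\
Customer_4: I want to book a ticket to New York.
Agent_9: Okay, when do you want the tickets for
Customer_4: hmm, wait a sec
Agent_9: Sure
Customer_4: When is the least expensive to fly
"""

# Assumption: a speaker label looks like "Customer_4: " or "Agent_9: ".
SPEAKER_PREFIX = re.compile(r"^\w+:\s*")


def line_to_tokens(line):
    """Drop the speaker label, lowercase, and split into word tokens."""
    text = SPEAKER_PREFIX.sub("", line.strip())
    return re.findall(r"[a-z0-9']+", text.lower())


def build_corpus(raw_text):
    """Return a list of token lists, one per non-empty line."""
    corpus = []
    for line in raw_text.splitlines():
        tokens = line_to_tokens(line)
        if tokens:
            corpus.append(tokens)
    return corpus


def train_models(corpus):
    """Train Word2Vec and FastText on the corpus (requires gensim 4.x)."""
    from gensim.models import FastText, Word2Vec

    # min_count=1 keeps every word; only sensible for a tiny demo corpus.
    w2v = Word2Vec(sentences=corpus, vector_size=100, window=5, min_count=1)
    ft = FastText(sentences=corpus, vector_size=100, window=5, min_count=1)
    return w2v, ft


corpus = build_corpus(RAW_DIALOGUE)
```

With gensim installed, `w2v, ft = train_models(corpus)` trains both models, and `w2v.wv["ticket"]` or `ft.wv["ticket"]` returns the vector for a word. On a real dataset you would read the lines from your file instead of a string and likely raise `min_count`.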

Ahmadov

Could you please share an article or code for that? – shamiul97 Nov 18 '19 at 14:05
