I want to train word2vec and fasttext to get vectors for a specific dataset that I have.

What should my model take as input?

My file is like this:

Customer_4: I want to book a ticket to New York.
Agent_9: Okay, when do you want the tickets for
Customer_4: hmm, wait a sec
Agent_9: Sure
Customer_4: When is the least expensive to fly

Now, how should I prepare my data for word2vec? Does the word2vec model take inter-sentence similarity into account, i.e. should I not prepare the corpus sentence-wise?

tstseby

1 Answer

One way is to first split your document into lines, then split each line into tokens. You end up with a corpus that is a list of lists of tokens (one inner list per line), which you can feed directly into the gensim word2vec model.
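As a rough sketch of that preprocessing, the snippet below turns the sample dialogue from the question into a list of token lists. Stripping the `Customer_4:` / `Agent_9:` speaker labels, lowercasing, and the simple regex tokenizer are my own assumptions, not something the question requires. The helper names `build_corpus` and `train_models` are hypothetical, and the training call assumes gensim 4.x, where the dimension parameter is `vector_size`.

```python
import re

# Sample dialogue copied from the question.
RAW = """\
Customer_4: I want to book a ticket to New York.
Agent_9: Okay, when do you want the tickets for
Customer_4: hmm, wait a sec
Agent_9: Sure
Customer_4: When is the least expensive to fly
"""

# Assumption: speaker labels look like "Name_<digits>:" and are not wanted as tokens.
SPEAKER = re.compile(r"^\s*\w+_\d+:\s*")
# Assumption: a token is a run of lowercase letters, digits, or apostrophes.
TOKEN = re.compile(r"[a-z0-9']+")


def build_corpus(text):
    """Split text into lines, then each line into tokens (list of lists)."""
    corpus = []
    for line in text.splitlines():
        utterance = SPEAKER.sub("", line)
        tokens = TOKEN.findall(utterance.lower())
        if tokens:  # skip empty lines
            corpus.append(tokens)
    return corpus


def train_models(corpus):
    """Train word2vec and fastText on the tokenized corpus (needs gensim 4.x)."""
    # Imported here so the preprocessing above works without gensim installed.
    from gensim.models import FastText, Word2Vec

    # min_count=1 keeps every word; only sensible for a tiny demo corpus.
    w2v = Word2Vec(sentences=corpus, vector_size=100, window=5, min_count=1)
    ft = FastText(sentences=corpus, vector_size=100, window=5, min_count=1)
    return w2v, ft


corpus = build_corpus(RAW)
print(corpus[0])
```

With gensim installed, `w2v, ft = train_models(corpus)` trains both models, and `w2v.wv["ticket"]` looks up a vector. Note that gensim treats each inner list as a separate sentence, so context windows do not span across lines.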

Ahmadov