
I am a beginner with Word2Vec and have just started studying it from online resources. I have gone through almost all the related questions on Quora and Stack Overflow but didn't find an answer to mine anywhere. So my questions are:

  1. Is it possible to apply word2vec to plagiarism detection?
  2. If yes, will word2vec be more effective for text-based plagiarism detection than WordNet or other word embeddings like GloVe, fastText, etc.?

Thanks in advance.

amp1590
    Such questions often don't have a single answer. If you have a task and a dataset, you apply various methods, choose what works best, and try to understand the shortcomings of the others. It also seems you're asking this question because you don't yet have a full picture of how these methods work, or of what the plagiarism-detection task involves. So instead of looking for an answer on SO or Quora, it may be better to read some basic ML/NLP books or follow online courses – they introduce topics gradually, so they can be easier to learn from. – Mikhail Korobov Jun 27 '17 at 18:08

1 Answer


Yes, these "dense embedding" models of word meaning like word2vec may be useful in plagiarism detection. (They're also likely useful in obfuscating plagiarism from simple detectors, as they can assist automated transforms on existing text that change the words while keeping the meaning similar.)
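
For example, one simple way to use them (a minimal sketch, not necessarily what production detectors do) is to average the word vectors of each passage and compare the averages with cosine similarity. This assumes gensim 4.x; the model path and the 0.9 threshold below are hypothetical placeholders, not tuned values:

```python
# Sketch: flag suspiciously similar passages via averaged word2vec vectors.
# Assumes gensim >= 4 and a pretrained vector file (path is illustrative).
import numpy as np
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)  # hypothetical path

def doc_vector(text):
    """Average the vectors of the in-vocabulary words in `text`."""
    words = [w for w in text.lower().split() if w in kv.key_to_index]
    if not words:
        return None
    return np.mean([kv[w] for w in words], axis=0)

def similarity(a, b):
    """Cosine similarity between the averaged vectors of two passages."""
    va, vb = doc_vector(a), doc_vector(b)
    if va is None or vb is None:
        return 0.0
    return float(np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb)))

original = "The quick brown fox jumps over the lazy dog"
suspect = "A fast brown fox leaps over a sleepy dog"
if similarity(original, suspect) > 0.9:  # threshold must be tuned on real data
    print("possible plagiarism")
```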

Only by testing within a particular system, against a quantitative evaluation, will you know for sure how well it works, or whether a particular embedding is better or worse than something like WordNet.
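
To make "testing" concrete, here is one hedged sketch of such an evaluation: score labeled document pairs (1 = plagiarized, 0 = unrelated) with each candidate system and compare them on a standard metric. It reuses the `similarity` helper from the sketch above; the pairs and labels are made-up illustrations, not real data:

```python
# Compare scoring systems on labeled pairs using ROC AUC.
from sklearn.metrics import roc_auc_score

labeled_pairs = [  # (text_a, text_b, is_plagiarized) - toy examples only
    ("the cat sat on the mat", "a cat rested on a rug", 1),
    ("the cat sat on the mat", "stock prices fell sharply today", 0),
]
scores = [similarity(a, b) for a, b, _ in labeled_pairs]
labels = [y for _, _, y in labeled_pairs]
print("ROC AUC:", roc_auc_score(labels, scores))  # higher = better separation
```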

Among word2vec, fastText, and GloVe, results will probably be very similar – they all use roughly the same information (word co-occurrences within a sliding context window) to train maximally predictive word vectors – so they behave very similarly given similar training data.
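
To illustrate that shared recipe, a toy gensim `Word2Vec` run is below, where the `window` parameter is exactly that sliding context window; the corpus and parameters are illustrative only, and a two-sentence corpus will of course produce noisy neighbors:

```python
# Toy word2vec training run; `window` controls the co-occurrence context.
from gensim.models import Word2Vec

sentences = [["students", "copy", "text", "from", "sources"],
             ["detectors", "compare", "text", "against", "sources"]]
model = Word2Vec(sentences, vector_size=50, window=5, min_count=1, epochs=100)

# With a real corpus, nearest neighbors reflect shared contexts.
print(model.wv.most_similar("text", topn=3))
```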

Any differences are subtle – the non-GloVe options might work better for very large vocabularies; fastText is essentially the same as word2vec in some modes, but adds options for either modeling subword n-grams (which can help create better-than-random vectors for future out-of-vocabulary words) or optimizing the vectors for classification problems.
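
The subword behavior is easy to see in gensim's `FastText` implementation: even a word never seen in training gets a vector assembled from its character n-grams. A small sketch, with a made-up corpus and illustrative parameters:

```python
# Demonstrate fastText's out-of-vocabulary handling via subword n-grams.
from gensim.models import FastText

sentences = [["plagiarism", "detection", "compares", "documents"],
             ["embeddings", "capture", "word", "meaning"]]
model = FastText(sentences, vector_size=32, window=3, min_count=1, epochs=50)

# "plagiarisms" never appears in the corpus, but fastText composes a vector
# for it from character n-grams shared with known words like "plagiarism".
print(model.wv["plagiarisms"][:5])
print(model.wv.similarity("plagiarism", "plagiarisms"))  # typically high
```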

But the vectors for known words, given plentiful training data, are going to be very similar in capability if the training processes are similarly meta-optimized for your task.

gojomo