How can I generate non-English (French, Spanish, Italian) word embeddings from English word embeddings?
What are the best ways to generate high-quality word embeddings for non-English words?
Words may include tokens such as samsung-galaxy-s9.
How can I generate non-English (French, Spanish, Italian) word embeddings from English word embeddings?
Not really, unless you have words that mean exactly the same thing in both languages. If you know the French words for king, queen, woman and man, you can assign each of them the embedding of its English equivalent, and they will show the same syntactic and semantic properties that the English words do. Beyond such one-to-one translations, though, you can't use English embeddings to build embeddings for a different language.
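To make the one-to-one case concrete, here is a minimal sketch of copying English vectors onto their French translations. The three-dimensional vectors, the tiny dictionary, and the helper name `transfer_embeddings` are all made up for illustration; real embeddings would be loaded from a trained model.

```python
# Hypothetical toy data: 3-dimensional English vectors and a tiny
# English-to-French dictionary. Real embeddings have hundreds of dimensions.
english_vectors = {
    "king": [0.8, 0.1, 0.3],
    "queen": [0.7, 0.9, 0.3],
    "man": [0.2, 0.1, 0.6],
    "woman": [0.1, 0.9, 0.6],
}
en_to_fr = {
    "king": "roi",
    "queen": "reine",
    "man": "homme",
    "woman": "femme",
    "phone": "téléphone",  # no English vector available for this entry
}


def transfer_embeddings(source_vectors, dictionary):
    """Give each translated word a copy of its source word's vector.

    Dictionary entries whose source word has no vector are skipped.
    """
    return {
        target: list(source_vectors[source])
        for source, target in dictionary.items()
        if source in source_vectors
    }


french_vectors = transfer_embeddings(english_vectors, en_to_fr)
print(sorted(french_vectors))
```

Only words with an exact dictionary match receive a vector, which is why this approach does not scale to a whole vocabulary.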
What are the best ways to generate high-quality word embeddings for non-English words?
English and non-English words can be treated the same way. Represent your non-English words as strings/tokens and train a word2vec (w2v) model; gensim is a good tool for this. You'll need a large corpus in the language you want, and you train the model on that corpus for a few epochs. Alternatively, look for pre-trained models in the language you need.
Words may include tokens such as samsung-galaxy-s9.
Unless your corpus contains tokens like "samsung-galaxy-s9", your model won't know what they mean. Use a corpus that covers the domain you want to use the embeddings for.
For non-English words, you can try using a bilingual dictionary to translate the English words that already have embedding vectors.
You need a large corpus to generate high-quality word embeddings. For non-English languages, you need to add bilingual constraints to the original w2v loss, using bilingual corpora as input.
You can treat a compound word as a single token or split it into its parts, depending on your application.
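The two options for compound words can be sketched with a small tokenizer. The function name `tokenize`, the `split_compounds` flag, and the regular expression are illustrative choices, not part of any library.

```python
import re


def tokenize(text, split_compounds=False):
    """Lowercase the text and return its tokens.

    Hyphenated compounds are kept as single tokens by default; with
    split_compounds=True they are broken into their parts.
    """
    # A token is a run of word characters, optionally joined by hyphens.
    tokens = re.findall(r"\w+(?:-\w+)*", text.lower())
    if not split_compounds:
        return tokens
    return [part for token in tokens for part in token.split("-")]


sentence = "Le Samsung-Galaxy-S9 est rapide"
whole = tokenize(sentence)
parts = tokenize(sentence, split_compounds=True)
print(whole)  # ['le', 'samsung-galaxy-s9', 'est', 'rapide']
print(parts)  # ['le', 'samsung', 'galaxy', 's9', 'est', 'rapide']
```

Keeping the compound whole gives the product its own vector but requires it to occur often enough in the corpus; splitting it reuses the vectors of the more frequent parts.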