
You know, POS tags are labels like 'NP' and 'VERB'. How can I combine these features with word2vec?

Like the following vectors?

keyword     V1       V2        V3         V4        V5      V6
corruption  0.07397  0.290874  -0.170812  0.085428  'VERB'  'NP'
people      ...      ...       ...        ...       ...     ...
budget      ...      ...       ...        ...       ...     ...
Wei Chen

2 Answers


A first, naive solution is to simply concatenate the embedding vector with a one-hot encoded vector representing the POS tag.
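For illustration, here is a minimal NumPy sketch of that concatenation; the tag set and the random vector are made-up placeholders:

import numpy as np

POS_TAGS = ["NOUN", "VERB", "ADJ", "ADV"]   # assumed, simplified tag inventory

def pos_one_hot(tag):
    # Build a one-hot vector marking the position of the given tag.
    vec = np.zeros(len(POS_TAGS))
    vec[POS_TAGS.index(tag)] = 1.0
    return vec

word_vec = np.random.rand(300)                              # stand-in for a word2vec vector
combined = np.concatenate([word_vec, pos_one_hot("NOUN")])  # shape: (304,)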

If you want to do something fancier, however, you should find a proper way of weighting these different features.

For example, you could use XGBoost: given a non-normalized set of features (embeddings + POS in your case), it assigns a weight to each of them according to a specific task.
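As a hedged sketch, training an XGBoost classifier on the concatenated features and reading off its per-feature importances could look like this (the data, dimensions, and task are placeholders):

import numpy as np
from xgboost import XGBClassifier

n_samples, emb_dim, n_pos = 1000, 300, 17

# Fake data: word embeddings side by side with one-hot POS features.
pos_onehot = np.eye(n_pos)[np.random.randint(0, n_pos, n_samples)]
X = np.hstack([np.random.rand(n_samples, emb_dim), pos_onehot])
y = np.random.randint(0, 2, n_samples)      # labels for the specific task

clf = XGBClassifier(n_estimators=100)
clf.fit(X, y)

# One importance score per feature dimension (embedding dims + POS dims).
print(clf.feature_importances_.shape)       # (317,)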

As an alternative, you can use neural networks for combining these features into a unique meaningful hidden representation.

Assuming that the context of each word is important in your task, you could do the following (a sketch follows the list):

  • compute word embeddings (N-dimensional)
  • compute POS tags (one-hot encoded vectors)
  • run an LSTM or a similar recurrent layer over the POS sequence
  • for each word, create a representation consisting of its word embedding concatenated with its corresponding output from the LSTM layer
  • use a fully connected layer to create a consistent hidden representation
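
A hedged Keras sketch of these steps; all dimensions and layer sizes are assumptions, and the task-specific output head is omitted:

from tensorflow.keras import layers, Model

SEQ_LEN, EMB_DIM, N_POS_TAGS, HIDDEN = 30, 300, 17, 64   # assumed sizes

# Inputs: precomputed word embeddings and one-hot POS tags, per token.
word_emb = layers.Input(shape=(SEQ_LEN, EMB_DIM), name="word_embeddings")
pos_onehot = layers.Input(shape=(SEQ_LEN, N_POS_TAGS), name="pos_onehot")

# Recurrent layer over the POS sequence, one output per token.
pos_context = layers.LSTM(HIDDEN, return_sequences=True)(pos_onehot)

# Concatenate each word embedding with its corresponding LSTM output,
# then project into a single hidden representation per token.
combined = layers.Concatenate(axis=-1)([word_emb, pos_context])
hidden = layers.Dense(HIDDEN, activation="relu")(combined)

model = Model(inputs=[word_emb, pos_onehot], outputs=hidden)
model.summary()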

P.S.: note that the use of the recurrent layer is not mandatory; you could also try concatenating the POS vector and the embedding directly and then applying the fully connected layer.

alsora

If you want to add the POS tags as features to the embedding vectors, you could simply append them to the NumPy arrays representing the word vectors. But I suspect that such a trick would not work well, because the dimensionality of the word vectors is high and the impact of a single added feature would be low.

Extending word vectors with POS tags is good practice (it can help with polysemy, for example), but usually POS tags are added in a different way: you first annotate your training corpus with POS tags, and then train your model on that corpus (the models in the vectors.nlpl repository are trained in this way). As a result, you should obtain something like this (a minimal sketch of the pipeline follows the table):

keyword          V1       V2        V3         V4
corruption_NOUN  0.07397  0.290874  -0.170812  0.085428
people_NOUN      ...      ...       ...        ...
budget_NOUN      ...      ...       ...        ...
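
A minimal sketch of this annotate-then-train pipeline, assuming spaCy's en_core_web_sm model and gensim are installed (the toy corpus is made up):

import spacy
from gensim.models import Word2Vec

nlp = spacy.load("en_core_web_sm")
corpus = ["The people fight corruption.",
          "Corruption eats the budget."]

# Replace each token with "word_POS" so homographs get separate vectors.
tagged = [[f"{tok.text.lower()}_{tok.pos_}" for tok in nlp(line) if not tok.is_punct]
          for line in corpus]

model = Word2Vec(tagged, vector_size=100, window=5, min_count=1)
print(model.wv["corruption_NOUN"])   # vector for the POS-annotated token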
Amir
  • Thank you for your answer; do you know of any GitHub repos with notebooks? – John Smith Nov 28 '20 at 10:41
  • @Xiaoshi you can get the POS tags using spaCy (https://spacy.io/usage/linguistic-features) and then add them to each word in your word embedding dictionary. – Amir Nov 28 '20 at 11:10