I am using the language model ELMo - https://allennlp.org/elmo to represent my text data as a numerical vector. This vector will be used as training data for a simple sentiment analysis task.
In this case the data is not in English, so I downloaded a custom ELMo model from https://github.com/HIT-SCIR/ELMoForManyLangs (I assume this behaves similarly to the official allennlp repo).
To convert a text document to ELMo embeddings, the function sents2elmo is used. Its argument is a list of tokenized sentences, if I understood the documentation correctly.
So one sample in my training data could be embedded as follows:
from elmoformanylangs import Embedder
embedder = Embedder('custom_language')
embeddings = embedder.sents2elmo([['hello', 'world', 'how', 'are', 'you', '?'],
                                  ['am', 'great', 'thanks', '!']])
This will return a list of two numpy arrays, one per sentence, where each token in a sentence is represented by a vector of size 1024. Since the default value of the output_layer parameter of sents2elmo is -1, each token vector is the average of the model's three internal layers.
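As a quick sanity check (assuming the embedder and embeddings objects from the snippet above), the shapes of the returned arrays can be printed:

# Each returned array should have shape (num_tokens, 1024):
for emb in embeddings:
    print(emb.shape)  # expected: (6, 1024) and (4, 1024)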
How can the embeddings be represented as a single 1D array? Should I just average all the word vectors within each sentence, and then average the resulting sentence vectors?
import numpy as np

sentence_1 = np.mean(embeddings[0], axis=0)  # mean over 6 token vectors -> shape (1024,)
sentence_2 = np.mean(embeddings[1], axis=0)  # mean over 4 token vectors -> shape (1024,)
document = np.mean([sentence_1, sentence_2], axis=0)  # mean of the two sentence vectors -> shape (1024,)
Does this approach destroy any information? If so, are there other ways of doing this?
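One variant I already considered: averaging the sentence vectors gives both sentences equal weight even though they differ in length (6 vs. 4 tokens), while a single mean over all token vectors weights every token equally. A minimal sketch, reusing the embeddings list from above:

import numpy as np

# Stack all token vectors from both sentences into one
# (total_num_tokens, 1024) array, then average over the token axis.
all_tokens = np.concatenate(embeddings, axis=0)        # shape: (10, 1024)
document_token_weighted = np.mean(all_tokens, axis=0)  # shape: (1024,)

The two results differ whenever the sentences have unequal lengths, so I am not sure which weighting is more appropriate here.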
Thanks!