
I am trying to build a custom word-embedding model with fastText that represents my data (a list of sentences) as vectors, so I can feed it to a Keras CNN for abusive-language detection.

My tokenised data is stored in a list like this:

data = [['is',
      'this',
      'a',
      'news',
      'if',
      'you',
      'have',
      'no',
      'news',
      'than',
      'shutdown',
      'the',
      'channel'],
     ['if',
      'interest',
      'rate',
      'will',
      'hike',
      'by',
      'fed',
      'then',
      'what',
      'is',
      'the',
      'effect',
      'on',
      'nifty']]

I am currently training the fastText model (gensim's FastText class) like this:

from gensim.models import FastText

model = FastText(data, size=100, window=5, min_count=5, workers=16, sg=0, negative=5)

And then I retrain with min_count=1, so that no token is dropped:

model = FastText(data, min_count=1)

import numpy as np

documents = []

for document in data:
    word_vectors = []
    for word in document:
        word_vectors.append(model.wv[word])  # one 100-dimensional vector per word
    documents.append(np.concatenate(word_vectors))  # flattens the sentence into 1-D

document_matrix = np.concatenate(documents)  # flattens the whole corpus into 1-D
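
Printing the shape makes the mismatch concrete (the number below comes from the comment thread; np.concatenate joins everything along a single axis, so the result is one long 1-D vector):

print(document_matrix.shape)  # (22938600,) -- flat 1-D, not (documents, words, features)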

Obviously, document_matrix doesn't fit the input shape my Keras model expects:

from keras.models import Sequential
from keras import layers

model = Sequential()
model.add(layers.Conv1D(filters=250, kernel_size=4, padding='same', input_shape=(1,)))  # this input_shape is the part I can't get right
model.add(layers.GlobalMaxPooling1D())
model.add(layers.Dense(250, activation='relu'))
model.add(layers.Dense(3, activation='sigmoid'))
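
For reference, Conv1D expects each sample to be 2-D with shape (timesteps, features), so the batch as a whole must be 3-D. One common way to get there, shown here as a minimal sketch rather than a definitive fix, is to pad every sentence to a fixed length and stack the per-word vectors; max_len below is a hypothetical choice, and emb_dim must match the size= used when training FastText:

import numpy as np

max_len = 50    # hypothetical maximum sentence length; pick it from your corpus
emb_dim = 100   # must match the size= used to train the FastText model

padded = []
for document in data:
    vecs = [model.wv[word] for word in document[:max_len]]  # truncate long sentences
    vecs += [np.zeros(emb_dim)] * (max_len - len(vecs))     # zero-pad short ones
    padded.append(np.stack(vecs))                           # (max_len, emb_dim) per sentence

document_matrix = np.stack(padded)  # shape: (num_documents, max_len, emb_dim)

The Conv1D layer would then take input_shape=(max_len, emb_dim) instead of (1,).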

I am stuck and running out of ideas for how to make the output of the embedding fit the input of the Keras model.

Thank you very much in advance, you guys are the best!

Lisa


1 Answer


You can take each word's representation from the word2vec model with model[YOURKEYWORD]. Some words may not exist in your word2vec model, so you can use a try/except in your code.
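
A minimal sketch of that pattern, assuming a gensim-style model where looking up a missing word raises KeyError:

word_vectors = []
for word in document:
    try:
        word_vectors.append(model.wv[word])  # raises KeyError for out-of-vocabulary words
    except KeyError:
        continue  # skip the word, or append a zero vector instead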

Batuhan B
  • Hi! Thanks a lot! I switched to FastText to avoid the OOV problem and built the document matrix with the loop now shown in the question (concatenating model.wv[word] for every word, then concatenating all documents). The resulting matrix has shape (22938600,), which doesn't fit Sequential's input shape. Do you have any idea what I can do? Thank you – Lisa Apr 04 '20 at 12:20
  • Could you please edit your question based on this comment? It is hard to read code inside a comment. – Batuhan B Apr 04 '20 at 12:45