-2

I have a dataset with three columns of relevant text that I want to convert into doc2vec vectors and then classify with a neural net. How do I convert these three columns into vectors and feed them into a neural net?

How do I input the concatenated vectors into a neural network?

anmol narang

2 Answers

0

One way is to get a doc2vec vector for each of the three documents, in a fixed order, and append them together. Then feed the resulting vector into your neural network.

Another way is to combine the three documents of each row into a single document and get one vector representation for all three. See some example code below.

from gensim.test.utils import common_texts
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(common_texts)]
model = Doc2Vec(documents, vector_size=5, window=2, min_count=1, workers=4)
# infer_vector expects a list of word tokens, so split the three texts and merge their tokens
texts = ['this is sentence one', 'here is another sentence', 'this represents the third sentence']
tokens = [word for text in texts for word in text.split()]
model.infer_vector(tokens).tolist()

Once this is done you can initialize your model and train it.

To fit an sklearn classifier, for example a support vector machine (svm.SVC), check out the code snippets below.

import pandas as pd
from sklearn import svm
clf = svm.SVC(gamma=0.001, C=100.0)
d = pd.DataFrame({'vectors': [[1,2,3], [3,6,5], [9,2,4], [1,2,7]], 'targets': ['class1', 'class1', 'class2', 'class2']})
d
>>>
      vectors   targets
0   [1, 2, 3]   class1
1   [3, 6, 5]   class1
2   [9, 2, 4]   class2
3   [1, 2, 7]   class2

You can fit an sklearn classifier on the vectors as follows.

clf.fit(X=d.vectors.values.tolist(), y=d.targets)

>>>
SVC(C=100.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma=0.001, kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)

You can then use this classifier to predict values.
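As a minimal sketch of that prediction step, reusing the toy DataFrame from above (the `new_vectors` values are made up for illustration and stand in for doc2vec vectors of unseen rows):

```python
import pandas as pd
from sklearn import svm

# Same toy data as above: one list-valued feature column and one label column.
d = pd.DataFrame({
    "vectors": [[1, 2, 3], [3, 6, 5], [9, 2, 4], [1, 2, 7]],
    "targets": ["class1", "class1", "class2", "class2"],
})
clf = svm.SVC(gamma=0.001, C=100.0)
clf.fit(d.vectors.values.tolist(), d.targets)

# Hypothetical new inputs; each must have the same length as the training vectors.
new_vectors = [[1, 2, 3], [9, 2, 4]]
predictions = clf.predict(new_vectors)
print(predictions)
```

`predict` returns one label per input vector, so the result has the same length as `new_vectors`.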

Samuel Nde
  • Appending is done. How do I feed the newly formed vectors into an sklearn classifier and a neural network? Any references? – anmol narang Mar 26 '19 at 18:08
  • But that way the individuality of the three documents would be lost, wouldn't it? – anmol narang Mar 26 '19 at 18:24
  • @anmolnarang That is true but it will be accounted for in the doc2vec representation (which unfortunately is a black box). Also, if my answer is helping you, please upvote it. I am also editing it to include more info. – Samuel Nde Mar 26 '19 at 18:31
0

I would suggest converting each text field into a vector separately using doc2vec, concatenating the vectors, and feeding the resulting vector into a neural network.
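A rough sketch of the concatenate-and-feed step, assuming the three doc2vec vectors are already stored as list-valued DataFrame columns. The column names (`vec_a`, `vec_b`, `vec_c`), the random stand-in vectors, and the choice of sklearn's `MLPClassifier` as the network are all assumptions made for the example. Stacking the lists into a single 2-D array before fitting is what usually avoids sequence-related errors.

```python
import numpy as np
import pandas as pd
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Random stand-ins for doc2vec output: 6 rows, three columns of 5-dimensional vectors.
df = pd.DataFrame({
    "vec_a": rng.normal(size=(6, 5)).tolist(),
    "vec_b": rng.normal(size=(6, 5)).tolist(),
    "vec_c": rng.normal(size=(6, 5)).tolist(),
    "target": ["class1", "class1", "class1", "class2", "class2", "class2"],
})

# Turn each list-valued column into a (6, 5) array, then join them side by side.
X = np.hstack([np.array(df[c].tolist()) for c in ["vec_a", "vec_b", "vec_c"]])
y = df["target"].to_numpy()

# The input layer size follows X automatically: 15 features per row here.
net = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0)
net.fit(X, y)
print(X.shape, net.predict(X[:2]))
```

With a framework where the input size is set by hand, the first layer would need to accept three times the single-column vector size.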

  • After concatenation the dimensions triple in this case, and after that I get an error that has something to do with passing sequences. – anmol narang Mar 26 '19 at 18:07
  • I don't know the exact situation you are in, but if you are able to pick the shape of the neural network yourself you can just pick a shape that allows the neural network to handle the triple amount of data. If you are using a neural network that has already been trained on other data (so you can't change its shape) there are two options: 1) start by concatenating the text fields, feed the result through doc2vec and finally through the neural net. This might make it difficult for your model to take the different columns into account since all of them are squished together. 2) ... – Stefán Erlingur Jónsson Mar 26 '19 at 18:29
  • ... Create a new neural net that summarizes the data from the three doc2vec operations to something that you can feed through your neural net. – Stefán Erlingur Jónsson Mar 26 '19 at 18:30