Trying to concatenate two sentence embeddings using Spark NLP

Question

I have data consisting of pairs of sentences (questions) and a label, is_duplicate, that says whether the two mean the same thing. I am trying to build a model on top of two Universal Sentence Encoders that classifies whether the questions in a pair are equivalent. Here is my code:

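from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import UniversalSentenceEncoder, ClassifierDLApproach

# the question text is in the question1 and question2 columns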
document1 = DocumentAssembler()\
    .setInputCol("question1")\
    .setOutputCol("document1")

document2 = DocumentAssembler()\
    .setInputCol("question2")\
    .setOutputCol("document2")
    
# we can also use sentence detector here 
# if we want to train on and get predictions for each sentence
# downloading pretrained embeddings
use1 = UniversalSentenceEncoder.pretrained()\
 .setInputCols(["document1"])\
 .setOutputCol("sentence_embeddings1")
use2 = UniversalSentenceEncoder.pretrained()\
 .setInputCols(["document2"])\
 .setOutputCol("sentence_embeddings2")
 

#assembler

assembler = VectorAssembler()\
  .setInputCols(["sentence_embeddings1", "sentence_embeddings2"])\
  .setOutputCol("sentences_embeddings")

# the labels are in the is_duplicate column
classifierdl = ClassifierDLApproach()\
  .setInputCols(["sentences_embeddings"])\
  .setOutputCol("class")\
  .setLabelColumn("is_duplicate")\
  .setMaxEpochs(5)\
  .setEnableOutputLogs(True)

use_clf_pipeline = Pipeline(
    stages = [
        document1,
        document2,
        use1,
        use2,
        assembler,
        classifierdl
    ])

However, when I run use_pipelineModel = use_clf_pipeline.fit(df), I run into:

IllegalArgumentException: Data type array<struct<annotatorType:string,begin:int,end:int,result:string,metadata:map<string,string>,embeddings:array<float>>> of column sentence_embeddings1 is not supported.
Data type array<struct<annotatorType:string,begin:int,end:int,result:string,metadata:map<string,string>,embeddings:array<float>>> of column sentence_embeddings2 is not supported.

My questions are:

Is Spark NLP the best tool for this kind of model?
How can I use VectorAssembler here? I guess what I need is to concatenate the dense vectors from the two embedding columns...
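
For reference, the error above suggests that VectorAssembler is being handed arrays of annotation structs rather than numeric or vector columns. The concatenation I have in mind can be sketched in plain Python as follows. This is only an illustration under assumptions: the dicts are toy stand-ins for annotation rows, the function name concat_sentence_embeddings is hypothetical, and only the "embeddings" field name comes from the schema in the error message.

```python
from typing import List, Mapping, Sequence


def concat_sentence_embeddings(
    annotations1: Sequence[Mapping],
    annotations2: Sequence[Mapping],
) -> List[float]:
    """Join the first embedding of each annotation array into one flat list.

    Hypothetical helper: assumes one annotation per question, each carrying
    its vector in the "embeddings" field shown in the error message's schema.
    """
    if not annotations1 or not annotations2:
        raise ValueError("each input needs at least one annotation")
    first = list(annotations1[0]["embeddings"])
    second = list(annotations2[0]["embeddings"])
    return first + second


# Toy stand-ins for the contents of sentence_embeddings1 / sentence_embeddings2.
# Real embeddings are much longer; two values keep the example readable.
q1 = [{"annotatorType": "sentence_embeddings", "begin": 0, "end": 9,
       "result": "question 1", "metadata": {}, "embeddings": [0.1, 0.2]}]
q2 = [{"annotatorType": "sentence_embeddings", "begin": 0, "end": 9,
       "result": "question 2", "metadata": {}, "embeddings": [0.3, 0.4]}]

combined = concat_sentence_embeddings(q1, q2)
print(combined)
```

In a Spark job, a function like this could presumably be wrapped in a UDF that returns a dense vector, but I have not verified that. It is also my assumption that ClassifierDLApproach expects an annotation column rather than a plain vector column, in which case the combined vector might have to go to a Spark ML classifier instead.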
