I have a dataset of question pairs (question1, question2) with a label, is_duplicate, that says whether the two questions mean the same thing. I am trying to build a model on top of two Universal Sentence Encoders that classifies whether a pair is a duplicate or not. Here is my code:
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import UniversalSentenceEncoder, ClassifierDLApproach

# the text to encode is in the question1 and question2 columns
document1 = DocumentAssembler()\
    .setInputCol("question1")\
    .setOutputCol("document1")
document2 = DocumentAssembler()\
    .setInputCol("question2")\
    .setOutputCol("document2")
# we can also use sentence detector here
# if we want to train on and get predictions for each sentence
# downloading pretrained embeddings, one encoder per question
use1 = UniversalSentenceEncoder.pretrained()\
    .setInputCols(["document1"])\
    .setOutputCol("sentence_embeddings1")
use2 = UniversalSentenceEncoder.pretrained()\
    .setInputCols(["document2"])\
    .setOutputCol("sentence_embeddings2")
# combine both embeddings into a single feature column
assembler = VectorAssembler()\
    .setInputCols(["sentence_embeddings1", "sentence_embeddings2"])\
    .setOutputCol("sentences_embeddings")
# the label is in the is_duplicate column
classifierdl = ClassifierDLApproach()\
    .setInputCols(["sentences_embeddings"])\
    .setOutputCol("class")\
    .setLabelColumn("is_duplicate")\
    .setMaxEpochs(5)\
    .setEnableOutputLogs(True)
use_clf_pipeline = Pipeline(
    stages = [
        document1,
        document2,
        use1,
        use2,
        assembler,
        classifierdl
    ])
However, when I run use_pipelineModel = use_clf_pipeline.fit(df), I get:
IllegalArgumentException: Data type array<struct<annotatorType:string,begin:int,end:int,result:string,metadata:map<string,string>,embeddings:array<float>>> of column sentence_embeddings1 is not supported.
Data type array<struct<annotatorType:string,begin:int,end:int,result:string,metadata:map<string,string>,embeddings:array<float>>> of column sentence_embeddings2 is not supported.
My questions are:
- Is Spark NLP the right tool for this kind of model?
- How can I use VectorAssembler here? I guess what I need is to extract a dense vector from each embeddings column and concatenate the two.
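
To make the second question concrete, here is a minimal plain-Python sketch (no Spark involved) of the concatenation I have in mind. It assumes each embeddings column holds a list of annotation structs with an embeddings field, as the schema in the error message suggests, and that only the first annotation per question matters. The helper names first_embedding and concat_pair and the toy three-dimensional vectors are made up for illustration.

```python
from typing import Dict, List, Sequence

# One annotation, shaped like the struct in the error message (assumption).
Annotation = Dict[str, object]


def first_embedding(annotations: Sequence[Annotation]) -> List[float]:
    """Return the embeddings of the first annotation in a column value."""
    if not annotations:
        raise ValueError("annotation column is empty")
    return [float(x) for x in annotations[0]["embeddings"]]


def concat_pair(col1: Sequence[Annotation], col2: Sequence[Annotation]) -> List[float]:
    """Concatenate the first embedding of each column into one feature list."""
    return first_embedding(col1) + first_embedding(col2)


def make_annotation(text: str, embeddings: List[float]) -> Annotation:
    """Build a toy annotation with the same fields as the real struct."""
    return {
        "annotatorType": "sentence_embeddings",
        "begin": 0,
        "end": len(text) - 1,
        "result": text,
        "metadata": {},
        "embeddings": embeddings,
    }


# Toy row: real encoder output would be much wider than 3 values.
row = {
    "sentence_embeddings1": [make_annotation("How do I learn Spark?", [0.1, 0.2, 0.3])],
    "sentence_embeddings2": [make_annotation("What is the best way to learn Spark?", [0.4, 0.5, 0.6])],
}

features = concat_pair(row["sentence_embeddings1"], row["sentence_embeddings2"])
print(features)
```

In Spark, I assume the same logic would have to live in a UDF that returns a vector, or go through Spark NLP's EmbeddingsFinisher (which, as far as I understand, can turn annotation columns into vector columns) before VectorAssembler sees the data. I am also not sure ClassifierDLApproach would accept a plain vector column as input afterwards.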