I am working on classifying texts and images of scientific articles. From the texts I use title and abstract. So far I have achieved good results using an SVM for the texts and not that good using a CNN for the images. I still did a multimodal classification, which did not show any classification improvement.
What I would like to do now is to use the svm and cnn predictions to classify, something like a vote ensemble. However the VotingClassifier from sklearn does not accept mixed inputs. You would have some idea of how I could implement or some guide line.
Thank you!