I've been working on a project on classifying text documents in the legal domain (the Legal Judgment Prediction class of problems).
The data set consists of 700 legal documents, well balanced across the two classes. Preprocessing follows the usual best practices (stopword removal, etc.) and yields 3 paragraphs per document, which I could consider together or separately. On average, each document is 2,285 words long.
I aim to use something different from the classical n-gram model (which takes no account of word order or semantics):
- Using a neural network (Doc2Vec) to transform the text of each document into a vector in a continuous space, in order to build a dataset of document vectors and their corresponding labels (as mentioned, there are 2 possible labels: 0 or 1);
- Training an SVM to classify the samples, evaluated with 10-fold cross-validation (a minimal sketch of this pipeline is given below the list).
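
Roughly, the pipeline looks like this. This is only a minimal sketch using gensim's Doc2Vec and scikit-learn's SVC; the toy corpus and the hyperparameters (vector size, window, epochs, kernel) are placeholders, not my actual settings:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Stand-in for the 700 preprocessed documents (token lists) and labels.
docs = [["court", "ruled", "in", "favour"], ["appeal", "was", "dismissed"]] * 50
labels = [0, 1] * 50

# Tag each document so Doc2Vec learns one vector per document.
tagged = [TaggedDocument(words=d, tags=[i]) for i, d in enumerate(docs)]

# Train the document-embedding model (hyperparameters are illustrative).
model = Doc2Vec(tagged, vector_size=100, window=5, min_count=1, epochs=20)

# Build the feature matrix from the learned document vectors (gensim 4.x API).
X = [model.dv[i] for i in range(len(docs))]

# Evaluate an SVM with 10-fold cross-validation.
scores = cross_val_score(SVC(kernel="rbf"), X, labels, cv=10)
print("mean accuracy: %.3f" % scores.mean())
```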
I was wondering whether someone with experience in this particular domain could suggest other approaches, or ways to improve the model, since I'm not getting particularly good results: 74% accuracy.
Is it correct to use Doc2Vec to transform text into vectors and feed them to a classifier?
My model representation: