
I've been working on a project classifying text documents in the legal domain (the Legal Judgment Prediction class of problems).
The data set consists of 700 legal documents (well balanced between two classes). After preprocessing, which applies the usual best practices (removing stopwords, etc.), each document has 3 paragraphs, which I could consider together or separately. On average, a document is 2285 words long.

I aim to use something different from the classical n-grams model (which doesn't take word order or semantics into account):

  • Use a neural network (Doc2Vec) to transform the text of each document into a vector in a continuous space, in order to build a dataset of document vectors and their corresponding labels (as I said, there are 2 possible labels: 0 or 1);
  • Train an SVM to classify the samples, evaluated with 10-fold cross-validation.

I was wondering if someone with experience in this particular domain could suggest other approaches, or ways to improve the model, since I'm not getting particularly good results: 74% accuracy.

Is it correct to use Doc2Vec to transform text into vectors and then feed them to a classifier?

My model representation:

[image: diagram of the model pipeline]

hey_rey
  • A few things... 700 is not a particularly high number of samples, so that's probably a big part of your problem - you might be suffering high variance. More samples should help. Try using CV to tune better hyperparameters for your classifier (also, you can try different classifiers than just SVM). 10-fold CV is relatively high, also. Probably pretty time consuming. You could probably get away with fewer folds for your grid search process – TayTay Oct 01 '18 at 12:56
  • Your question seems rather vague. What exactly do you want to know in regards to a specific programming problem? What have you tried, where exactly are you stuck? – petezurich Oct 01 '18 at 12:56
  • It's not a specific programming problem, it's about whether it makes sense using this representation of the text. **How would you represent a text document in a continuous domain suitable for training a classifier?** @petezurich – hey_rey Oct 01 '18 at 13:06
  • Thank you @Tgsmith6159, I appreciate your comments. If the problem is that I don't have enough samples, I'm stuck!! Would you use a pre-trained model for getting the vectors from the text? – hey_rey Oct 01 '18 at 13:10

1 Answer


Doc2Vec is a reasonable way to transform a variable-length text into a summary vector, and these vectors are often useful for classification – especially topical or sentiment classification (two applications highlighted in the original 'Paragraph Vector' paper).

However, 700 docs is an extremely small training set. Published work has tended to use corpora of tens of thousands to millions of documents.

Also, your specific classification target – predicting a legal judgment – strikes me as much harder than topical or sentiment classification. Knowing how a case will be decided depends on a large body of outside law/precedent (which isn't in the training set) and on logical deductions, sometimes turning on fine points of a situation. Those are things the fuzzy summary of a single text-vector is unlikely to capture.

Against that, your reported 74% accuracy sounds downright impressive. (Would a lay person do as well, with just these summaries?) I wonder if there are certain 'tells' in the summaries – with word choices of the summarizer strongly hinting at, or downright revealing, the actual judgment. If that's the strongest signal in the text (barring actual domain knowledge & logical reasoning), you might get just-as-good results from a simpler n-grams/bag-of-words representation and classifier.
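For comparison, a bag-of-words baseline along those lines is cheap to try (a sketch assuming scikit-learn, with toy stand-in texts in place of the real documents):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Toy stand-in data; replace with the real documents and labels.
texts = ["judge ruled contract breach damages awarded",
         "court dismissed claim insufficient evidence"] * 50
labels = [1, 0] * 50

# Word 1-2 gram TF-IDF plus a linear classifier: a strong, cheap baseline.
baseline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=1),
    LogisticRegression(max_iter=1000),
)
scores = cross_val_score(baseline, texts, labels, cv=10)
print(round(float(scores.mean()), 2))
```

If this matches or beats the Doc2Vec pipeline, that would suggest the signal is in surface word choices rather than anything the dense vectors capture.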

Meta-optimizing your training parameters might incrementally improve results, but I'd think you'd need a lot more data, and perhaps far more advanced learning techniques, to really approximate the kind of legally-competent human-level predictions you may be aiming for.
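Such meta-optimization might look like a grid search over the classifier's main knobs (a sketch assuming scikit-learn; the random features stand in for the Doc2Vec vectors, and the grid values are illustrative):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Stand-in for the 700 Doc2Vec document vectors and their labels.
X = rng.normal(size=(100, 50))
y = rng.integers(0, 2, size=100)

# Small illustrative grid over the SVM's main hyperparameters; 5 folds
# instead of 10 keeps the search cheap.
grid = GridSearchCV(
    SVC(),
    param_grid={"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]},
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_)
```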

gojomo