
I have gathered over 20,000 legal pleadings in PDF format. I am an attorney, but I also write computer programs in MFC/VC++ to help with my practice. I'd like to learn to use neural networks (unfortunately my math skills are limited to college algebra) to classify documents filed in lawsuits.

My first goal is to train a three-layer feed-forward neural network to recognize whether a document is a small claims document (with the letters "SP" in the case number) or a regular document (with the letters "CC" in the case number). Every attorney puts some variant of the word "Case:" or "Case No" or "Case Number" or one of countless variations of that. So I've taken the first 600 characters (all attorneys will put the case number within the first 600 chars), and built a CSV database in which each row is one document: 600 columns containing the ASCII codes of those characters, and a 601st column that is "1" for regular cases or "0" for small claims.
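
For concreteness, here is a minimal sketch of that encoding in C++, assuming the PDF text has already been extracted into a std::string (the function name and the choice to pad short documents with spaces are illustrative, not taken from the question):

```cpp
// Minimal sketch of the encoding described above: one CSV row per document,
// 600 ASCII codes followed by the class label.
#include <fstream>
#include <string>

void appendCsvRow(std::ofstream& csv, const std::string& docText, int label)
{
    // Take the first 600 characters, padding with spaces if the text is shorter.
    std::string head = docText.substr(0, 600);
    head.resize(600, ' ');

    for (char c : head)
        csv << static_cast<int>(static_cast<unsigned char>(c)) << ',';

    // Column 601: 1 = regular ("CC") case, 0 = small claims ("SP").
    csv << label << '\n';
}
```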

I then run it through the neural network program coded here: https://takinginitiative.wordpress.com/2008/04/23/basic-neural-network-tutorial-c-implementation-and-source-code/ (naturally I updated the program to handle 600 input neurons and one output), but the accuracy is horrible: something like 2% on the training data and 0% on the general set. 1/8 of the documents are non-small-claims cases.

Is this the sort of problem a Neural Net can handle? What am I doing wrong?

1 Answer


So I've taken the first 600 characters (all attorneys will put the case number within the first 600 chars), and built a CSV database in which each row is one document: 600 columns containing the ASCII codes of those characters, and a 601st column that is "1" for regular cases or "0" for small claims.

Looking at each character at the beginning of the document independently will be very inaccurate. Instead of considering the characters independently, first tokenize the first 600 characters into words, and use those words, rather than individual characters, as the input to your neural net.
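
A rough sketch of such a tokenizer (splitting on anything that is not a letter or digit and lower-casing; a simplification, not the Stanford tokenizer mentioned further down):

```cpp
#include <cctype>
#include <string>
#include <vector>

// Split text into lower-cased alphanumeric tokens.
std::vector<std::string> tokenize(const std::string& text)
{
    std::vector<std::string> tokens;
    std::string current;
    for (char ch : text) {
        unsigned char c = static_cast<unsigned char>(ch);
        if (std::isalnum(c)) {
            current += static_cast<char>(std::tolower(c));
        } else if (!current.empty()) {
            tokens.push_back(current);
            current.clear();
        }
    }
    if (!current.empty())
        tokens.push_back(current);
    return tokens;
}
```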

Note that once you have tokenized the first 600 characters, you will likely find a small, finite list of tokens that mean "case number", removing the need for a neural net at all.
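
For example, a rule-based check along these lines might already settle the SP-vs-CC question, assuming the case-number format described in the question (SP or CC surrounded by digits, dashes, or spaces); the regex and function name are illustrative only:

```cpp
#include <algorithm>
#include <cctype>
#include <regex>
#include <string>

// Returns 1 for a regular ("CC") case, 0 for small claims ("SP"), -1 if unknown.
int classifyByCaseNumber(const std::string& docText)
{
    const std::string head = docText.substr(0, 600);

    // Look for "Case", "Case No.", "Case Number", "Case:", etc.,
    // followed by the case number itself.
    static const std::regex caseNo(
        R"(Case\s*(No\.?|Number|:)?[\s:]*([0-9A-Za-z -]+))",
        std::regex::icase);

    std::smatch m;
    if (std::regex_search(head, m, caseNo)) {
        std::string num = m[2].str();
        std::transform(num.begin(), num.end(), num.begin(),
                       [](unsigned char c) { return std::toupper(c); });
        if (num.find("SP") != std::string::npos) return 0;
        if (num.find("CC") != std::string::npos) return 1;
    }
    return -1; // fall back to the neural net (or manual review)
}
```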

The Stanford Natural Language Processor provides this functionality. You can find a .NET-compatible implementation on NuGet.

Eric J.
  • Thank you. Can you perhaps tell me why using individual characters instead of tokenized words is inaccurate? I eventually would like to create something that will identify case numbers, and classify documents (this is discovery, this is an answer to the complaint, this is a response to the document that was propounded two weeks ago), and perhaps eventually even have a machine propose a response to documents propounded, based upon a study of prior documents that have been created by other lawyers. I'm just curious to what extent we can replace legal professionals with machines. – yzcyxisyxuyz Oct 12 '15 at 00:53
  • Using individual characters is inaccurate. Using tokens is more accurate. The reason is that, by using tokens, you are reducing the complexity of the problem. You are grouping related characters together to form tokens. Using individual characters would also work in theory, but it would require a vastly larger training set because there are far more permutations. – Eric J. Oct 12 '15 at 01:15
  • Thank you so much. That sheds light on it. I will give that a try. So is there a formula or method that will dictate how large the data set must be to handle any given NN problem? Just out of curiosity, how large would my data set need to be? Within those 600 chars is either an SP or a CC, surrounded by numbers, dashes, or spaces. Is it really such a complicated problem that the NN can't figure it out with 20000 documents? Wow. Thanks again. – yzcyxisyxuyz Oct 12 '15 at 01:36
  • I'm not sure. I'm not a neural network programmer, I just play one on TV. (By that I mean I have dabbled a bit but am far from an expert). – Eric J. Oct 12 '15 at 01:39
  • Final question: When I tokenize the words, how do I convert them to numbers for input into the network? – yzcyxisyxuyz Oct 12 '15 at 01:44
  • You could create a list of unique words and use each word's index in that list. Another, possibly simpler, approach is to use theWord.GetHashCode(). If dealing with many words there is a chance of two different words having the same hash code, but if the set of probable words is smallish, the chance of a hash collision is pretty small too. – Eric J. Oct 12 '15 at 01:47
  • Eric, just so you know, I did exactly what you suggested. Instead of just blindly putting in every character as a neuron, I came up with a list of the most common words that appear in the various types of documents, and also words that would be unlikely to appear in each type of document, and each input was a count of each word (sketched below). There were a total of 24 input neurons, 24 hidden neurons, and one output neuron. 97.5% accuracy!!! Thank you!! – yzcyxisyxuyz Oct 15 '15 at 22:09
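
For reference, a sketch of that word-count encoding (the vocabulary and function name are placeholders, not the commenter's actual 24-word list):

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// One input neuron per vocabulary word, holding that word's count in the document.
std::vector<double> wordCountFeatures(const std::vector<std::string>& tokens,
                                      const std::vector<std::string>& vocabulary)
{
    // Map each vocabulary word to its input index.
    std::unordered_map<std::string, std::size_t> index;
    for (std::size_t i = 0; i < vocabulary.size(); ++i)
        index[vocabulary[i]] = i;

    std::vector<double> features(vocabulary.size(), 0.0);
    for (const std::string& tok : tokens) {
        auto it = index.find(tok);
        if (it != index.end())
            features[it->second] += 1.0; // count occurrences of this vocabulary word
    }
    return features;
}

// Usage: tokenize the document text (see the tokenizer sketch above), then
// pass the resulting counts as the network's input vector, e.g.
//   std::vector<std::string> vocab = { "small", "claims", "complaint", /* ... */ };
//   std::vector<double> inputs = wordCountFeatures(tokenize(text), vocab);
```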