I have gathered over 20,000 legal pleadings in PDF format. I am an attorney, but I also write computer programs in MFC/VC++ to help with my practice. I'd like to learn to use neural networks to classify documents filed in lawsuits (unfortunately, my math skills are limited to college algebra).
My first goal is to train a three-layer feed-forward neural network to recognize whether a document is a small claims document (with the letters "SP" in the case number) or a regular document (with the letters "CC" in the case number). Every attorney writes some variant of "Case:", "Case No", "Case Number", or one of infinitely many variations of that. So I've taken the first 600 characters of each document (all attorneys put the case number within the first 600 characters) and built a CSV database with one row per document: 600 columns containing the ASCII codes of those first 600 characters, plus a 601st column that is "1" for regular cases or "0" for small claims.
I then run the data through the neural network implemented here: https://takinginitiative.wordpress.com/2008/04/23/basic-neural-network-tutorial-c-implementation-and-source-code/ (naturally, I updated the program to handle 600 input neurons with one output). But when I run it, the accuracy is horrible: something like 2% on the training data and 0% on the general set. 1/8 of the documents are non-small-claims cases.
Is this the sort of problem a neural net can handle? What am I doing wrong?