I'm trying to build a rule-based text classification system to speed up the manual categorization of documents.
The long-term aim is to use these manually classified documents as training data for an AI.
The classification system has over 100 categories.
The idea is to manually build a list of 'words' associated with each category in the classification system.
The list of words will be built by manually classifying a small number of documents and identifying the common words found in each one.
The job of the rule engine is to attempt to identify other documents that belong in the same category based on the assigned words.
I'm looking to put a weight on each word associated with a category. The intent is to initially weight the words based on their occurrence in the manually labelled documents.
So if the word 'select' appeared 50 times in a 1,000-word document, it would get a weighting of 5% (50/1000).
The rule engine's job is then to score other documents based on the occurrences of those words and their relative weightings.
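For example, a rough sketch of that initial weighting step in Python (the whitespace tokenisation and the function name are just assumptions for illustration):

```python
from collections import Counter

def word_weights(document: str, words_of_interest: set[str]) -> dict[str, float]:
    """Weight each word of interest by its share of the document's total word count."""
    tokens = document.lower().split()   # naive whitespace tokenisation
    counts = Counter(tokens)
    total = len(tokens)
    return {word: counts[word] / total for word in words_of_interest}

# A 1000-word document containing 'select' 50 times gives
# word_weights(doc, {'select'}) -> {'select': 0.05}
```

Weights from several labelled documents in the same category could then be averaged, but that is only one option.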
What I'm not certain about is how this scoring process should work, or how to normalise the data given the variance in document size (from 100 words to 10,000 words would be typical).
The intent is to have an iterative process (manually validate classification/add-remove words/adjust weights/classify documents via rule-engine).
With each iteration the rule engine will hopefully get better at correctly classifying the documents, reducing the labelling process to a Good/Bad confirmation. Provided a significant percentage (even 50% should probably be fine) are correctly labelled, the process should proceed rapidly.
I've heard that concepts such as linear regression might apply to this type of problem, but I don't know enough to google effectively.
Edit: I've had some thoughts on how to go about the normalisation process.
- normalise all documents to an 'average' size of 1000 words.
- count the words in a document to get the total word count, e.g. 250 words
- count each word of interest, e.g. checkbox occurred 25 times
- calculate the occurrence of each word as a percentage of the actual document's word count, e.g. checkbox = 10% (25/250)
If we have three words of interest: checkbox, select, multi
We end up with a set of ratios:
checkbox : select : multi = 0.05 : 0.01 : 0.02
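Computing such a ratio profile for any document is the same calculation as above, just expressed as a vector in a fixed word order (again a sketch; the words and the category values are taken from the example):

```python
from collections import Counter

WORDS = ['checkbox', 'select', 'multi']

def ratio_profile(text: str) -> list[float]:
    """Occurrence of each word of interest as a fraction of the document's word count."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    return [counts[word] / len(tokens) for word in WORDS]

category_profile = [0.05, 0.01, 0.02]   # checkbox : select : multi, from the labelled documents
```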
When scoring we are now looking for documents that have the closest matching ratio.
If a document presents with the following ratio:
0.04 : 0.02 : 0.01
Then we can define the distance between the two documents as:
(0.05 - 0.04) + (0.01 - 0.02) + (0.02 - 0.01) = 0.01
The problem with this approach is that we care about the overall distance, so the second word is problematic: it reduces the total because its ratio differs in the opposite direction to the other words.
To counter this, we need to flip the calculation on the second word so that it moves the distance in the same direction:
(0.05 - 0.04) + (0.02 - 0.01) + (0.02 - 0.01) = 0.03
The second equation would appear to more accurately reflect the distance between the two documents.
Given we are talking about distance rather than direction, we would always take the absolute value of each per-word difference.
A distance of zero is considered an exact match.
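For what it's worth, summing the absolute per-word differences like this is the Manhattan (L1) distance between the two ratio profiles, which may help as a search term. A minimal sketch, assuming both profiles are lists in the same word order:

```python
def profile_distance(category_profile: list[float], doc_profile: list[float]) -> float:
    """Manhattan (L1) distance between two word-ratio profiles; 0 means an exact match."""
    return sum(abs(expected - actual)
               for expected, actual in zip(category_profile, doc_profile))

# Using the worked example above:
# profile_distance([0.05, 0.01, 0.02], [0.04, 0.02, 0.01]) -> 0.03
# (give or take floating-point rounding)
```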
I'm not entirely happy with this approach as some words are 'good' words and any number of them should be considered a positive.
e.g. if the classification is checkbox then the word checkbox should always be seen to reduce the distance.
Maybe we deal with this by nominating one or more words as 'keywords'.
When a keyword appears, if its word ratio is greater than the expected ratio, then the distance for that word is considered 0.
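A sketch of how that keyword rule could sit on top of the same distance calculation (the is_keyword list is my own device for marking which words have been nominated as keywords):

```python
def keyword_distance(category_profile: list[float],
                     doc_profile: list[float],
                     is_keyword: list[bool]) -> float:
    """Manhattan distance, except that a keyword whose ratio exceeds the
    expected ratio contributes zero distance (a surplus is never penalised)."""
    total = 0.0
    for expected, actual, keyword in zip(category_profile, doc_profile, is_keyword):
        if keyword and actual > expected:
            continue   # more of a keyword than expected counts as a perfect match for that word
        total += abs(expected - actual)
    return total

# With checkbox nominated as the keyword:
# keyword_distance([0.05, 0.01, 0.02], [0.08, 0.02, 0.01], [True, False, False])
# only counts the select and multi differences.
```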