
I am a newbie at machine learning and data mining. Here's the problem: I currently have a single input variable, a short text made up of non-standard nouns, which I want to classify into a target category. About 40% of the entire dataset is labelled training data; the remaining 60% is what we would like to classify as accurately as possible. Below are some example input values, taken from multiple observations, that were all assigned the title 'LEAD_GENERATION_REPRESENTATIVE':

"Business Development Representative MFG"
"Business Development Director Retail-KK"
"Branch Staff"
"Account Development Rep"
"New Business Rep"
"Hong Kong Cloud"
"Lead Gen, New Business Development"
"Strategic Alliances EMEA"
"ENG-BDE"

I think the list above gives an idea of what I mean by non-standard nouns. A few tokens are meaningful, like 'development', 'lead', and 'rep'; others look random with no obvious semantics, although they may appear multiple times in the data. Another issue is that tokens like 'rep' or 'account' can appear under multiple categories, which I expect will make weighting/similarity a challenging task.
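
To make the weighting point concrete, here is a minimal sketch of what I have in mind. It assumes scikit-learn's TfidfVectorizer (my choice for illustration, not something given in the data) and just shows how IDF down-weights tokens that recur across many titles, such as 'rep' or 'development':

```python
# Minimal sketch: TF-IDF weighting over a few of the example titles.
# scikit-learn is an assumption here; any TF-IDF implementation would do.
from sklearn.feature_extraction.text import TfidfVectorizer

titles = [
    "Business Development Representative MFG",
    "Account Development Rep",
    "New Business Rep",
    "Lead Gen, New Business Development",
]

vec = TfidfVectorizer(lowercase=True, token_pattern=r"[A-Za-z]+")
vec.fit(titles)

# Lower idf => the token occurs in more titles and carries less signal.
for token, idx in sorted(vec.vocabulary_.items()):
    print(f"{token:15s} idf={vec.idf_[idx]:.2f}")
```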

My first question: is it worth automating this kind of classification?

Second: is it a good problem for learning machine learning classification? There are only about 30k such entries and a handful of target categories. I could find someone to do this manually, which would also be more accurate.

Here's my take on the problem so far:

Full-text engine: use something like Solr to build an index and write query rules that draw matches based on tokens - words, phrases, synonyms, acronyms, descriptions. I can get someone to define a detailed taxonomy for each category. Use boosting and a pluggable scoring library.

Machine learning: Naive Bayes classification, decision trees, SVM.
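
For reference, here is a minimal sketch of the Naive Bayes option. The pipeline choices (scikit-learn, character n-gram TF-IDF) and the second category label are my own illustrative assumptions, not anything taken from the real data:

```python
# Sketch of the Naive Bayes route: character n-grams help with tokens
# like 'ENG-BDE' or 'Retail-KK' that word-level features would miss.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny toy sample; 'ACCOUNT_EXECUTIVE' is a made-up second category.
titles = [
    "Business Development Representative MFG",
    "Account Development Rep",
    "New Business Rep",
    "Lead Gen, New Business Development",
    "Senior Account Executive",
    "Enterprise Account Executive West",
]
labels = [
    "LEAD_GENERATION_REPRESENTATIVE",
    "LEAD_GENERATION_REPRESENTATIVE",
    "LEAD_GENERATION_REPRESENTATIVE",
    "LEAD_GENERATION_REPRESENTATIVE",
    "ACCOUNT_EXECUTIVE",
    "ACCOUNT_EXECUTIVE",
]

model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4), lowercase=True),
    MultinomialNB(),
)
model.fit(titles, labels)

print(model.predict(["Business Development Director Retail-KK"]))
```

The same pipeline could swap MultinomialNB for a decision tree or a linear SVM to compare the three options on the labelled 40%.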

I have tried out Solr for this with a reverse lookup, since I don't have a taxonomy available at the moment. It seems like I can get about 80% true positives (I'll have to dig more into the confusion matrix to reduce false positives). My query is a bunch of boolean terms and phrases with proximity and boosts, plus negations to reduce errors. I'm afraid this approach may overfit and won't scale.
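
For concreteness, this is roughly the shape of one such hand-built query, hitting Solr's standard /select handler. The core name, field name, and the specific terms are placeholders, not my actual rules:

```python
# Rough shape of one hand-built rule: phrase proximity (~), boosts (^),
# and a negation to cut false positives. Core/field names are placeholders.
import requests

query = (
    '(title:("business development"~2)^3 '
    'OR title:("lead gen"~1)^2 '
    'OR title:(rep OR representative)) '
    'AND NOT title:(director OR manager)'
)

resp = requests.get(
    "http://localhost:8983/solr/titles/select",
    params={"q": query, "fl": "title,score", "rows": 10, "wt": "json"},
)
for doc in resp.json()["response"]["docs"]:
    print(doc)
```

Each new rule like this is effectively a hand-written feature, which is why I worry it will overfit as more categories are added.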

I am aware that people usually try multiple modeling techniques to see which one works best, or combine techniques. I want to understand this problem from a feasibility and complexity point of view. If the question is too broad, please just comment on the feasibility of a solution.

nir
  • If your data is constant and never changes, absolutely just have some people manually look at it (30K is not a lot). Otherwise, you might want to consider a classifier to help you on the tail queries. How many target categories do you have? – greeness Jul 16 '16 at 01:00
  • It's an unstructured input variable and we do get new observations every month, but at a lower rate. We have about 15 target categories. What do you mean by using a classifier for tail queries - do you mean a classifier that generates features based on phrases and n-grams? – nir Jul 16 '16 at 20:30
  • These lower-rate observations are the tail queries. You can build a classifier to handle the tail but use human labels for the head queries. The Solr approach you mentioned above is still a classifier (a nearest-neighbor classifier). I would expect a simple Naive Bayes to perform slightly better (accuracy and speed) than nearest neighbor in your case, since you have only 15 target categories and I assume you will get about 50K or more training examples. – greeness Jul 16 '16 at 21:27
