I am trying to classify a large number of words into 5 categories. Examples of classes and strings for each class include:

invoice-Number : "inv123","in12","123"
invoice-Date   : "22/09/1994","22-Mon-16"
vendor-Name    : "samplevendorname"
email          : "abc@gmail.com"
net-amount     : "1234.56"

Any pointers on how to achieve this in Python are very much appreciated.

EDIT 1: I'm looking for a machine learning approach, because the number of classes will grow and the data in each class will vary, so regex is not feasible.

George Joseph

3 Answers

As you asked for pointers, read about regular expressions. They allow you to check if a string matches a certain pattern.

Python has built-in support of RegEx via the re module. See the re.match function.
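As a small illustration of what re.match does, here is a sketch using strings like those in the question. The pattern is only an assumption about what an invoice number looks like, and the comparison with re.fullmatch is added because re.match checks just the start of the string:

```python
import re

# Hypothetical pattern: an optional "inv"/"in" prefix followed by digits.
pattern = r"(inv|in)?\d+"

# re.match anchors at the start of the string only, so a matching prefix is enough.
prefix_hit = re.match(pattern, "123.45") is not None

# re.fullmatch requires the entire string to fit the pattern.
full_hit = re.fullmatch(pattern, "123.45") is not None
whole_number = re.fullmatch(pattern, "inv123") is not None

print(prefix_hit, full_hit, whole_number)
```

If every pattern is meant to describe the whole string, re.fullmatch is usually the safer choice when several patterns are tried in sequence.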

Unfortunately, I am myself a beginner with RegEx, so I can't help you more. But I have provided you with the required links above. Hopefully, that will be enough to solve your problem.

Meanwhile, I will ask a friend to answer this question.

EDIT:

I dug into RegEx for a minute and this is what I came up with:

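import re

s = "inv123"  # replace with the string you are trying to classify

# Raw strings keep the backslashes intact for the regex engine.
# The prefix is optional so that a bare number such as "123" also matches.
invoice_number = r'(inv|in)?\d+'
invoice_date = r'(\d{2}/\d{2}/\d{4})|(\d{2}-[A-Z][a-z]{2}-\d{2})'
vendor_name = r'[a-z]+'
email = r'\w+@\w+(\.\w+)+'
net_amount = r'\d+\.\d{2}'

# re.fullmatch requires the whole string to fit the pattern. re.match only
# checks the start, so "abc@gmail.com" would hit vendor_name before email.
if re.fullmatch(invoice_number, s):
    label = 'invoice-number'
elif re.fullmatch(invoice_date, s):
    label = 'invoice-date'
elif re.fullmatch(vendor_name, s):
    label = 'vendor-name'
elif re.fullmatch(email, s):
    label = 'email'
elif re.fullmatch(net_amount, s):
    label = 'net-amount'
else:
    label = None  # OOPS!!! no pattern matched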
Abhishek Kumar

You can start from the basic idea of BoW (bag of words) but adapt it to BoC (bag of characters): use a tokenizer that doesn't remove any characters, and build a dictionary of character n-grams of length 1 to 4.

After that you can represent any word as a vector whose entries are the count of each n-gram, a binary present/absent flag, or a TF-IDF weight.
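A minimal sketch of the n-gram dictionary and the count/binary vectors, in plain Python. The function names (char_ngrams, build_vocabulary, vectorize) are made up for illustration, and the word list is just a few of the strings from the question:

```python
from collections import Counter

def char_ngrams(word, n_min=1, n_max=4):
    """Return every character n-gram of length n_min..n_max, keeping all characters."""
    return [word[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(word) - n + 1)]

def build_vocabulary(words):
    """Map each distinct n-gram seen in `words` to a column index."""
    grams = sorted({g for w in words for g in char_ngrams(w)})
    return {g: i for i, g in enumerate(grams)}

def vectorize(word, vocab, binary=False):
    """Count vector (or 0/1 presence vector) of a word over the vocabulary."""
    counts = Counter(g for g in char_ngrams(word) if g in vocab)
    vec = [0] * len(vocab)
    for gram, count in counts.items():
        vec[vocab[gram]] = 1 if binary else count
    return vec

words = ["inv123", "in12", "22/09/1994", "abc@gmail.com", "1234.56"]
vocab = build_vocabulary(words)
count_vec = vectorize("22/09/1994", vocab)
binary_vec = vectorize("22/09/1994", vocab, binary=True)
```

If scikit-learn is available, CountVectorizer(analyzer="char", ngram_range=(1, 4)) should give a comparable count representation, and TfidfVectorizer with the same arguments the TF-IDF variant.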

Then build your model and train it on the word vectors. You can study how the n-grams are distributed across labels to discard the ones that add noise to the dataset.
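One possible way to do the training step is sketched below. The answer does not name a model, so the choice of a multinomial Naive Bayes with add-one smoothing is an assumption, as are the class name NgramNaiveBayes and the tiny training set taken from the question's examples. Real data would need far more samples per class:

```python
import math
from collections import Counter, defaultdict

def char_ngrams(word, n_min=1, n_max=4):
    """Character n-grams of length n_min..n_max, no characters removed."""
    return [word[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(word) - n + 1)]

class NgramNaiveBayes:
    """Multinomial Naive Bayes over character n-gram counts (assumed model choice)."""

    def fit(self, words, labels):
        self.gram_counts = defaultdict(Counter)  # label -> n-gram -> count
        self.label_counts = Counter(labels)      # label -> number of training words
        for word, label in zip(words, labels):
            self.gram_counts[label].update(char_ngrams(word))
        self.vocab = {g for c in self.gram_counts.values() for g in c}
        return self

    def predict(self, word):
        total_words = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.gram_counts.items():
            # Add-one smoothing: every vocabulary n-gram gets one extra count.
            total = sum(counts.values()) + len(self.vocab)
            score = math.log(self.label_counts[label] / total_words)
            for gram in char_ngrams(word):
                if gram in self.vocab:  # n-grams never seen in training are ignored
                    score += math.log((counts[gram] + 1) / total)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

train = [
    ("inv123", "invoice-Number"), ("in12", "invoice-Number"), ("123", "invoice-Number"),
    ("22/09/1994", "invoice-Date"), ("22-Mon-16", "invoice-Date"),
    ("samplevendorname", "vendor-Name"),
    ("abc@gmail.com", "email"),
    ("1234.56", "net-amount"),
]
model = NgramNaiveBayes().fit([w for w, _ in train], [l for _, l in train])
email_guess = model.predict("xyz@gmail.com")
number_guess = model.predict("inv99")
```

Inspecting model.gram_counts per label is one way to spot n-grams that occur across many classes and are candidates for removal.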

I hope this helps as a starting point.

Tzomas

Try to find differences among the classes. For example, I can see that the invoice number tends to be a mixture of letters and digits, the date is likely to include a / or -, the email has to contain an @, and the net amount consists only of digits and a decimal point. If you can use those properties, I'm sure you could categorise them easily.
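The properties listed above could be turned into simple features along these lines. This is only a sketch: the function names describe and guess_category are hypothetical, and the rule order is a guess based on the handful of sample strings in the question:

```python
def describe(word):
    """Turn a string into simple yes/no properties."""
    return {
        "has_letter": any(ch.isalpha() for ch in word),
        "has_digit": any(ch.isdigit() for ch in word),
        "has_date_separator": "/" in word or "-" in word,
        "has_at_sign": "@" in word,
        "digits_and_dot_only": all(ch.isdigit() or ch == "." for ch in word),
    }

def guess_category(word):
    """Hand-written rules over the properties; the order of the checks matters."""
    f = describe(word)
    if f["has_at_sign"]:
        return "email"
    if f["has_date_separator"] and f["has_digit"]:
        return "invoice-Date"
    if f["digits_and_dot_only"] and "." in word:
        return "net-amount"
    if f["has_digit"]:
        return "invoice-Number"
    if f["has_letter"]:
        return "vendor-Name"
    return None  # nothing matched

samples = ["inv123", "123", "22/09/1994", "22-Mon-16",
           "samplevendorname", "abc@gmail.com", "1234.56"]
guesses = {s: guess_category(s) for s in samples}
```

The dictionary returned by describe could also be fed to a classifier as input features instead of being checked by hand-written rules, which fits the machine learning approach the question asks for.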

Otherwise, if it is more difficult, you could try using NLTK, but I don't know how well that would work for this example.

Silver