We have a large number of receipts (more than 20k) and want to categorize them. One receipt can belong to one or more categories, and we currently have more than 500 categories.

For example:

  • If the receipt is a payment for internet, the category is "InternetService", and the receipt contains ISP information and payment information.
  • If the receipt is for a lunch outing, the category is "FoodAndBeverages", and the receipt contains the restaurant name, food information and amounts.
  • If the receipt is a payment for a taxi, the category is "Transportation", and the receipt contains taxi company information, vehicle, driver, location information and amounts.

Besides the categories in the examples above, we have a Tax category, and most of the receipts belong to it as well. So each receipt can have one or more categories.

To predict these categories we went with a multi-label classification solution. For the time being we take the whole text of the receipt and train our model on the receipt text and the categories we have.
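The setup described above could be sketched roughly as follows, assuming scikit-learn is used and the receipt text is already available as plain strings. The sample receipts, label sets, and variable names are invented for illustration, and TF-IDF with one-vs-rest logistic regression is only one possible choice of model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical OCR text and label sets; real data would come from the receipts.
texts = [
    "Acme ISP monthly broadband invoice VAT 15%",
    "Fiber internet payment Acme ISP",
    "Lunch at Green Cafe rice curry juice VAT 15%",
    "Green Cafe dinner total service charge",
    "City Cabs taxi fare airport drop VAT 15%",
    "Taxi ride City Cabs driver trip fare",
]
labels = [
    {"InternetService", "Tax"},
    {"InternetService"},
    {"FoodAndBeverages", "Tax"},
    {"FoodAndBeverages"},
    {"Transportation", "Tax"},
    {"Transportation"},
]

# Turn the label sets into a binary indicator matrix (one column per category).
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)

# One binary classifier per category, all sharing the same TF-IDF features.
model = make_pipeline(
    TfidfVectorizer(),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
model.fit(texts, Y)

# Predict an indicator row for a new receipt and map it back to category names.
pred = model.predict(["Acme ISP broadband bill VAT 15%"])
predicted_categories = mlb.inverse_transform(pred)
```

With 500+ categories and about 20k receipts, many categories will have few examples, so per-category evaluation would be worth checking before trusting such a model.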

We want to verify that this is the correct approach to the problem, and look forward to hearing the thoughts of the experts here.

lsc
  • Why do you think this is a "multi-label classification" problem? Do you have receipts with more than one class, like "InternetService" and "FoodAndBeverages" at the same time? I know you wrote "yes" in the initial description, but none of your examples is of that kind. Are the receipts in the form of images or text? You didn't say much about your approach. – Mark.F Dec 22 '18 at 11:21
  • It seems I missed that when describing the problem. I will update the question. Yes, the receipts are in the form of images, and we were able to extract the text from them using the Google Vision API. – lsc Dec 22 '18 at 16:05

1 Answer

Based on your examples, the problem you are solving is multi-class classification, not multi-label classification.

If each receipt is mapped to only one category out of many possible categories, then it is Multi-class classification.

If each receipt could be mapped to more than one category out of many possible categories, then it is Multi-label classification.
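The difference shows up in the shape of the training targets. The following is a small illustration in plain Python; the category names come from the question, while the sample targets and the helper `to_indicator_rows` are hypothetical.

```python
categories = ["FoodAndBeverages", "InternetService", "Tax", "Transportation"]

# Multi-class: exactly one category per receipt.
multiclass_targets = ["InternetService", "FoodAndBeverages", "Transportation"]

# Multi-label: a set of categories per receipt (possibly more than one).
multilabel_targets = [
    {"InternetService", "Tax"},
    {"FoodAndBeverages"},
    {"Transportation", "Tax"},
]

def to_indicator_rows(label_sets, categories):
    """Encode each label set as a 0/1 row with one column per category."""
    return [[1 if c in labels else 0 for c in categories] for labels in label_sets]

indicator = to_indicator_rows(multilabel_targets, categories)
# Rows may contain several 1s, which a single class label cannot express.
```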

For a fuller explanation, and for the algorithms available in sklearn to solve these problems, look here.

For the basic steps of working with text data, read here.

EDIT:

You could have a separate model to predict the tax category for each receipt, since building multiple multi-class models is relatively easier than building a single multi-label model.
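A minimal sketch of this two-model idea, assuming scikit-learn and invented sample data: one multi-class model predicts the main category, and a second model (binary here, the simplest case) decides whether the Tax category also applies. The names `category_model`, `tax_model`, and `predict_categories` are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical receipt texts with one main category and a tax flag each.
texts = [
    "Acme ISP monthly broadband invoice VAT 15%",
    "Fiber internet payment Acme ISP",
    "Lunch at Green Cafe rice curry juice VAT 15%",
    "Green Cafe dinner total service charge",
    "City Cabs taxi fare airport drop VAT 15%",
    "Taxi ride City Cabs driver trip fare",
]
main_category = [
    "InternetService", "InternetService",
    "FoodAndBeverages", "FoodAndBeverages",
    "Transportation", "Transportation",
]
has_tax = [1, 0, 1, 0, 1, 0]

# Model 1: multi-class, exactly one main category per receipt.
category_model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
category_model.fit(texts, main_category)

# Model 2: separate classifier for the Tax category only.
tax_model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
tax_model.fit(texts, has_tax)

def predict_categories(text):
    """Combine both models into a list of category names for one receipt."""
    result = [str(category_model.predict([text])[0])]
    if tax_model.predict([text])[0] == 1:
        result.append("Tax")
    return result

example = predict_categories("City Cabs taxi fare VAT 15%")
```

This only works if every receipt has exactly one non-tax category; if receipts can combine, say, "InternetService" and "FoodAndBeverages", a multi-label model is still needed.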

Venkatachalam