We have a large number of receipts (more than 20,000) and want to categorize them. A receipt can belong to one or more categories, and we currently have more than 500 categories.
For example:
- If the receipt is a payment for internet service, the category is "InternetService", and the receipt contains ISP and payment information.
- If the receipt is for a lunch outing, the category is "FoodAndBeverages", and the receipt contains the restaurant name, food items, and amounts.
- If the receipt is a payment for a taxi, the category is "Transportation", and the receipt contains the taxi company, vehicle, driver, and location information, plus the amounts.
Besides the categories in the examples above, we also have a "Tax" category, and most receipts belong to it. So each receipt can have one or more categories.
To predict these categories, we went with a multi-label classification approach. For the time being, we take the whole text of each receipt as input and train our model on the receipt text and the categories we already have.
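To make the setup concrete, here is a minimal sketch of how the training targets might look in a multi-label setting: each receipt gets a multi-hot vector with one slot per category. The sample receipts, the helper names (`build_label_index`, `encode_labels`, `decode_labels`) and the 0.5 threshold are illustrative assumptions, not our actual data or code.

```python
# Hypothetical sample data: (receipt text, list of categories).
receipts = [
    ("ACME ISP monthly broadband payment incl. tax", ["InternetService", "Tax"]),
    ("Lunch at Green Bistro: 2 meals, 2 drinks, tax", ["FoodAndBeverages", "Tax"]),
    ("City Taxi ride from airport to hotel", ["Transportation"]),
]


def build_label_index(samples):
    """Map every category seen in the data to a fixed column position."""
    labels = sorted({cat for _, cats in samples for cat in cats})
    return {label: i for i, label in enumerate(labels)}


def encode_labels(categories, label_index):
    """Turn a list of categories into a multi-hot vector (1 = category applies)."""
    vector = [0] * len(label_index)
    for cat in categories:
        vector[label_index[cat]] = 1
    return vector


def decode_labels(scores, label_index, threshold=0.5):
    """Turn per-category scores back into category names.

    The threshold is an assumed default; each category is decided independently,
    which is what lets one receipt receive several categories.
    """
    names = sorted(label_index, key=label_index.get)
    return [name for name, score in zip(names, scores) if score >= threshold]


label_index = build_label_index(receipts)
targets = [encode_labels(cats, label_index) for _, cats in receipts]

for (text, _), target in zip(receipts, targets):
    print(target, text)
```

A model trained on these targets would output one score per category, and `decode_labels` would then pick every category whose score passes the threshold, rather than only the single best one.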
We want to verify that this is the right approach to the problem. We look forward to hearing the thoughts of the experts here.