
I am wondering if anyone has any ideas about the correct approach and suitable algorithms for the scenario below:

There are thousands of distinct documents, each with its own categorical encoding. These documents arrive into the system and need to be manually filed by the user into the correct folder. E.g.

Document Code    Folder
ABC123           Folder 1
DEF456           Folder 2
GHI789           Folder 1

While we could create a mapping of document codes to folders, maintaining this may be very cumbersome for so many codes, and the set of codes may also grow over time. Furthermore, each customer may want to file the same type of document into a different folder.


Is there a good approach to building a supervised model that would essentially learn which folder a specific document tends to get filed under, using weighting from historical manual filing, and then file it automatically for the user in future?

I understand this weighting may be difficult for a new document type, which would need to be manually filed the first time and would therefore be highly biased on the first occasion. But this may be easier than building a classifier for the contents of the document that would ignore the code itself.

If anyone can point out some algorithms, it would be much appreciated!

bjg90

1 Answer


I contributed to a model that has been used on over 1 million documents, using the document name. The short answer is yes, BUT:

  1. I know this is boring, but: don't use machine learning unless you really have to. Maintaining a production model ends up being a lot more work than you might expect if you have not had the pleasure. Furthermore, I would be very tempted to create the mapping as long as the number of codes is small, say less than 1000. Even if you want to create a model, in the long run, having a rules-based solution against which to benchmark it can be invaluable for gaining the confidence of your stakeholders.

  2. If you do go the modeling route, learning this type of mapping should be within reach of some elementary algorithms, such as decision trees, or their more sophisticated cousins, random forest classifiers and gradient boosting machines. Whichever algorithm you choose, data science fundamentals will really be the key to whether what you build ends up helping anyone: understanding the customers' real needs, thorough EDA, and sound experimental design.

  3. No matter the approach you take, I'd advise keeping an iterative mindset: start simple, evaluate, and add complexity (such as customizing the model to each user) bit by bit, just like you would with a traditional software product/project.
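A minimal sketch of the rules-based baseline from point 1, assuming a hypothetical `FilingRules` helper that simply suggests the folder each (customer, code) pair has most often been filed under historically:

```python
from collections import Counter, defaultdict

class FilingRules:
    """Per-customer majority vote over historical manual filings (hypothetical helper)."""

    def __init__(self):
        # (customer, document code) -> Counter of folders chosen historically
        self.history = defaultdict(Counter)

    def record(self, customer, code, folder):
        """Record one manual filing decision."""
        self.history[(customer, code)][folder] += 1

    def suggest(self, customer, code):
        """Return the most frequently chosen folder, or None for an unseen code."""
        counts = self.history.get((customer, code))
        if not counts:
            return None  # new code: fall back to manual filing
        return counts.most_common(1)[0][0]

rules = FilingRules()
rules.record("acme", "ABC123", "Folder 1")
rules.record("acme", "ABC123", "Folder 1")
rules.record("acme", "ABC123", "Folder 2")

rules.suggest("acme", "ABC123")  # majority vote -> "Folder 1"
rules.suggest("acme", "ZZZ000")  # unseen code -> None
```

Note that an unseen code returns `None`, which maps naturally onto the manual-filing fallback described in the question, and keying on customer handles the "same code, different folder per customer" requirement.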

Take a look at the XGBoost classifier as a fine place to start playing around: https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.XGBClassifier

To learn more about designing products that rely on machine learning, I HIGHLY recommend "Building Machine Learning Powered Applications: Going from Idea to Product" by Emmanuel Ameisen.