
Given a set of documents (mostly in the financial domain), my goal is to identify specific parts of each one, such as the company name or the type of document.

Training is expected to use a couple of hundred documents. Obviously I would have a skewed class distribution (with None accounting for around 99.9% of the examples). I plan to use a CRF (CRFsuite on scikit-learn) and have gone through the relevant literature. I need some advice on the following points:

  • Will the dataset be sufficient to train the CRF? Considering that each document can be split into around 100 tokens (each token being a training instance), we would get 10000 instances in total.

  • Will the dataset be too skewed for training a CRF? For example, for 100 documents I would have around 400 instances of a given class and around 8000 instances of None.
sir_osthara

1 Answer

  1. Nobody can know that in advance. You have to try it on your dataset, check the resulting quality, maybe inspect the CRF model (e.g. https://github.com/TeamHG-Memex/eli5 has sklearn-crfsuite support; a shameless plug), then try to come up with better features or decide to annotate more examples, and so on. This is just general data science work. The dataset size looks to be on the lower side, but depending on how structured the data is and how good the features are, a few hundred documents may be enough to get started. As the dataset is small, you may have to invest more time in feature engineering.
  2. I don't think class imbalance will be a problem; at least, it is unlikely to be your main problem.
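
To make "try it and check the resulting quality" a bit more concrete, here is a minimal sketch in plain Python. It shows two pieces: hand-written token features in the list-of-dicts shape that sklearn-crfsuite accepts, and per-label precision/recall/F1 that skips the dominant None label (with that much None, plain accuracy would look good even for a model that never predicts anything else). The feature names, the toy tokens, and the `DocType`/`Company` labels are all made up for illustration; the actual CRF training step is left out.

```python
from collections import Counter


def token_features(tokens, i):
    """Hypothetical features for token i, returned as a dict."""
    word = tokens[i]
    feats = {
        "bias": 1.0,
        "lower": word.lower(),
        "is_title": word.istitle(),
        "is_upper": word.isupper(),
        "has_digit": any(ch.isdigit() for ch in word),
        "suffix3": word[-3:].lower(),
    }
    # Neighbouring tokens give the model some context.
    if i > 0:
        feats["prev_lower"] = tokens[i - 1].lower()
    else:
        feats["BOS"] = True  # beginning of sequence
    if i < len(tokens) - 1:
        feats["next_lower"] = tokens[i + 1].lower()
    else:
        feats["EOS"] = True  # end of sequence
    return feats


def doc_features(tokens):
    """One feature dict per token for a whole document."""
    return [token_features(tokens, i) for i in range(len(tokens))]


def per_label_scores(y_true, y_pred, ignore="None"):
    """Precision, recall and F1 per label, skipping the dominant label."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for true_seq, pred_seq in zip(y_true, y_pred):
        for t, p in zip(true_seq, pred_seq):
            if t == p:
                tp[t] += 1
            else:
                fp[p] += 1
                fn[t] += 1
    scores = {}
    for label in sorted((set(tp) | set(fp) | set(fn)) - {ignore}):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[label] = (precision, recall, f1)
    return scores


# Toy document and labels, purely illustrative.
tokens = ["Annual", "Report", "of", "Acme", "Corp", "2015"]
X = [doc_features(tokens)]
y_true = [["DocType", "DocType", "None", "Company", "Company", "None"]]
y_pred = [["DocType", "None", "None", "Company", "Company", "None"]]

scores = per_label_scores(y_true, y_pred)
for label, (p, r, f) in scores.items():
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

In a real run, `X` and `y_true` would come from your annotated documents and `y_pred` from the trained model on held-out documents. If the scores for the rare labels are poor, that is the signal to work on features or annotate more.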
Mikhail Korobov