My goal: given a set of documents (mostly in the financial domain), identify specific parts of each one, such as the company name or the type of document.
Training would be done on a couple of hundred documents. Obviously the class distribution will be skewed (with None dominating, at around 99.9% of the examples). I plan to use a CRF (CRFsuite with scikit-learn) and have gone through the necessary literature. I need some advice on the following fronts:
- Will the dataset be sufficient to train the CRF? Considering each document can be split into around 100 tokens (each token being a training instance), 100 documents would give about 10,000 instances in total.
- Will the dataset be too skewed for training a CRF? For example, for 100 documents I would have around 400 instances of a given class and around 8,000 instances of None.
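
To make the setup concrete, here is a minimal sketch of how I imagine turning a tokenized document into per-token training instances and measuring the label skew. Everything in it is an assumption for illustration: the feature names, the helper functions (`token_features`, `doc_to_features`, `label_distribution`), the label names, and the toy document are all made up and not taken from my real data.

```python
from collections import Counter


def token_features(tokens, i):
    """Build a feature dict for the token at position i (hypothetical feature set)."""
    tok = tokens[i]
    feats = {
        "bias": 1.0,
        "lower": tok.lower(),
        "is_upper": tok.isupper(),
        "is_title": tok.istitle(),
        "is_digit": tok.isdigit(),
        "suffix3": tok[-3:],
    }
    # Context features from the neighbouring tokens, with sequence boundary markers.
    if i > 0:
        feats["prev_lower"] = tokens[i - 1].lower()
    else:
        feats["BOS"] = True
    if i < len(tokens) - 1:
        feats["next_lower"] = tokens[i + 1].lower()
    else:
        feats["EOS"] = True
    return feats


def doc_to_features(tokens):
    """One document becomes one sequence of feature dicts, one dict per token."""
    return [token_features(tokens, i) for i in range(len(tokens))]


def label_distribution(label_sequences):
    """Fraction of tokens carrying each label, across all documents."""
    counts = Counter(label for seq in label_sequences for label in seq)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


# Toy, made-up document and labels (illustration only).
tokens = ["Annual", "Report", "of", "Acme", "Corp", "2015"]
labels = ["DOC_TYPE", "DOC_TYPE", "None", "COMPANY", "COMPANY", "None"]

X = [doc_to_features(tokens)]  # list of sequences of feature dicts
y = [labels]                   # list of label sequences, aligned with X
dist = label_distribution(y)
print(dist)
```

As far as I understand, lists shaped like `X` and `y` are what a CRFsuite wrapper for scikit-learn would take in its `fit(X, y)` call, and `label_distribution` run over the real corpus would show how dominant None actually is.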