I need to tag parts of the text in an HTML document. However, it mostly consists of text in the form of dates, company names, addresses, etc. I plan to use a CRF (sklearn-crfsuite).
My problem is that it is difficult to divide the dataset into sentences. Can we train a CRF model without sentence boundaries, treating everything as a single sequence? The tutorials for CRFsuite and sklearn-crfsuite do not talk about this.
If it cannot be done without sentence segmentation, are there any hints on how to divide such texts into sentences?
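To make the question concrete, here is a rough sketch of what I mean by "a single sequence". It is only an illustration under my own assumptions: the tokenizer, the feature names, the helper functions and the sample string are all made up (the string is not my real data). As far as I understand, `sklearn_crfsuite.CRF.fit` takes a list of sequences, so a whole document would simply be one (long) item in that list.

```python
import re

def tokenize(text):
    # Hypothetical tokenizer: words/numbers, single punctuation marks,
    # and newlines kept as their own tokens (they may hint at field boundaries).
    return re.findall(r"\n|\w+|[^\w\s]", text)

def token_features(tokens, i):
    # Hypothetical feature set: the current token plus its direct neighbours.
    tok = tokens[i]
    feats = {
        "lower": tok.lower(),
        "is_digit": tok.isdigit(),
        "is_title": tok.istitle(),
        "is_newline": tok == "\n",
    }
    if i > 0:
        feats["prev_lower"] = tokens[i - 1].lower()
    else:
        feats["BOS"] = True  # beginning of the (document-level) sequence
    if i < len(tokens) - 1:
        feats["next_lower"] = tokens[i + 1].lower()
    else:
        feats["EOS"] = True  # end of the (document-level) sequence
    return feats

def document_to_sequence(text):
    # The whole document becomes ONE sequence of per-token feature dicts.
    tokens = tokenize(text)
    return tokens, [token_features(tokens, i) for i in range(len(tokens))]

def train(X, y):
    # X: list of sequences (one per document), y: list of label lists.
    # Imported here so the rest of the sketch runs without the library.
    import sklearn_crfsuite
    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
    crf.fit(X, y)
    return crf

# Made-up text standing in for the visible text of one HTML document.
doc = "Acme Corp\n12 Main Street\n2019-03-01"
tokens, feats = document_to_sequence(doc)
X = [feats]  # one document == one sequence, no sentence splitting
```

Here `y` would be a list with one label list per document, the same length as that document's token list. Whether very long sequences like this train well is exactly what I am unsure about.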
The data is something like this (I cannot share the actual data):