I am currently working on a custom named-entity recognizer to recognize 4 types of entity: car, equipment, date, issue.
To do so, I use rasa_nlu with its NER_crf component, which is based on sklearn-crfsuite. However, before tagging hundreds of sentences, I have two questions that I have not found answers to:
- Take for example "On 31st Jan., the wheels of AA-075-ZP exhibited an increase in friction". Is it better to tag "On 31st Jan." or "31st Jan." as a date? Same question for "the wheels" versus "wheels" as an equipment.
I took a look at how CRFs work. From what I understood, the probability that a word w is classified as an entity e1 depends on whether this word has already been tagged e1 in other documents, but also on whether it follows a word w2 tagged e2 and on how often words tagged e1 follow words tagged e2.
So the question is: is it better to rely on the sequence of tags or on the content of the entity? In other words, to detect a date, is it more useful for the model to learn that a date comes after "on" or that a date includes "on"?
- My samples are often a description of the issue, such as: "On 31st Jan., the wheels of AA-075-ZP exhibited an increase in friction. This was caused by ... and .... on ... No more impact on the car, the four rubbers have been replaced". Is it worthwhile to tag "rubbers" as an equipment, considering that it comes at the end of a long description and that most of the time I just want to get the first entities in the text? Is it worth increasing the number of occurrences of "rubber" (so that it has a better chance of being tagged as an equipment) if that also gives weight to the pattern "an equipment coming after a lot of words"?
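To make the first question concrete, here is a minimal sketch of the two labelings I am hesitating between, written in BIO format, together with the kind of neighbouring-word features a CRF tagger is typically given. The tokenization, the label names, the span chosen for the issue, and the helpers `word2features` and `bio_to_spans` are my own assumptions for illustration; this is not what rasa_nlu does internally.

```python
# Hypothetical tokenization of the example sentence (an assumption for illustration).
tokens = ["On", "31st", "Jan.", ",", "the", "wheels", "of", "AA-075-ZP",
          "exhibited", "an", "increase", "in", "friction"]

# Option A: "On" and "the" are tagged as part of the entities.
labels_a = ["B-date", "I-date", "I-date", "O", "B-equipment", "I-equipment",
            "O", "B-car", "O", "O", "B-issue", "I-issue", "I-issue"]

# Option B: "On" and "the" are left outside the entities.
labels_b = ["O", "B-date", "I-date", "O", "O", "B-equipment",
            "O", "B-car", "O", "O", "B-issue", "I-issue", "I-issue"]


def word2features(tokens, i):
    """Features for token i: the word itself plus its immediate neighbours."""
    word = tokens[i]
    features = {
        "bias": 1.0,
        "word.lower": word.lower(),
        "word.isdigit": word.isdigit(),
        "word.istitle": word.istitle(),
    }
    if i > 0:
        features["-1:word.lower"] = tokens[i - 1].lower()
    else:
        features["BOS"] = True  # beginning of sentence
    if i < len(tokens) - 1:
        features["+1:word.lower"] = tokens[i + 1].lower()
    else:
        features["EOS"] = True  # end of sentence
    return features


def bio_to_spans(tokens, labels):
    """Collect (entity_type, text) pairs from a BIO-labelled token sequence."""
    spans, current_type, current_tokens = [], None, []
    for token, label in zip(tokens, labels):
        continues = label.startswith("I-") and label[2:] == current_type
        if continues:
            current_tokens.append(token)
            continue
        if current_tokens:  # close the entity that was open
            spans.append((current_type, " ".join(current_tokens)))
        if label == "O":
            current_type, current_tokens = None, []
        else:  # "B-x", or a stray "I-x" treated as the start of a new entity
            current_type, current_tokens = label[2:], [token]
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


print(bio_to_spans(tokens, labels_a))
print(bio_to_spans(tokens, labels_b))
print(word2features(tokens, 1))
```

With features like these, the word "on" is visible to the model as left context of "31st" under both labelings, which is why I am unsure whether it also needs to be inside the entity.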
Thank you in advance