What is the better way to tag entities for NER using CRF

Question

I am currently working on a custom named-entity recognizer to recognize 4 types of entity: car, equipment, date, issue.

To do so, I use rasa_nlu with NER_crf from sklearn-crfsuite. However, before tagging hundreds of sentences, I asked myself two questions and I haven't found the answers:

Take, for example, "On 31st Jan., the wheels of AA-075-ZP exhibited an increase in friction". Is it better to tag "On 31st Jan." or "31st Jan." as a date? Same question for "the wheels" or "wheels" as an equipment.

I took a look at how CRF works. From what I understood, the probability of a word w being classified as an entity e1 depends on whether this word has already been tagged e1 in other documents, but also on whether it follows a word w2 tagged e2 and whether we often see words tagged e1 following words tagged e2.

So the question is: is it better to favor the tagging sequence or the tagged content? To detect a date, is it more useful to teach the model that a date comes after "on", or that a date includes "on"?

My samples are often a description of the issue, such as: "On 31st Jan., the wheels of AA-075-ZP exhibited an increase in friction. This was caused by ... and .... on ... No more impact on the car, the four rubbers have been replaced". Is it useful to tag "rubbers" as an equipment, considering that it comes at the end of a long description and that most of the time I just want to get the first entities in the text? Is it worth increasing the number of occurrences of "rubber" (so that "rubber" has a better chance of being tagged as an equipment) if that also gives weight to the pattern "an equipment coming after a lot of words"?

Thank you in advance

score 1 · Answer 1 · answered Jul 24 '18 at 04:03

You seem to be confused about how NER works. You're trying to train a model so you can write functions that work like this:

sentence = "On Jan 31st. I went to Neptune, and then on Feb 3rd I went to Pluto."
get_dates(sentence) # => ['Jan 31st', 'Feb 3rd']
get_places(sentence) # => ['Neptune', 'Pluto']

In order to train the model, you tag the words you want in the function output. So don't tag the context around a word. You can think of the tags as examples of the output from your function if it's working correctly.

Is it better to tag "On 31st Jan." or "31st Jan." as a date ?

You don't want "on" so don't tag it. "On" isn't part of a date.
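To make this concrete, here is a minimal sketch of what the token-level labels for the question's sentence could look like. It assumes a BIO labelling scheme (one common convention, not necessarily what your pipeline uses), and `bio_labels` is a hypothetical helper written for this illustration, not part of rasa_nlu or sklearn-crfsuite. The "issue" entity is left out for brevity.

```python
# Hypothetical helper: turn entity annotations into per-token BIO labels.
# B- marks the first token of an entity, I- a continuation, O everything else.
def bio_labels(tokens, entities):
    """entities: list of (start_token, end_token_exclusive, label)."""
    labels = ["O"] * len(tokens)
    for start, end, label in entities:
        labels[start] = "B-" + label
        for i in range(start + 1, end):
            labels[i] = "I-" + label
    return labels

tokens = ["On", "31st", "Jan.", ",", "the", "wheels", "of", "AA-075-ZP",
          "exhibited", "an", "increase", "in", "friction"]

# Only the content is annotated: "31st Jan.", "wheels", "AA-075-ZP".
# "On" and "the" stay outside the entities.
entities = [(1, 3, "date"), (5, 6, "equipment"), (7, 8, "car")]

labels = bio_labels(tokens, entities)
for token, label in zip(tokens, labels):
    print(f"{token}\t{label}")
```

"On" and "the" end up labelled O, so the model sees them as tokens that tend to precede a date or an equipment rather than as part of the entity.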

is it better to prefer entity tagging sequences or entity tagging content ?

You tag the content so that the model can learn the sequences. Look at training data for generic NER models.
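Context is not lost when you leave it untagged, because a CRF can look at neighbouring tokens through its features. The sketch below is an assumption about how such features might be built: `token_features` is a hypothetical function, and the feature names are illustrative. It produces one dict per token, which is the input shape sklearn-crfsuite expects (a list of feature dicts per sentence).

```python
# Hypothetical feature function: one dict of features per token.
def token_features(tokens, i):
    word = tokens[i]
    feats = {
        "bias": 1.0,
        "word.lower": word.lower(),
        "word.has_digit": any(c.isdigit() for c in word),
    }
    if i > 0:
        # The previous word is a feature of the current token,
        # even though the previous word itself is labelled O.
        feats["-1:word.lower"] = tokens[i - 1].lower()
    else:
        feats["BOS"] = True  # beginning of sentence
    if i < len(tokens) - 1:
        feats["+1:word.lower"] = tokens[i + 1].lower()
    else:
        feats["EOS"] = True  # end of sentence
    return feats

tokens = ["On", "31st", "Jan.", ",", "the", "wheels"]
X_sentence = [token_features(tokens, i) for i in range(len(tokens))]

# Features of "31st": its own content plus the surrounding words.
print(X_sentence[1])
```

With features like these, "on" contributes to detecting "31st" as the start of a date through the `-1:word.lower` feature, without being part of the tagged span.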

Is it interesting to tag "rubbers" as an equipment considering that it comes at the end of a long description and that I most of the time just want to get the first entities in the text ?

This depends on your application. If you gave your training sentence to your program and asked for a list of equipment, should "rubbers" be in that list? If it is, then you should tag it.

Thank you, but you did not get what I meant. I was wondering which is more likely to increase the probability of detecting the date: its content or its context? Obviously I want "31st Jan" as the date, but it is not difficult to reduce "on 31st Jan" to "31st Jan" once detected. I finally got an answer from https://www.sciencedirect.com/science/article/pii/S1532046413001196: "From the perspective of information extraction application, errors caused by determiners are insignificant". The second question was dumb, I admit. - Antoine Deleuze, Jul 24 '18 at 07:19
