I am trying to create a training dataset for NER. I have a large amount of data that needs to be tagged, with the unneeded sentences removed; when a sentence is removed, the annotation indices (the start/end offsets) must be updated accordingly. A while ago I saw some excellent code snippets from other users about this, but I can't find them now. Adapting their approach, I can summarize my issue.
Let's take some sample training data:
data = [{"content":'''Hello we are hans and john. I enjoy playing Football.
I love eating grapes. Hanaan is great.''',"annotations":[{"id":1,"start":13,"end":17,"tag":"name"},
{"id":2,"start":22,"end":26,"tag":"name"},
{"id":3,"start":68,"end":74,"tag":"fruit"},
{"id":4,"start":76,"end":82,"tag":"name"}]}]
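The start and end values are character offsets into content, so a quick sanity check is to slice the text with each pair and confirm it returns the tagged word:

```python
data = [{"content": '''Hello we are hans and john. I enjoy playing Football.
I love eating grapes. Hanaan is great.''',
         "annotations": [{"id": 1, "start": 13, "end": 17, "tag": "name"},
                         {"id": 2, "start": 22, "end": 26, "tag": "name"},
                         {"id": 3, "start": 68, "end": 74, "tag": "fruit"},
                         {"id": 4, "start": 76, "end": 82, "tag": "name"}]}]

text = data[0]["content"]
for ann in data[0]["annotations"]:
    # Each (start, end) pair should slice out exactly the tagged word
    print(ann["tag"], repr(text[ann["start"]:ann["end"]]))
# name 'hans'
# name 'john'
# fruit 'grapes'
# name 'Hanaan'
```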
This can be visualized using the following spaCy displaCy code:
import json
import spacy
from spacy import displacy
data = [{"content":'''Hello we are hans and john. I enjoy playing Football.
I love eating grapes. Hanaan is great.''',"annotations":[{"id":1,"start":13,"end":17,"tag":"name"},
{"id":2,"start":22,"end":26,"tag":"name"},
{"id":3,"start":68,"end":74,"tag":"fruit"},
{"id":4,"start":76,"end":82,"tag":"name"}]}]
annot_tags = data[0]["annotations"]  # use the first record
entities = []
for j in annot_tags:
    start = j["start"]
    end = j["end"]
    tag = j["tag"]
    entities.append((start, end, tag))
data_one = [(data[0]["content"], {"entities": entities})]

nlp = spacy.blank('en')
raw_text = data_one[0][0]
doc = nlp.make_doc(raw_text)
spans = data_one[0][1]["entities"]
ents = []
for span_start, span_end, label in spans:
    ent = doc.char_span(span_start, span_end, label=label)
    if ent is None:
        continue  # skip spans that don't align with token boundaries
    ents.append(ent)
doc.ents = ents
displacy.render(doc, style="ent", jupyter=True)
The output will be
Now I want to remove the sentences that carry no tags and update the index values. The data must stay in the same format: the untagged sentence is removed and the start/end offsets are shifted accordingly, so that the displaCy visualization above still lines up.
Required output data:
[{"content":'''Hello we are hans and john.
I love eating grapes. Hanaan is great.''',"annotations":[{"id":1,"start":13,"end":17,"tag":"name"},
{"id":2,"start":22,"end":26,"tag":"name"},
{"id":3,"start":42,"end":48,"tag":"fruit"},
{"id":4,"start":50,"end":56,"tag":"name"}]}]
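The required offsets can be sanity-checked the same way as the input ones (the indices work out the same whether the kept sentences are separated by a newline or a single space, since both are one character):

```python
required = {
    "content": "Hello we are hans and john.\nI love eating grapes. Hanaan is great.",
    "annotations": [{"id": 1, "start": 13, "end": 17, "tag": "name"},
                    {"id": 2, "start": 22, "end": 26, "tag": "name"},
                    {"id": 3, "start": 42, "end": 48, "tag": "fruit"},
                    {"id": 4, "start": 50, "end": 56, "tag": "name"}]
}
for a in required["annotations"]:
    # Each shifted (start, end) pair should still slice out the tagged word
    print(a["tag"], repr(required["content"][a["start"]:a["end"]]))
# name 'hans'
# name 'john'
# fruit 'grapes'
# name 'Hanaan'
```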
I was following a post recently and got near-working code.
Code
import re
data = [{"content":'''Hello we are hans and john. I enjoy playing Football.
I love eating grapes. Hanaan is great.''',"annotations":[{"id":1,"start":13,"end":17,"tag":"name"},
{"id":2,"start":22,"end":26,"tag":"name"},
{"id":3,"start":68,"end":74,"tag":"fruit"},
{"id":4,"start":76,"end":82,"tag":"name"}]}]
for idx, each in enumerate(data[0]['annotations']):
    start = each['start']
    end = each['end']
    word = data[0]['content'][start:end]
    data[0]['annotations'][idx]['word'] = word

sentences = [{'sentence': x.strip() + '.', 'checked': False}
             for x in data[0]['content'].split('.')]

new_data = [{'content': '', 'annotations': []}]
for idx, each in enumerate(data[0]['annotations']):
    for idx_alpha, sentence in enumerate(sentences):
        if sentence['checked'] == True:
            continue
        temp = each.copy()
        check_word = temp['word']
        if check_word in sentence['sentence']:
            start_idx = re.search(r'\b({})\b'.format(check_word), sentence['sentence']).start()
            end_idx = start_idx + len(check_word)
            current_len = len(new_data[0]['content'])
            new_data[0]['content'] += sentence['sentence'] + ' '
            temp.update({'start': start_idx + current_len, 'end': end_idx + current_len})
            new_data[0]['annotations'].append(temp)
            sentences[idx_alpha]['checked'] = True
            break
print(new_data)
Output
[{'content': 'Hello we are hans and john. I love eating grapes. Hanaan is great. ',
'annotations': [{'id': 1,
'start': 13,
'end': 17,
'tag': 'name',
'word': 'hans'},
{'id': 3, 'start': 42, 'end': 48, 'tag': 'fruit', 'word': 'grapes'},
{'id': 4, 'start': 50, 'end': 56, 'tag': 'name', 'word': 'Hanaan'}]}]
Here the name john is lost: once a sentence is marked as checked after its first matching annotation, any further annotations in that same sentence are skipped. If more than one tag is present in a sentence, I can't afford to lose those either.
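One direction I'm considering (a rough sketch, not the code from the post I lost; the `drop_untagged` helper name and the regex sentence splitter are my own assumptions) is to iterate over sentences first rather than annotations: keep every annotation whose span falls inside a kept sentence, and shift its offsets by the difference between the sentence's old and new positions. Note that it normalizes the whitespace between kept sentences to a single-space joiner, which here yields exactly the required offsets:

```python
import re

data = [{"content": '''Hello we are hans and john. I enjoy playing Football.
I love eating grapes. Hanaan is great.''',
         "annotations": [{"id": 1, "start": 13, "end": 17, "tag": "name"},
                         {"id": 2, "start": 22, "end": 26, "tag": "name"},
                         {"id": 3, "start": 68, "end": 74, "tag": "fruit"},
                         {"id": 4, "start": 76, "end": 82, "tag": "name"}]}]

def drop_untagged(record, joiner=" "):
    """Keep only the sentences that contain at least one annotation,
    shifting each annotation's start/end into the rebuilt content."""
    content = record["content"]
    kept_parts, new_annotations = [], []
    pos = 0  # length of the rebuilt content so far (including joiners)
    # Match each sentence as a span in the original text, so we know
    # exactly where it started before any removal happened.
    for m in re.finditer(r'\S[^.]*\.', content):
        s_start, s_end = m.span()
        # Every annotation lying fully inside this sentence
        anns = [a for a in record["annotations"]
                if s_start <= a["start"] and a["end"] <= s_end]
        if not anns:
            continue  # sentence carries no tags: drop it
        shift = pos - s_start
        for a in anns:
            new_annotations.append({**a, "start": a["start"] + shift,
                                    "end": a["end"] + shift})
        kept_parts.append(m.group())
        pos += len(m.group()) + len(joiner)
    return {"content": joiner.join(kept_parts), "annotations": new_annotations}

print(drop_untagged(data[0]))
```

Because each sentence is matched as a span in the original content rather than looked up by word, duplicate words and multiple tags per sentence should both survive, and the untagged "I enjoy playing Football." sentence is dropped with all later offsets shifted by its length.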