I am trying to create a training dataset for NER. I have a large amount of data that needs to be tagged, with the unneeded sentences removed; when a sentence is removed, the annotation indices (the start/end offsets) must be updated accordingly. A while ago I saw some excellent code snippets from other users about this, but I can't find them now. Adapting their approach, I can summarize my issue.
Let's take some sample training data:
data = [{"content":'''Hello we are hans and john. I enjoy playing Football.
I love eating grapes. Hanaan is great.''',"annotations":[{"id":1,"start":13,"end":17,"tag":"name"},
{"id":2,"start":22,"end":26,"tag":"name"},
{"id":3,"start":68,"end":74,"tag":"fruit"},
{"id":4,"start":76,"end":82,"tag":"name"}]}]
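The start and end values are character offsets into content, so a quick sanity check is to slice the text with each pair and confirm it returns the tagged word:

```python
data = [{"content": '''Hello we are hans and john. I enjoy playing Football.
I love eating grapes. Hanaan is great.''',
         "annotations": [{"id": 1, "start": 13, "end": 17, "tag": "name"},
                         {"id": 2, "start": 22, "end": 26, "tag": "name"},
                         {"id": 3, "start": 68, "end": 74, "tag": "fruit"},
                         {"id": 4, "start": 76, "end": 82, "tag": "name"}]}]

text = data[0]["content"]
for ann in data[0]["annotations"]:
    # Each (start, end) pair should slice out exactly the tagged word
    print(ann["tag"], repr(text[ann["start"]:ann["end"]]))
# name 'hans'
# name 'john'
# fruit 'grapes'
# name 'Hanaan'
```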
This can be visualized using the following spaCy displaCy code:
import json
import spacy
from spacy import displacy
data = [{"content":'''Hello we are hans and john. I enjoy playing Football.
I love eating grapes. Hanaan is great.''',"annotations":[{"id":1,"start":13,"end":17,"tag":"name"},
{"id":2,"start":22,"end":26,"tag":"name"},
{"id":3,"start":68,"end":74,"tag":"fruit"},
{"id":4,"start":76,"end":82,"tag":"name"}]}]
annot_tags = data[0]["annotations"]  # use the first record
entities = []
for j in annot_tags:
    start = j["start"]
    end = j["end"]
    tag = j["tag"]
    entities.append((start, end, tag))
data_one = [(data[0]["content"], {"entities": entities})]

nlp = spacy.blank('en')
raw_text = data_one[0][0]
doc = nlp.make_doc(raw_text)
spans = data_one[0][1]["entities"]
ents = []
for span_start, span_end, label in spans:
    ent = doc.char_span(span_start, span_end, label=label)
    if ent is None:
        continue  # skip spans that don't align with token boundaries
    ents.append(ent)
doc.ents = ents
displacy.render(doc, style="ent", jupyter=True)
The output will be
Now I want to remove the sentences that carry no tags and update the index values. The data must stay in the same format: the untagged sentence is removed and the start/end offsets are shifted accordingly, so that the displaCy visualization above still lines up.
Required output data:
[{"content":'''Hello we are hans and john.
I love eating grapes. Hanaan is great.''',"annotations":[{"id":1,"start":13,"end":17,"tag":"name"},
{"id":2,"start":22,"end":26,"tag":"name"},
{"id":3,"start":42,"end":48,"tag":"fruit"},
{"id":4,"start":50,"end":56,"tag":"name"}]}]
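The required offsets can be sanity-checked the same way as the input ones (the indices work out the same whether the kept sentences are separated by a newline or a single space, since both are one character):

```python
required = {
    "content": "Hello we are hans and john.\nI love eating grapes. Hanaan is great.",
    "annotations": [{"id": 1, "start": 13, "end": 17, "tag": "name"},
                    {"id": 2, "start": 22, "end": 26, "tag": "name"},
                    {"id": 3, "start": 42, "end": 48, "tag": "fruit"},
                    {"id": 4, "start": 50, "end": 56, "tag": "name"}]
}
for a in required["annotations"]:
    # Each shifted (start, end) pair should still slice out the tagged word
    print(a["tag"], repr(required["content"][a["start"]:a["end"]]))
# name 'hans'
# name 'john'
# fruit 'grapes'
# name 'Hanaan'
```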
I was following a post recently and got near-working code.
Code
import re
data = [{"content":'''Hello we are hans and john. I enjoy playing Football.
I love eating grapes. Hanaan is great.''',"annotations":[{"id":1,"start":13,"end":17,"tag":"name"},
{"id":2,"start":22,"end":26,"tag":"name"},
{"id":3,"start":68,"end":74,"tag":"fruit"},
{"id":4,"start":76,"end":82,"tag":"name"}]}]
for idx, each in enumerate(data[0]['annotations']):
    start = each['start']
    end = each['end']
    word = data[0]['content'][start:end]
    data[0]['annotations'][idx]['word'] = word

sentences = [{'sentence': x.strip() + '.', 'checked': False}
             for x in data[0]['content'].split('.')]

new_data = [{'content': '', 'annotations': []}]
for idx, each in enumerate(data[0]['annotations']):
    for idx_alpha, sentence in enumerate(sentences):
        if sentence['checked'] == True:
            continue
        temp = each.copy()
        check_word = temp['word']
        if check_word in sentence['sentence']:
            start_idx = re.search(r'\b({})\b'.format(check_word), sentence['sentence']).start()
            end_idx = start_idx + len(check_word)
            current_len = len(new_data[0]['content'])
            new_data[0]['content'] += sentence['sentence'] + ' '
            temp.update({'start': start_idx + current_len, 'end': end_idx + current_len})
            new_data[0]['annotations'].append(temp)
            sentences[idx_alpha]['checked'] = True
            break
print(new_data)
Output
[{'content': 'Hello we are hans and john. I love eating grapes. Hanaan is great. ',
'annotations': [{'id': 1,
'start': 13,
'end': 17,
'tag': 'name',
'word': 'hans'},
{'id': 3, 'start': 42, 'end': 48, 'tag': 'fruit', 'word': 'grapes'},
{'id': 4, 'start': 50, 'end': 56, 'tag': 'name', 'word': 'Hanaan'}]}]
Here the name john is lost: once a sentence is marked as checked after its first matching annotation, any further annotations in that same sentence are skipped. If more than one tag is present in a sentence, I can't afford to lose those either.
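One direction I'm considering (a rough sketch, not the code from the post I lost; the `drop_untagged` helper name and the regex sentence splitter are my own assumptions) is to iterate over sentences first rather than annotations: keep every annotation whose span falls inside a kept sentence, and shift its offsets by the difference between the sentence's old and new positions. Note that it normalizes the whitespace between kept sentences to a single-space joiner, which here yields exactly the required offsets:

```python
import re

data = [{"content": '''Hello we are hans and john. I enjoy playing Football.
I love eating grapes. Hanaan is great.''',
         "annotations": [{"id": 1, "start": 13, "end": 17, "tag": "name"},
                         {"id": 2, "start": 22, "end": 26, "tag": "name"},
                         {"id": 3, "start": 68, "end": 74, "tag": "fruit"},
                         {"id": 4, "start": 76, "end": 82, "tag": "name"}]}]

def drop_untagged(record, joiner=" "):
    """Keep only the sentences that contain at least one annotation,
    shifting each annotation's start/end into the rebuilt content."""
    content = record["content"]
    kept_parts, new_annotations = [], []
    pos = 0  # length of the rebuilt content so far (including joiners)
    # Match each sentence as a span in the original text, so we know
    # exactly where it started before any removal happened.
    for m in re.finditer(r'\S[^.]*\.', content):
        s_start, s_end = m.span()
        # Every annotation lying fully inside this sentence
        anns = [a for a in record["annotations"]
                if s_start <= a["start"] and a["end"] <= s_end]
        if not anns:
            continue  # sentence carries no tags: drop it
        shift = pos - s_start
        for a in anns:
            new_annotations.append({**a, "start": a["start"] + shift,
                                    "end": a["end"] + shift})
        kept_parts.append(m.group())
        pos += len(m.group()) + len(joiner)
    return {"content": joiner.join(kept_parts), "annotations": new_annotations}

print(drop_untagged(data[0]))
```

Because each sentence is matched as a span in the original content rather than looked up by word, duplicate words and multiple tags per sentence should both survive, and the untagged "I enjoy playing Football." sentence is dropped with all later offsets shifted by its length.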