
I am working with data in the following format:

data = [{"content":'''Hello I am Aniyya. I enjoy playing Football.
I love eating grapes''',"annotations":[{"id":1,"start":11,"end":17,"tag":"name"},
                                {"id":2,"start":59,"end":65,"tag":"fruit"}]}]


I want to get data in the format below. Sentences that do not contain any entities have to be removed, and the start and end of the remaining entities have to be updated to account for the removed sentences.

result_data = [{"content":'''Hello I am Aniyya. I love eating grapes''',"annotations":[{"id":1,"start":11,"end":17,"tag":"name"},
                                {"id":2,"start":33,"end":39,"tag":"fruit"}]}]
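For reference, `start` and `end` in both snippets are character offsets into `content`, with `end` exclusive, so slicing the string recovers the entity text. A minimal sketch to check this (the helper name `entity_texts` is only illustrative):

```python
def entity_texts(record):
    """Return the text each annotation points at, using content[start:end]."""
    content = record["content"]
    return [content[a["start"]:a["end"]] for a in record["annotations"]]


data = [{"content": "Hello I am Aniyya. I enjoy playing Football.\nI love eating grapes",
         "annotations": [{"id": 1, "start": 11, "end": 17, "tag": "name"},
                         {"id": 2, "start": 59, "end": 65, "tag": "fruit"}]}]

result_data = [{"content": "Hello I am Aniyya. I love eating grapes",
                "annotations": [{"id": 1, "start": 11, "end": 17, "tag": "name"},
                                {"id": 2, "start": 33, "end": 39, "tag": "fruit"}]}]

print(entity_texts(data[0]))         # ['Aniyya', 'grapes']
print(entity_texts(result_data[0]))  # ['Aniyya', 'grapes']
```

The same check can be run on the output of any candidate solution to confirm that the offsets were shifted correctly.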


I can't work out the logic for this. I know this is like asking someone to write the code for me, but if any of you have time to help, I would appreciate it a lot; I am stuck. I asked a similar question previously, but that did not work out for me either, so I thought I would describe the problem in more detail. A solution would be helpful for anyone preparing datasets for NLP tasks. Thanks in advance.

The visualization is done with spaCy's displaCy; the code is in "visualizing NER training data and entity using displacy".
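For readers who want to reproduce the visualization, the records can be converted to the dictionary layout that displaCy accepts in manual mode. This is a minimal sketch, not the code from the linked post; the helper name `to_displacy` is made up for illustration.

```python
def to_displacy(record):
    """Convert one record to the dict layout used by displaCy's manual mode."""
    ents = [{"start": a["start"], "end": a["end"], "label": a["tag"]}
            for a in sorted(record["annotations"], key=lambda a: a["start"])]
    return {"text": record["content"], "ents": ents, "title": None}


record = {"content": "Hello I am Aniyya. I enjoy playing Football.\nI love eating grapes",
          "annotations": [{"id": 1, "start": 11, "end": 17, "tag": "name"},
                          {"id": 2, "start": 59, "end": 65, "tag": "fruit"}]}

doc = to_displacy(record)
print(doc["ents"])

# With spaCy installed, the dict can be rendered like this:
#   from spacy import displacy
#   displacy.render(doc, style="ent", manual=True)
```

The entities are sorted by `start` because displaCy draws them in the order given.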

2 Answers

import re

data = [{"content":'''Hello I am Aniyya. I enjoy playing Football.
I love eating grapes. Aniyya is great.''',"annotations":[{"id":1,"start":11,"end":17,"tag":"name"},
                                {"id":2,"start":59,"end":65,"tag":"fruit"},
                                {"id":3,"start":67,"end":73,"tag":"name"}]}]
         
         
         
# Store the text of each entity so it can be located again after sentences are dropped.
for each in data[0]['annotations']:
    each['word'] = data[0]['content'][each['start']:each['end']]

# Split on '.' and put back the period that split() removes.
sentences = [{'sentence': x.strip() + '.', 'checked': False} for x in data[0]['content'].split('.')]

new_data = [{'content': '', 'annotations': []}]
for each in data[0]['annotations']:
    for sentence in sentences:
        # Each sentence is used at most once.
        if sentence['checked']:
            continue
        temp = each.copy()
        # Escape the entity text so regex metacharacters in it are matched literally.
        match = re.search(r'\b({})\b'.format(re.escape(temp['word'])), sentence['sentence'])
        if match:
            start_idx = match.start()
            end_idx = start_idx + len(temp['word'])

            # Offsets must be relative to the rebuilt content, not to the sentence.
            current_len = len(new_data[0]['content'])

            new_data[0]['content'] += sentence['sentence'] + ' '
            temp.update({'start': start_idx + current_len, 'end': end_idx + current_len})
            new_data[0]['annotations'].append(temp)

            sentence['checked'] = True
            break

Output:

print(new_data)
[{'content': 'Hello I am Aniyya. I love eating grapes. Aniyya is great. ', 'annotations': [{'id': 1, 'start': 11, 'end': 17, 'tag': 'name', 'word': 'Aniyya'}, {'id': 2, 'start': 33, 'end': 39, 'tag': 'fruit', 'word': 'grapes'}, {'id': 3, 'start': 41, 'end': 47, 'tag': 'name', 'word': 'Aniyya'}]}]
chitown88
  • Good work, but there is a small problem with the start and end positions for the second sentence: [{'content': 'Hello I am Aniyya. I love eating grapes. ', 'annotations': [{'id': 1, 'start': 11, 'end': 17, 'tag': 'name', 'word': 'Aniyya'}, {'id': 2, 'start': 14, 'end': 20, 'tag': 'fruit', 'word': 'grapes'}]}] –  Oct 22 '21 at 09:20
  • Nearly perfect. As @Nebu-Lin noted, the start and end keys for the second sentence are not updated correctly. –  Oct 22 '21 at 09:29
  • Ah, yes, I see. I used the start index within the individual sentence, but it needs to be relative to the full content. Give me a minute to fix it. – chitown88 Oct 22 '21 at 09:50
  • I provided another solution to this [here](https://stackoverflow.com/questions/69685506/deleting-and-updating-a-string-and-entity-index-in-a-text-document-for-ner-train/69705691#69705691) – chitown88 Oct 25 '21 at 12:15

From what I see in the question, sentences are separated by a delimiter, '.' (dot). You can therefore split the content into separate sentences and check, for each one, whether it has an annotation; if it does not, delete (splice) that sentence from the string.

I've written a draft of a solution along these lines, and it gets the job done. Feel free to suggest changes. You will probably need to tune it to your exact requirements.

data = [{"content": '''Hello I am Aniyya. I enjoy playing Football. I love eating grapes''',
         "annotations": [{"id": 1, "start": 11, "end": 17, "tag": "name"},
                         {"id": 2, "start": 59, "end": 65, "tag": "fruit"}]}]
identifier = '#'  # marker character; assumed not to occur in the content

def processRow(row):
    # Overwrite the last character of every entity with the marker so that
    # sentences containing an entity can be recognised after splitting.
    temp = row["content"]
    for annotation in row["annotations"]:
        end = annotation["end"] - 1
        temp = temp[:end] + identifier + temp[end+1:]

    # Split the marked and the original text the same way, then keep only the
    # original sentences whose marked counterpart contains the marker.
    result = ""
    for markedSentence, sentence in zip(temp.split("."), row["content"].split(".")):
        if identifier in markedSentence:
            result = result + sentence + "."

    return result

def processData(data):
    # Only "content" is rewritten; the annotation offsets are left unchanged.
    for row in data:
        row["content"] = processRow(row)
    print(data)


processData(data)
  • The start and end offsets for the second sentence are not updated to match the new content. The rest is great. –  Oct 22 '21 at 09:27
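One way to address the offset issue raised in the comment is to work with character spans instead of searching for the entity text again. The following is a minimal sketch under a few assumptions, not a fix of the code above: sentences end with '.', kept sentences are joined with a single space, and an entity that crosses a '.' (for example an abbreviation) is dropped. The function name `drop_unannotated_sentences` is hypothetical.

```python
import re


def drop_unannotated_sentences(record):
    """Keep only sentences that contain an entity and shift the offsets to match."""
    content = record["content"]
    pieces, annotations = [], []
    new_len = 0  # length of the rebuilt content so far
    # Each match is one sentence: it starts at the first non-space character
    # and runs up to and including the next '.', or to the end of the text.
    for match in re.finditer(r"[^\s.][^.]*\.?", content):
        text = match.group().rstrip()
        start = match.start()
        end = start + len(text)
        inside = [a for a in record["annotations"]
                  if start <= a["start"] and a["end"] <= end]
        if not inside:
            continue  # no entity in this sentence, so it is removed
        if pieces:
            new_len += 1  # the single space that joins kept sentences
        shift = new_len - start  # how far this sentence moves in the new content
        for a in inside:
            annotations.append({**a, "start": a["start"] + shift, "end": a["end"] + shift})
        pieces.append(text)
        new_len += len(text)
    return {"content": " ".join(pieces), "annotations": annotations}


data = [{"content": "Hello I am Aniyya. I enjoy playing Football.\nI love eating grapes",
         "annotations": [{"id": 1, "start": 11, "end": 17, "tag": "name"},
                         {"id": 2, "start": 59, "end": 65, "tag": "fruit"}]}]

result = [drop_unannotated_sentences(row) for row in data]
print(result)
```

Because each annotation is matched to a sentence by position, a sentence with several entities keeps all of them, and repeated words such as a second "Aniyya" cannot be attributed to the wrong sentence.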