I would like to get the character index, within the raw text, of an entity found with Python's polyglot library.

    # Polyglot example NER
    from polyglot.text import Text
    text1 = u'Ik wil Ben mijn zoontje met de naam Ben ziek melden.'
    print(text1)
    ptext1 = Text(text1)
    print(ptext1.entities)
    for sent in ptext1.sentences:
        for entity in sent.entities:
            print(entity.tag, entity, entity.start, entity.end)

The result is:

    [I-PER(['Ben'])]
    I-PER ['Ben'] 8 9

So the question is: how do I translate these chunk (token) indexes into start and end character indexes within the original sentence?

  • It looks like that library is giving you the start and end indices -- the token Ben (an entity) starts at token 8 and runs until the start of token 9... – duhaime Jul 04 '18 at 01:07
  • The question is how to translate the token index e.g. token 8 to the index of the start of the token within the raw string. – Ramon Ankersmit Jul 04 '18 at 18:14
  • You mean you want the index position of the first word in the given sentence? Do you want the index position of that first word among all words in the document? Your goal still isn't quite clear... – duhaime Jul 04 '18 at 20:56
  • Just found a solution for my problem (maybe not the best one, but it works for now):

        ptext1 = Text(text1)
        prevIndex = 0
        for sent in ptext1.sentences:
            for entity in sent.entities:
                print(entity.tag, entity, entity.start, entity.end)
                currentIndex = ptext1.index(entity[0], prevIndex)
                print('startindex={}, endindex={}'.format(currentIndex, currentIndex+len(entity[0])))
                prevIndex = currentIndex+len(entity[0])

    This will provide the start index and end index of an entity within the original string. – Ramon Ankersmit Jul 04 '18 at 22:37
  • Great! Why don't you post this as an answer below? It might help someone else! – duhaime Jul 05 '18 at 11:10

2 Answers

Just found a solution for my problem (maybe not the best one but it works now):

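    from polyglot.text import Text

    text1 = u'Ik wil Ben mijn zoontje met de naam Ben ziek melden.'
    ptext1 = Text(text1)
    prevIndex = 0  # character position from which the next search resumes
    for sent in ptext1.sentences:
        for entity in sent.entities:
            print(entity.tag, entity, entity.start, entity.end)
            # Find the entity's first token in the raw text, at or after prevIndex.
            # Caveat: this is the first match from prevIndex, which may not be the tagged occurrence.
            currentIndex = ptext1.index(entity[0], prevIndex)
            # Only entity[0] is measured, so a multi-token entity ends after its first token.
            print('startindex={}, endindex={}'.format(currentIndex, currentIndex+len(entity[0])))
            prevIndex = currentIndex+len(entity[0])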

This will provide the start index and end index of an entity within the original string.
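
To go from token indexes straight to character offsets, one option is to align every token with the raw string once and then look up the entity's token range. This is only a minimal sketch: it assumes each token appears verbatim and in order in the raw string, `token_offsets` and `entity_char_span` are hypothetical helper names (not part of polyglot), and the token list is hard-coded here where it would presumably come from `ptext1.words`.

```python
from typing import List, Tuple


def token_offsets(raw: str, tokens: List[str]) -> List[Tuple[int, int]]:
    """Return (start, end) character offsets for each token, scanning raw left to right."""
    offsets = []
    cursor = 0
    for token in tokens:
        # Search only from the end of the previous token, so repeated
        # words are matched in order. Raises ValueError if a token is missing.
        start = raw.index(token, cursor)
        end = start + len(token)
        offsets.append((start, end))
        cursor = end
    return offsets


def entity_char_span(offsets: List[Tuple[int, int]], start_token: int, end_token: int) -> Tuple[int, int]:
    """Map a token range [start_token, end_token) to a character span."""
    return offsets[start_token][0], offsets[end_token - 1][1]


raw = 'Ik wil Ben mijn zoontje met de naam Ben ziek melden.'
# Assumed tokenization of the sentence above (hard-coded for illustration).
tokens = ['Ik', 'wil', 'Ben', 'mijn', 'zoontje', 'met', 'de', 'naam',
          'Ben', 'ziek', 'melden', '.']

offsets = token_offsets(raw, tokens)
# The entity reported in the question covers tokens 8 to 9 (end exclusive).
print(entity_char_span(offsets, 8, 9))  # (36, 39)
```

Because the lookup uses the token indexes, it picks the second "Ben" (the one that was tagged) and not the first one in the string.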

If someone one day needs a better version:

    from typing import Tuple
    from polyglot.text import Text, Chunk

    doc = "         Apple is looking at buying Samsung for $1 billion and Donald Trump isnt happy.               Second sentence with this time Joe Biden."
    text = Text(doc, hint_language_code="en")

    def get_position_in_text(sentence: Text, entity: Chunk) -> Tuple[int, int]:
        """Return the (start, end) character offsets of entity in sentence.raw, or (-1, -1) if not found."""
        sent = sentence.raw
        # Lower bound for the search: total length of the preceding tokens, ignoring whitespace.
        start_search = len("".join(sentence.words[0:entity.start]))
        try:
            start_pos = sent.index(entity[0], start_search)
            # It's a single word; that case is easier.
            if len(entity) == 1:
                return start_pos, start_pos + len(entity[0])
            else:
                # Skip past all but the last token, then locate the last token.
                start_search = start_pos + len("".join(sentence.words[entity.start:entity.end - 1]))
                end_pos = sent.index(entity[-1], start_search)
                return start_pos, end_pos + len(entity[-1])
        except ValueError:
            return -1, -1

    print(text.raw + "\n")
    for entity in text.entities:
        # Polyglot does not give you the character position,
        # but it can be recovered by searching the raw text
        # as done above.
        start_pos, end_pos = get_position_in_text(text, entity)
        print(entity.tag, entity, "start", start_pos, "end", end_pos)

This is a much better version because the version above works per sentence, and each sentence is stripped of leading and trailing spaces, which very easily makes the offsets wrong.

This one instead uses text.raw, which keeps the text intact, including spaces.
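
To illustrate why sentence-local offsets drift, here is a small sketch in plain Python (no polyglot involved). It assumes a sentence is available as a whitespace-stripped string, which is mimicked here with a hard-coded value. `sentence_to_raw_span` is a hypothetical helper that shifts a sentence-local span by the position of the sentence within the raw text.

```python
from typing import Tuple


def sentence_to_raw_span(raw: str, sentence: str, local_span: Tuple[int, int],
                         search_from: int = 0) -> Tuple[int, int]:
    """Shift a span measured inside a stripped sentence to offsets in raw."""
    # Position of the stripped sentence inside the untouched raw text.
    base = raw.index(sentence, search_from)
    return base + local_span[0], base + local_span[1]


raw = "   Apple is buying Samsung.   Joe Biden spoke."
second = "Joe Biden spoke."  # stand-in for a stripped second sentence

name = "Joe Biden"
local_start = second.index(name)
local_span = (local_start, local_start + len(name))

raw_span = sentence_to_raw_span(raw, second, local_span)
print(local_span)  # (0, 9): offsets relative to the stripped sentence
print(raw_span)    # (30, 39): offsets relative to the raw text
```

The two spans differ by the number of characters that precede the sentence in the raw text, which is exactly the information lost when only the stripped sentence is searched.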

Deisss