
I am writing code to extract certain information from texts, and I am using spaCy.

The goal: if a particular token of a text contains the string "refstart", I want to get the noun chunk preceding that token. Just for info: the tokens containing "refstart" and "refend" are generated with a regex before creating the nlp object in spaCy.

So far I am using this code:

import spacy
nlp = spacy.load('en_core_web_sm')
raw_text = ('Figure 1 shows a cross-sectional view refstart10,20,30refend of a '
            'refrigerator refstart41,43refend that uses a new cooling technology '
            'refstart10,23a,45refend including a retrofitting pump including '
            'high density fluid refstart10refend.')

doc3=nlp(raw_text)

list_of_references=[]
for token in doc3:
    # look if the token is a ref. sign
    # in order to see the functioning of the loops uncomment the prints
    # print('looking for:', token.text)
    if 'refstart' in token.text:
        #print('yes it is in')
        ref_token_text     = token.text
        ref_token_position = token.i
        # print('token text:',ref_token_text)
        for chunk in doc3.noun_chunks:
            if chunk.end == ref_token_position:
                # we have a chunk and a ref. sign
                list_of_references.append((chunk.text, chunk.start, chunk.end, ref_token_text))
                break

This works: I get a list of tuples containing the noun chunks, their start and end, and the text of the token that follows each noun chunk and contains the string "refstart".

The result of this code is:

  • a cross-sectional view, refstart10,20,30refend
  • a refrigerator, refstart41,43refend
  • a new cooling technology, refstart10,23a,45refend
  • high density fluid, refstart10refend

See how "retrofitting pump" is not in the list, because it is not followed by a token containing "refstart".

This is nevertheless very inefficient: nested loops over very large texts can slow down the data pipeline a lot.

Solution 2: I thought about creating a list of tokens with their positions and a list of noun chunks:

# build the list with all the noun chunks, their start and end in the text
list_chunks = []
print("chunks")
for chunk in doc3.noun_chunks:
    list_chunks.append((chunk.text, chunk.start, chunk.end))
    try:
        # chunk.end is exclusive, so doc3[chunk.end] is the token right after the chunk
        print(f'start:{chunk.start}, end:{chunk.end} \t\t {chunk.text} \t following text: {doc3[chunk.end]}')
    except IndexError:
        # avoid breaking on the last chunk, which may have no following token
        print(f'start:{chunk.start}, end:{chunk.end} \t\t {chunk.text} \t following text: last one')

print("refs------------------")
# build the list with all the tokens containing "refstart" and their positions
list_ref_tokens = []
for token in doc3:
    if 'refstart' in token.text:
        list_ref_tokens.append((token.text, token.i))
        print(token.text, token.i)

But now I would have to compare the tuples inside list_chunks and list_ref_tokens, which is also tricky.
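One way I could imagine comparing them (a sketch only; the positions below are made-up placeholders, since in practice list_chunks and list_ref_tokens would be filled by the two loops above) is to index the ref tokens by position in a dict and then join on each chunk's end position:

```python
# Sketch: join the two lists by token position instead of comparing tuples pairwise.
# The numeric positions are illustrative stand-ins for real spaCy token indices.
list_chunks = [('a cross-sectional view', 3, 7),
               ('a refrigerator', 9, 11),
               ('a new cooling technology', 14, 18),
               ('high density fluid', 22, 25)]
list_ref_tokens = [('refstart10,20,30refend', 7),
                   ('refstart41,43refend', 11),
                   ('refstart10,23a,45refend', 18),
                   ('refstart10refend', 25)]

# index the ref tokens by their position for O(1) lookup
ref_by_pos = {pos: text for text, pos in list_ref_tokens}

# keep each chunk whose end position coincides with a ref token's position
list_of_references = [(text, start, end, ref_by_pos[end])
                      for text, start, end in list_chunks
                      if end in ref_by_pos]
```

This turns the tuple comparison into a single pass over list_chunks with constant-time lookups, instead of a nested loop.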

Any other suggestion?

Thanks.
