Extracting text from a passage using spacy or nltk

Question

Sorry if this is a repeat, but I couldn't find an answer and would like to know whether there is a clean way to do this. I have a passage from which I need to extract certain entities.

Any alphanumeric string, like: PQ1234, Z123, etc.
Any alphanumeric string followed by a number after a space: PQ1234 01, Z123 08
Any alphanumeric string followed by two numbers, each after a space: PQ1234 01 02, Z123 07 08

As a concrete example, the strings in bold below should be extracted:

01: Once, there was a boy named AZ009 who became bored when he watched over the village PQ123 01 sheep grazing on the B0199. To entertain himself, he sang out, “R0199 01 09! R0199 01 09! R0199 01 09 is chasing the sheep!”

Everything else I want to ignore. I attempted this using spaCy's NOUN and PROPN filters together with string methods like isalpha and isdigit to filter further, but it is becoming too rule-based and I have not been able to implement it well.

I am a newbie to NLP, so I wanted to know if there is a smarter way, or whether some regex rule would get this done better.

Thanks

In spacy there is the `shape` option. E.g. AZ009 has shape XXddd where X is a capital letter and d is a digit. Maybe you could try this with Matcher? — krisograbek, Jun 20 '21 at 12:21

score 2 · Accepted Answer · answered Jun 20 '21 at 22:46

Assuming that the pattern:

starts with capital letters \b[A-Z]+
continues with some digits and spaces [\s\d]+
and always ends with a digit [\d]\b

You can try:

import re

text = """Once, there was a boy named AZ009 who became bored when he watched over the village PQ123 01 sheep grazing on the B0199. To entertain himself, he sang out, “R0199 01 09! R0199 01 09! R0199 01 09 is chasing the sheep!”"""

re.findall(r'\b[A-Z]+[\s\d]+[\d]\b', text)

[out]:

['AZ009', 'PQ123 01', 'B0199', 'R0199 01 09', 'R0199 01 09', 'R0199 01 09']

If you need the string offsets/positions of what you're trying to extract, try:

for match in re.finditer(r'\b[A-Z]+[\s\d]+[\d]\b', text):
    print(match.start(), match.end(), match.group())

[out]:

28 33 AZ009
84 92 PQ123 01
114 119 B0199
157 168 R0199 01 09
170 181 R0199 01 09
183 194 R0199 01 09
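
One caveat with the pattern above: because `[\s\d]+` accepts whitespace right after the letters, it also matches strings like "A 1", and because it needs at least two characters after the letters, it misses a short code like "Z1". Since `\s` includes newlines, a match can also run across a line break. If that matters for your data, a stricter variant might look like the sketch below. It assumes a code is capital letters immediately followed by digits, then at most two further digit groups, each separated by a single space. The names `STRICT_CODE`, `LOOSE_CODE` and `extract_codes` are made up for this illustration.

```python
import re

# Assumed stricter pattern (not the one from the answer above): capital
# letters, digits directly after them, then up to two digit groups that
# are each preceded by exactly one space.
STRICT_CODE = re.compile(r'\b[A-Z]+\d+(?: \d+){0,2}\b')

# The pattern from the answer above, kept for comparison.
LOOSE_CODE = re.compile(r'\b[A-Z]+[\s\d]+[\d]\b')


def extract_codes(text, pattern=STRICT_CODE):
    """Return a (start, end, matched_string) tuple for every match in text."""
    return [(m.start(), m.end(), m.group()) for m in pattern.finditer(text)]


sample = "See item A 1 and Z1 near PQ1234 01 02."
print(extract_codes(sample))              # stricter pattern
print(extract_codes(sample, LOOSE_CODE))  # pattern from the answer
```

On this sample the stricter pattern picks up Z1 and skips "A 1", while the original pattern does the opposite; both find PQ1234 01 02. On the passage from the question the two patterns return the same six strings.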
