I am using the following script so that the Rasa framework detects a Dutch postcode when a user enters one:
https://medium.com/@naoko.reeves/rasa-regex-entity-extraction-317f047b28b6
The format of a Dutch postcode is 1234 AB. This works well with a regex like:
[1-9][0-9]{3}[\s]?[a-z]{2}
However, I am now trying to add Speech-To-Text functionality (Azure Cognitive Services), which does not reliably recognise individual letters. For example, 'B' is picked up as 'Bee'.
I am now trying to alter the regex so that the user can say '1 2 3 4 Alpha Bravo' and the regex extractor will pick out '1 2 3 4 A B'.
I have tried using a word boundary, as in the following:
[1-9]*[\s]?[0-9]*[\s]?[0-9]*[\s]?[0-9]*[\s]?\b[a-zA-Z]
and
[1-9]\s[0-9\s]{5}\s?\b[a-zA-Z]
The former is far too lenient: if the user says 'Hello There', it triggers the regex extractor and passes 'HT' to the postcode behaviour.
The latter is stricter, but '1 2 3 4 Alpha Bravo' only matches as '1 2 3 4 A'.
I'd really appreciate any suggestions on how to solve this. If this is not easily achievable with a regex alone, I believe that altering the following function from the Medium article linked above would get the results I'm after. Unfortunately, I'm no Python/regex expert :).
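One possible direction, sketched here under assumptions rather than as a tested Rasa solution: accept either a single letter or a spelling word for each of the two letters, capture the digits and letter tokens in groups, and rebuild the postcode afterwards. The names `NATO`, `LETTER`, `SPOKEN_POSTCODE` and `normalise_postcode` are hypothetical, and the sketch assumes the speech-to-text output contains NATO spelling words such as 'Alpha'. Other spoken forms such as 'Bee' would need extra entries in the mapping.

```python
import re

# Assumption: the speech-to-text output uses NATO spelling words ("Alpha", "Bravo", ...).
NATO = {
    "alpha": "A", "alfa": "A", "bravo": "B", "charlie": "C", "delta": "D",
    "echo": "E", "foxtrot": "F", "golf": "G", "hotel": "H", "india": "I",
    "juliett": "J", "juliet": "J", "kilo": "K", "lima": "L", "mike": "M",
    "november": "N", "oscar": "O", "papa": "P", "quebec": "Q", "romeo": "R",
    "sierra": "S", "tango": "T", "uniform": "U", "victor": "V",
    "whiskey": "W", "xray": "X", "x-ray": "X", "yankee": "Y", "zulu": "Z",
}

# A letter token is a spelling word (longest alternatives first) or one plain letter.
LETTER = (
    "(?:"
    + "|".join(re.escape(word) for word in sorted(NATO, key=len, reverse=True))
    + "|[a-z])"
)

# Four digits (the first non-zero), optionally separated by whitespace,
# followed by two letter tokens. Groups 1-4 are digits, groups 5-6 are letters.
SPOKEN_POSTCODE = re.compile(
    rf"\b([1-9])\s?([0-9])\s?([0-9])\s?([0-9])\s?({LETTER})\s?({LETTER})\b",
    re.IGNORECASE,
)


def to_letter(token):
    """Map a spelling word to its letter; a plain letter is just upper-cased."""
    return NATO.get(token.lower(), token.upper())


def normalise_postcode(text):
    """Return the postcode as '1234 AB', or None if the text contains none."""
    match = SPOKEN_POSTCODE.search(text)
    if match is None:
        return None
    digits = "".join(match.group(1, 2, 3, 4))
    letters = "".join(to_letter(token) for token in match.group(5, 6))
    return f"{digits} {letters}"


print(normalise_postcode("1 2 3 4 Alpha Bravo"))
print(normalise_postcode("my postcode is 1234 ab"))
print(normalise_postcode("Hello There"))
```

Because the pattern requires the four digits before any letter token, an utterance such as 'Hello There' does not match. The sketch returns the canonical form '1234 AB' rather than '1 2 3 4 A B'; the join step can be changed if the spaced form is preferred.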
import re

def match_regex(self, message):
    extracted = []
    for d in self.regex_feature:
        match = re.search(pattern=d['pattern'], string=message)
        if match:
            entity = {
                # start()/end() are the offsets of the matched text;
                # match.pos/match.endpos are only the bounds of the search
                "start": match.start(),
                "end": match.end(),
                "value": match.group(),
                "confidence": 1.0,
                "entity": d['name'],
            }
            extracted.append(entity)
    extracted = self.add_extractor_name(extracted)
    return extracted
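As a rough idea of how the function could be altered, here is a standalone variant (no `self`, so it runs outside Rasa) that passes the matched text through an optional clean-up step before storing it as the entity value. The `normalise` parameter and the `compact` helper are hypothetical and not part of the article's code or of Rasa. The example pattern, which allows whitespace between every character, is likewise only an assumption.

```python
import re


def match_regex(message, regex_features, normalise=None):
    """Standalone sketch of the extractor method, for illustration only."""
    extracted = []
    for feature in regex_features:
        match = re.search(feature["pattern"], message, flags=re.IGNORECASE)
        if match is None:
            continue
        value = match.group()
        if normalise is not None:
            # Hypothetical hook: clean up the raw match before storing it.
            value = normalise(value)
        extracted.append({
            "start": match.start(),  # offset where the matched text begins
            "end": match.end(),      # offset just past the matched text
            "value": value,
            "confidence": 1.0,
            "entity": feature["name"],
        })
    return extracted


def compact(value):
    """Remove all whitespace and upper-case, e.g. '1 2 3 4 a b' becomes '1234AB'."""
    return "".join(value.split()).upper()


features = [{
    "name": "postcode",
    "pattern": r"[1-9]\s?[0-9]\s?[0-9]\s?[0-9]\s?[a-z]\s?[a-z]\b",
}]
message = "it is 1 2 3 4 a b"
entities = match_regex(message, features, normalise=compact)
print(entities)
```

Keeping the clean-up in a separate callable means the pattern only has to locate the postcode, while any mapping from spoken words to letters can live in ordinary Python code.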
I hope this is clear enough.
Thanks!
Jake