I am using the script from the following article so that the Rasa framework detects a Dutch postcode when a user supplies one:

https://medium.com/@naoko.reeves/rasa-regex-entity-extraction-317f047b28b6

The format of a Dutch postcode is 1234 AB. This works well with a regex like:

 [1-9][0-9]{3}[\s]?[a-z]{2}

However, I am now trying to add Speech-To-Text functionality (Azure Cognitive Services), which does not reliably recognise single letters. For example, 'B' is picked up as 'Bee'.

I am now trying to alter the regex so that the user can say '1 2 3 4 Alpha Bravo' and the regex extractor will pick out '1 2 3 4 A B'.

I have tried using a word boundary, as in the following:

[1-9]*[\s]?[0-9]*[\s]?[0-9]*[\s]?[0-9]*[\s]?\b[a-zA-Z]

and

[1-9]\s[0-9\s]{5}\s?\b[a-zA-Z]

The former is far too lenient: if the user says 'Hello There', it triggers the regex extractor and passes 'HT' to the postcode behaviour.

The latter is more strict but I can only get '1 2 3 4 Alpha Bravo' to match as '1 2 3 4 A'.

I'd really appreciate any suggestions on how to solve this. If it is not easily achievable with a regex alone, I believe that altering the following function from the Medium article linked above would get the results I'm after. Unfortunately, I'm no Python/Regex expert :).

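# Method of the extractor class; requires `import re` at module level.
def match_regex(self, message):
    extracted = []
    for d in self.regex_feature:
        match = re.search(pattern=d['pattern'], string=message)
        if match:
            entity = {
                # start()/end() give the span of the match itself;
                # pos/endpos are only the search bounds given to re.search
                "start": match.start(),
                "end": match.end(),
                "value": match.group(),
                "confidence": 1.0,
                "entity": d['name'],
            }
            extracted.append(entity)
    extracted = self.add_extractor_name(extracted)
    return extracted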

I hope this is clear enough.

Thanks!

Jake

2 Answers

Maybe you can try a regex like this:

(?i)\b([1-9][0-9]{3} ?[a-z])[a-z]* +([a-z])[a-z]*

Replace whatever this regex matches with \1\2, i.e. the contents of group 1 followed by the contents of group 2.

Explanation:

  • (?i) - switch to make the match case-insensitive
  • \b - a word boundary
  • ([1-9][0-9]{3} ?[a-z]) - contents of group 1 described below
    • [1-9] - matches any digit from 1 to 9
    • [0-9]{3} - matches 3 occurrences of any digit from 0 to 9
    • ' ?' (a space followed by ?) - matches 0 or 1 occurrence of a space
    • [a-z] - matches a single occurrence of a letter. This will be the 1st letter of the first word after the digits
  • [a-z]* - matches 0+ occurrences of a letter
  • ' +' (a space followed by +) - matches 1+ occurrences of a space
  • ([a-z]) - matches a letter and stores it in Group 2. This will be the 1st letter of the second word
  • [a-z]* - matches 0+ occurrences of a letter
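
The substitution described above could be sketched in Python roughly as follows. The function name collapse_spelled_postcode and the sample sentences are made up for illustration; only the pattern and the \1\2 replacement come from this answer.

```python
import re

# Pattern from this answer: group 1 = four digits, optional space, first initial;
# group 2 = initial of the second word.
POSTCODE_WORDS = re.compile(r"(?i)\b([1-9][0-9]{3} ?[a-z])[a-z]* +([a-z])[a-z]*")


def collapse_spelled_postcode(text: str) -> str:
    """Replace each match with group 1 followed by group 2 (hypothetical helper)."""
    return POSTCODE_WORDS.sub(r"\1\2", text)


print(collapse_spelled_postcode("1234 Alpha Bravo"))                 # 1234 AB
print(collapse_spelled_postcode("my postcode is 1234 alpha bravo"))  # my postcode is 1234 ab
```

Two things to keep in mind with this sketch: the pattern expects the four digits to be adjacent, so '1 2 3 4 Alpha Bravo' is left unchanged, and a postcode that is already compact but followed by another word gets shortened too ('1234 AB please' becomes '1234 Ap').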
Gurmanjot Singh

You can use three capture groups, allowing optional whitespace between the digits and between the two words, each of which must start with an uppercase letter A-Z.

([1-9](?:\s*[0-9]){3})\s?([A-Z])[a-z]*\s*([A-Z])[a-z]*

The pattern matches:

  • ([1-9](?:\s*[0-9]){3}) Match 4 digits with optional whitespace chars between them
  • \s? Match an optional whitespace char
  • ([A-Z])[a-z]*\s* Match an uppercase char A-Z followed by optional lowercase chars and optional whitespace chars
  • ([A-Z])[a-z]* Match an uppercase char A-Z followed by optional lowercase chars
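
As a rough sketch of how the three groups might be turned into an entity like the one built in the question's match_regex, something along these lines could work. The function name extract_postcode and the entity name "postcode" are assumptions for illustration, and the sketch also assumes the speech-to-text output capitalises the spoken words ('Alpha Bravo'), since the pattern requires uppercase initials.

```python
import re

# First pattern from this answer: group 1 = digits (possibly spaced),
# groups 2 and 3 = uppercase initials of the two following words.
SPOKEN_POSTCODE = re.compile(
    r"([1-9](?:\s*[0-9]){3})\s?([A-Z])[a-z]*\s*([A-Z])[a-z]*"
)


def extract_postcode(message):
    """Return an entity dict for the first postcode found, or None (hypothetical helper)."""
    match = SPOKEN_POSTCODE.search(message)
    if match is None:
        return None
    digits = re.sub(r"\s+", "", match.group(1))  # "1 2 3 4" -> "1234"
    letters = match.group(2) + match.group(3)    # "A" + "B" -> "AB"
    return {
        "start": match.start(),  # span of the matched text in the message
        "end": match.end(),
        "value": f"{digits} {letters}",
        "confidence": 1.0,
        "entity": "postcode",  # assumed entity name
    }


print(extract_postcode("1 2 3 4 Alpha Bravo"))
print(extract_postcode("Hello There"))
```

Building the value from the groups instead of match.group() is what reduces 'Alpha Bravo' to 'AB'; the whole match would still contain the full words.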

A slightly stricter option is to match an uppercase char A-Z followed only by upper- or lowercase repetitions of that same char, using an optionally repeated backreference:

\b([1-9](?:\s*[0-9]){3})\s?([A-Z])(?i:\2*)\s*([A-Z])(?i:\3*)\b

import re

pattern = r"\b([1-9](?:\s*[0-9]){3})\s?([A-Z])(?i:\2*)\s*([A-Z])(?i:\3*)\b"
strings = [
    "1 2 3 4 Alpha Bravo",
    "1234 Alpha Bravo",
    "1234A Bbbbbbbc",
    "1234Aaa Bbb",
    "1234Aa Bbb",
    "1234A BbbbbBbb"
]

for s in strings:
    print(re.findall(pattern, s))

Output

[]
[]
[]
[('1234', 'A', 'B')]
[('1234', 'A', 'B')]
[('1234', 'A', 'B')]
The fourth bird