3
def t_FUNC_(self, t):
    r'(?i)I|(?i)J|(?i)K|(?i)L|(?i)M|(?i)N|(?i)Y'
    return t

In the above function I'm returning a regex, which means FUNC can be I, J, K, L, M, N or Y.

Now, I have a dictionary like:

dic = { 'k1':'v1', 'k2':'v2' }

I have access to this dictionary in the above function. How do I dynamically generate the regex from the keys of the dictionary? The size of the dictionary is also not fixed.

So, I want to replace r'(?i)I|(?i)J|(?i)K|(?i)L|(?i)M|(?i)N|(?i)Y' with something like r'(?i)k1|(?i)k2'.

PS: The above pattern code is used to generate tokens when writing a lexer with the PLY library in Python.

zubug55
    `t_FUNC_` returns its second parameter. It does not return a regex. – DYZ Jan 06 '19 at 02:54
  • @DYZ: That's the way PLY works. The regex is taken from the function's docstring and the function is called only after the regex is matched (and only if the regex is matched). The second parameter of the action function -- really the first argument since OP is using a lexer class -- is the token object which the scanner has already built; the idea is that the action function can modify the token object as it chooses before it is passed on to the parser. It can even manufacture an entirely new token object, by returning something other than the argument it is being given. – rici Jan 06 '19 at 07:38
  • @rici That's what I thought. But _still_ the function does not return the regex, contrary to the OP's claim. – DYZ Jan 06 '19 at 07:40
  • @dyz: true, but I strongly suspect that was just a simple wording error. I doubt whether English is OP's first language. See their previous question https://stackoverflow.com/questions/54048095/getter-setter-as-function-in-python-class-giving-no-attribute-found-error for a slightly clearer description, which still needs to be read with a bit of generosity. – rici Jan 06 '19 at 07:44

3 Answers

2

Putting the keys of the dict into your regex is as simple as:

Code:

regex = '|'.join('(?i){}'.format(k) for k in data)

Test Code:

data = {'k1': 'v1', 'k2': 'v2'}
regex = '|'.join('(?i){}'.format(k) for k in data)
print(regex)

Results:

(?i)k1|(?i)k2
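
Note that PLY reads a rule's pattern from the function's docstring at the moment lex.lex() is called, so a regex generated this way has to be attached before the lexer is built. A minimal sketch of that idea, reusing the rule name from the question (assigning to __doc__ like this is a workaround, not a documented PLY feature; also note that repeating (?i) mid-pattern is deprecated in recent Python versions, as the next answer points out):

import ply.lex as lex

data = {'k1': 'v1', 'k2': 'v2'}
regex = '|'.join('(?i){}'.format(k) for k in data)

def t_FUNC_(t):
    return t

# A docstring is an ordinary function attribute, so the generated
# pattern can be attached before lex.lex() reads it.
t_FUNC_.__doc__ = regex

# ... the tokens list, the remaining rules and lex.lex() follow as usual.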
Stephen Rauch
1
import re

regex = '(?i)' + '|'.join(re.escape(k) for k in dic)

You need the re.escape in case one of the dic keys happens to contain a character that is special in the regex language (like |). Also, the use of a global inline flag like (?i) anywhere but at the start of the pattern is deprecated (and an error as of Python 3.11). If you only want it to apply to part of the expression, you can use the local flag syntax, (?i:foo).
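
For example, a quick sketch (the dictionary below is made up so that one key contains a metacharacter):

import re

dic = {'k1': 'v1', 'k+2': 'v2'}  # 'k+2' contains the metacharacter '+'

# Global flag at the very start of the pattern (allowed), keys escaped:
pattern = '(?i)' + '|'.join(re.escape(k) for k in dic)
print(pattern)                       # (?i)k1|k\+2
print(re.fullmatch(pattern, 'K+2'))  # matches despite the '+' and the case

# Local flag form, for when this is only part of a larger expression:
print('(?i:{})'.format('|'.join(re.escape(k) for k in dic)))  # (?i:k1|k\+2)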

gilch
  • The original question is about generating lexers with [ply]. Since Ply does not permit dynamic modification of the regular expressions after the lexical scanner is generated, this answer is not particularly relevant to the actual question. – rici Jan 06 '19 at 07:40
1

As @AustinHastings says in a comment, Ply builds the lexical scanner by combining the regular expressions supplied in the lexer class, either as the values of class members or as the docstrings of class member functions. Once the scanner is built, it will not be modified, so you really cannot dynamically adjust the regular expressions, at least not after the scanner has been generated.

For the particular application you have in mind, however, it is not necessary to create a custom regular expression. You can use the much simpler procedure illustrated in the Ply manual which shows how to recognise reserved words without a custom regular expression for each word.

The idea is really simple. The reserved words -- function names in your case -- are generally specific examples of some more general pattern already being used in the lexical scanner. That's almost certainly the case, because the lexical scanner must recognise every token in some way, so before a dynamically-generated word is added to the scanner, it must have been recognised as something else. Rather than trying to override that other pattern for the specific instance, we simply let the token be recognised and then correct its type (and possibly its value) before returning the token.

Here's a slightly modified version of the example from the Ply manual:

def t_ID(self, t):
    r'[a-zA-Z_][a-zA-Z_0-9]*'
    # Apparently case-insensitive recognition is desired, so we use
    # the lower-case version of the token as a lookup key. This means
    # that all the keys in the dictionary must be in lower-case.
    token = t.value.lower()
    if token in self.funcs:
        t.type = 'FUNC'
    return t

(You might want to adjust the above so that it does something with the value associated with the key in the funcs dictionary, although that could just as well be done later during semantic analysis.)
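
For example, the rule could rewrite the token's value so that the parser sees both the name and the mapped entry (just one possibility; the tuple shape here is made up):

def t_ID(self, t):
    r'[a-zA-Z_][a-zA-Z_0-9]*'
    token = t.value.lower()
    if token in self.funcs:
        t.type = 'FUNC'
        # Carry the dictionary entry along with the lower-cased name,
        # so the parser action does not have to repeat the lookup.
        t.value = (token, self.funcs[token])
    return t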

Since the funcs dictionary does not in any way participate in the generation of the lexer (or parser), no particular cleverness is needed in order to pass it into the Lexer object. Indeed, it does not even need to be in the lexer object; you could add the parser object to the lexer object when the lexer object is constructed, allowing you to put the dictionary into the parser object, where it is more accessible to parser actions.
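
A minimal sketch of that wiring (the class and attribute names are illustrative, not taken from the question):

import ply.lex as lex

class MyLexer:
    def __init__(self, funcs, parser=None):
        self.funcs = funcs    # the dynamic name -> value dictionary
        self.parser = parser  # optional back-reference, as described above
        # tokens, t_ID and the other rules are assumed to be defined on
        # this class as usual; lex.lex() reads them from the instance.
        self.lexer = lex.lex(object=self)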

One of the reasons that this is a much better solution than trying to build a customised regular expression is that it does not recognise reserved words which happen to be found as prefixes of non-reserved words. For example, if cos were one of the functions, and you had managed to produce the equivalent of

t_ID = r'[a-zA-Z_][a-zA-Z_0-9]*'

def t_FUNC(t):
    r'(?i)sin|cos|tan'
    # do something with the matched token, then
    return t

then you would find that:

cost = 3

was scanned as FUNC(cos), ID(t), '=', NUMBER(3), which is almost certainly not what you want. Putting the logic inside the t_ID function completely avoids this problem, since only complete tokens will be considered.
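
Here is a small self-contained demonstration of the t_ID approach (it assumes PLY is installed; the token names and the funcs dictionary are only for illustration):

import ply.lex as lex

tokens = ('FUNC', 'ID', 'NUMBER', 'EQUALS')

funcs = {'sin': None, 'cos': None, 'tan': None}

t_EQUALS = r'='
t_ignore = ' \t'

def t_NUMBER(t):
    r'\d+'
    t.value = int(t.value)
    return t

def t_ID(t):
    r'[a-zA-Z_][a-zA-Z_0-9]*'
    if t.value.lower() in funcs:
        t.type = 'FUNC'
    return t

def t_error(t):
    t.lexer.skip(1)

lexer = lex.lex()
lexer.input('cost = 3')
for tok in lexer:
    print(tok.type, tok.value)

# Output: ID cost / EQUALS = / NUMBER 3 -- "cost" is not split into
# FUNC(cos) followed by ID(t).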

rici