
The regex [a-zA-Z-]+:: matches strings like one-name::.

I would like a regex that will do the opposite. Is there an "automated" way to build such a regex?

I can't just check the failure of the first regex, as I have to use a "negated" regex with lark.


Context of use

My question comes from the code below, which doesn't work because the line pattern must also reject anything that matches blockname.

TEXT = """

// Comment #1

abrev::
    ok = ok

=========
One title
=========

Bla, bla
Bli, Bli

verbatim::
    OK, or not ok ?
    That is...

    ...the question.

Blu, Blu


==
10
==

// Comment #2
Blo, blo




==
05
==
    """

from lark import Lark

GRAMMAR = r"""
?start: _NL* (heading | comments | block)*

heading : ruler _NL title _NL ruler _NL+ (block | comments | paragraph)*
ruler   : /={2,}/
title   : /[^\n={2}\/{2}]+/

comments : "//" cline _NL*

paragraph : (line _NL)+

block : blockname _SINGLE_NL (tline _NL)+
blockname : /[a-zA-Z-]+::/

tline : "    " /[^\n]+/
cline : /[^\n]+/
line  : /[^\n={2}\/{2}]+/

_NL        : /(\r?\n[\t ]*)+/
_SINGLE_NL : /([\t ]*\r?\n)/
"""

parser = Lark(GRAMMAR)

tree = parser.parse(TEXT)

print(tree.pretty())

This produces the following incorrect output.

start
  comments
    cline        Comment #1
  block
    blockname   abrev::
    tline       ok = ok
  heading
    ruler       =========
    title       One title
    ruler       =========
    paragraph
      line      Bla, bla
      line      Bli, Bli
      line      verbatim::             <<< BAD
      line      OK, or not ok ?        <<< BAD
      line      That is...             <<< BAD
      line      ...the question.       <<< BAD
      line      Blu, Blu               <<< BAD
  heading
    ruler       ==
    title       10
    ruler       ==
    comments
      cline      Comment #2
    paragraph
      line      Blo, blo
  heading
    ruler       ==
    title       05
    ruler       ==

I would like to obtain something like:

start
  comments
    cline        Comment #1
  block
    blockname   abrev::
    tline       ok = ok
  heading
    ruler       =========
    title       One title
    ruler       =========
    paragraph
      line      Bla, bla
      line      Bli, Bli
    block                              <<< GOOD 
      blockname verbatim::             <<< GOOD
      tline     OK, or not ok ?        <<< GOOD
      tline     That is...             <<< GOOD
      tline     ...the question.       <<< GOOD
    paragraph                          <<< GOOD
      line      Blu, Blu               <<< GOOD
  heading
    ruler       ==
    title       10
    ruler       ==
    comments
      cline      Comment #2
    paragraph
      line      Blo, blo
  heading
    ruler       ==
    title       05
    ruler       ==
projetmbc
  • is it sufficient to check `re` didn't match? (`is None`) – ti7 Jan 19 '23 at 22:25
  • @ti7 The problem is that I need to use it with `lark`. I have updated my question. – projetmbc Jan 19 '23 at 22:39
  • What exactly is "the opposite"? Are you seeking to match any line whatsoever which doesn't match the blockname pattern? Or any line which contains a prefix which doesn't match the blockname pattern? Or any word which is not followed by `::`? – rici Jan 20 '23 at 16:42
  • Anyway, the usual solution to this sort of problem is to use explicit lexer priorities. Lark's priority rules often produce undesired results. – rici Jan 20 '23 at 16:48
  • @rici Here `lark` sees a `line` but not a `blockname` because of the fragility of my rule. If I need to work with another, less fragile tool, I am ready to do that as long as I can produce `Python` code to analyze my DSL. Any advice is welcome. – projetmbc Jan 20 '23 at 16:52

1 Answer

This is probably not the best solution to this parsing problem, but it's relatively easy. Since the regexes here are Python regexes, you are free to use negative lookahead assertions, so you can put (?![a-zA-Z-]+::) at the beginning of your line pattern to avoid having that pattern match a blockname.
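The effect of that lookahead can be checked with the standard `re` module alone, outside of lark. The following is only a minimal sketch: `NOT_BLOCKNAME` and the sample strings are made up for illustration, and `[^\n]+` is a simplified stand-in for the real line pattern.

```python
import re

# Hypothetical stand-in for the 'line' terminal: any non-empty run of
# non-newline characters, guarded by a negative lookahead that refuses
# to start where a blockname would start.
NOT_BLOCKNAME = re.compile(r"(?![a-zA-Z-]+::)[^\n]+")

samples = ["one-name::", "verbatim::", "Bla, bla", "ok = ok"]

# fullmatch returns None when the whole string is not accepted.
results = {s: NOT_BLOCKNAME.fullmatch(s) is not None for s in samples}

for sample, accepted in results.items():
    print(f"{sample!r}: {'accepted' if accepted else 'rejected'}")
```

Note that the lookahead only inspects the start of the string, so a line such as `name:: trailing text` would be rejected as well. Whether that is desirable depends on the intended syntax.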

But the rest of that pattern is not doing what you think it is doing; a character class (inside [ and ]) is just a set of characters; it cannot contain subpatterns like ={2}. All that means inside a character class is "one of the characters =, {, 2 or }". Again, negative lookahead assertions are probably what you want. I changed line to this:

line  : /(?![a-zA-Z-]+::)((?!==|\/\/).)+/

and it more or less worked, except that your _NL pattern absorbs whitespace after the matched \n, which means that it will absorb the indent at the beginning of the second line in the block, so that line won't be matched as a tline. You really need to rethink your whitespace matching strategy, but that's a different question.
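Both points can be reproduced with plain `re`, without running the grammar. This is a rough sketch: the three patterns are copied from the question and from the line rule above, while the helper `accepts` and the sample strings are invented for the demonstration.

```python
import re

# Patterns copied from the grammar; only the variable names are new.
OLD_LINE = re.compile(r"[^\n={2}\/{2}]+")                   # original 'line'
NEW_LINE = re.compile(r"(?![a-zA-Z-]+::)((?!==|\/\/).)+")   # revised 'line'
NL = re.compile(r"(\r?\n[\t ]*)+")                          # '_NL' terminal


def accepts(pattern, text):
    """Return True when the pattern matches the whole of text."""
    return pattern.fullmatch(text) is not None


# The character class excludes the single characters =, {, 2, } and /,
# so ordinary text containing any one of them is refused.
old_results = [accepts(OLD_LINE, s) for s in ("ok = ok", "Room 2", "Bla, bla")]

# The lookahead version only refuses the sequences == and //, plus
# anything that starts like a blockname.
new_results = [
    accepts(NEW_LINE, s)
    for s in ("ok = ok", "Room 2", "verbatim::", "// Comment #2")
]

# _NL swallows the indent that follows the newline, so nothing is left
# for the four-space prefix that tline requires.
block_body = "    OK, or not ok ?\n    That is..."
newline = NL.search(block_body)
remainder = block_body[newline.end():]

print(old_results)      # [False, False, True]
print(new_results)      # [True, True, False, False]
print(repr(remainder))  # 'That is...'
```

The last line shows why the second line of a block cannot be a tline: by the time the lexer reaches it, the leading spaces have already been consumed by _NL.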

rici
  • You are right about the several errors. I am just starting to play with grammars and `lark`. I will fix this now. – projetmbc Jan 20 '23 at 19:27
  • You say : "This is probably not the best solution to this parsing problem". Can you give me some advice in relation to this comment? – projetmbc Jan 20 '23 at 19:53
  • Are these grammars https://raw.githubusercontent.com/ligurio/lark-grammars/master/lark_grammars/grammars/mdoc.lark and https://github.com/ligurio/lark-grammars/blob/master/lark_grammars/grammars/yaml.lark good starting points? – projetmbc Jan 20 '23 at 22:47
  • @projetmbc: I think you could come up with a more precise set of lexical patterns which didn't require multiple scans of the input. But it would depend on the precise specification of the input language, which I could only guess at. I suspect that the `==` and `//` restrictions should only apply at the beginning of the line, for example, but I was going by what I guessed you meant by that regex. If they only apply at the beginning of the line, checking for them at each position is wrong. The negative lookahead assertion is just inefficient, but not horribly so. – rici Jan 21 '23 at 01:44
  • But what I had in mind is that the problem here is less about matching some formal language, and more about recognising lexical clues. If there is some point at which successive lines must have the same indent but it's not known what that indent is, then the language is not actually context-free and some other ad hoc parsing mechanism is required. Really, your first step is to try to document as accurately as possible the precise syntax you are trying to match. With that in hand, the parsing problem is, at least, clearer. – rici Jan 21 '23 at 01:48
  • With respect to the grammars you discovered on ligurio's github repository, I don't know. There is no quality control mechanism; they could be brilliant or they could be garbage, or anywhere in between. And I'm afraid I'm not interested enough to do any sort of testing. But testing is precisely what is necessary. The only tests I see are based on deducing valid inputs from the grammars themselves, which I don't find particularly convincing since it cannot validate that the grammar matches the intended language. But you'll have to do your own due diligence, I'm afraid. – rici Jan 21 '23 at 01:58
  • Thank you for your comments. I will work more seriously on my DSL. If necessary, I will come back here with specific questions. – projetmbc Jan 21 '23 at 09:34