ANTLR4: How can I create a regular expression that allows everything except these two statements? / How to delete part of an input

Question

I saw this question: ANTLR4 How can I create a regular expression that allows all characters except two selected ones?

Because of this, I thought a bit about ANTLR4 (I used ANTLR years ago).

Now I have another question. For example, suppose we have:

A: [a-z]+'ug';
B: [A-Z][a-z]+;
C:

Now I want C to recognize all characters that do not belong to A or B.

How could we do this? What would be the correct regex here?

   C: ~[a-zugA-za-z]

That would be wrong, wouldn't it?

I thought about it a lot, but without success.

And another question, just out of interest.

Suppose I wanted ANTLR to handle the following for me. I have, for example, this input:

thisisonlyatest/*oidjqiodjqw*/test

Now I want to delete everything from /* to */, so that the result is only:

thisisonlyatesttest

How could we do that?

Or, for example, if the input were:

thisisonlyatest/*oidjqiodjqw*/test
another line /*kjdqio*/ another text

the result would be:

thisisonlyatest test
another line another text

I thought we could write:

A: ('/*'(.)*'*/')
B: ~A

but it did not work.

The two questions cannot be answered with one answer ... please post two separate questions – jsotola May 03 '22 at 01:40

Bart Kiers · Accepted Answer · 2022-05-03T12:18:53.647

The character set [a-za-z] is exactly the same as [a-z]. Adding ug to a set that already contains a-z, as in [a-zug], is unnecessary because the range a-z already includes both u and g. So C would just be:

C: ~[a-zA-Z]; // note that `A-z` must be `A-Z`

W.r.t.:

A: ('/*'(.)*'*/');
B: ~A; // No, this is incorrect!

As I mentioned in the other question: you cannot negate a rule (A in this case) that matches multiple characters. You can only negate rules that match a single character.
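To illustrate the restriction, here is a minimal sketch of negations that ANTLR4 should accept. The grammar and rule names are made up for this example; the point is that each negated operand matches exactly one character.

```antlr
lexer grammar NegationDemo; // hypothetical grammar name

NOT_LETTER : ~[a-zA-Z] ;      // any single character that is not an ASCII letter
NOT_STAR   : ~'*' ;           // any single character except '*'
NOT_DIGIT  : ~('0'..'9') ;    // any single character outside the range 0-9
NOT_DELIM  : ~('*' | '/') ;   // any single character that is neither '*' nor '/'
```

Something like `~('/*' .* '*/')` has no such single-character operand, which is why `B: ~A;` is rejected.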

If you want to skip comments, just do:

COMMENT
 : '/*' .*? '*/' -> skip
 ;
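A slightly fuller sketch of how this rule could sit in a lexer for the example in the question. The grammar name and the TEXT rule are assumptions added for illustration; only COMMENT comes from the answer. The non-greedy `.*?` matters here: a greedy `.*` would run on to the last `*/` in the input and swallow any text between two comments.

```antlr
lexer grammar StripComments; // hypothetical grammar name

// Matches from '/*' up to the nearest '*/' and discards the match.
COMMENT
 : '/*' .*? '*/' -> skip
 ;

// Hypothetical rule: everything else, split at '/' so that COMMENT
// gets a chance to start at every slash.
TEXT
 : ~[/]+
 | '/'
 ;
```

With this sketch, the input thisisonlyatest/*oidjqiodjqw*/test should produce two TEXT tokens, thisisonlyatest and test. Joining the text of the remaining tokens then gives thisisonlyatesttest.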

EDIT

"For example, you have the input AA: this should also be matched by C, but it would not work."

That is correct: you said you wanted C to recognize "all characters that do not belong to A or B". What you're looking for is probably just this:

A: [a-z]+ 'ug';
B: [A-Z][a-z]+;
C: [a-zA-Z]+;

That way, a run of lowercase letters ending in "ug" becomes an A token, an uppercase letter followed by lowercase letters becomes a B token, and every other run of letters becomes a C token.
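Putting the three rules into one lexer shows how ANTLR4 chooses between them: the rule that matches the most characters wins, and on a tie the rule listed first wins. The grammar name and the WS and OTHER rules are assumptions added for illustration, so that whitespace and symbols, which none of A, B, or C match, do not cause token recognition errors.

```antlr
lexer grammar Words; // hypothetical grammar name

A : [a-z]+ 'ug' ;     // e.g. "bug", "shrug" (ties with C, listed first)
B : [A-Z] [a-z]+ ;    // e.g. "Hello" (ties with C, listed first)
C : [a-zA-Z]+ ;       // any other run of letters, e.g. "AA" or "test"

WS    : [ \t\r\n]+ -> skip ; // hypothetical: drop whitespace
OTHER : ~[a-zA-Z] ;          // hypothetical catch-all: one token per remaining character
```

The order matters: if C were listed before A and B, it would win every tie and A and B tokens would never be produced.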

edited May 03 '22 at 12:18

answered May 03 '22 at 06:16

Bart Kiers

Thank you! But I think C is wrong. For example, you have the input AA: this should also be matched by C, but it would not work, because AA would not be correct for A, and also not for B. – kaden42 May 03 '22 at 08:52
No, in my example C will always match a single char, not AA. You don't quite understand how ANTLR's lexer rules work. I recommend doing some tutorials before going forward. – Bart Kiers May 03 '22 at 11:50
Thank you, but [a-zA-z]+ will not recognize an input like &/)(), will it? Symbols will not be recognized? – kaden42 May 03 '22 at 19:33
No, `[a-zA-z]` only matches the ASCII range `A` to `z` (hex `0x41` to `0x7A`). – Bart Kiers May 03 '22 at 19:40
