Highlight a bunch of words?

Question

I'm trying to highlight a bunch of words, so I've written a Pygments extension. It basically works, but not to my satisfaction.

Here's a simple idea that should work: highlight the listed words with their own token types, and emit all other text that doesn't match these words as Text. But this hangs:

from pygments.lexer import RegexLexer
from pygments.token import *

class HotKeyPoetry(RegexLexer):
    name = 'HotKeyPoetry'
    aliases = ['HotKeyPoetry']
    filenames = ['*.hkp']

    tokens = {
        'root': [

            (r'\bAlt\b', Generic.Traceback),
            (r'\bShft\b', Name.Variable),
            (r'\bSpc\b', Operator),
            (r'\bCtrl\b', Keyword.Type),
            (r'\bRet\b', Name.Label),
            (r'\bBkSpc\b', Generic.Inserted),
            (r'\bTab\b', Keyword.Type),
            (r'\bCpsLk\b', String.Char),
            (r'\bNmLk\b', Generic.Output),
            (r'\bScrlLk\b', String.Double),
            (r'\bPgUp\b', Name.Attribute),
            (r'\bPgDwn\b', Name.Builtin),
            (r'\bHome\b', Number.Oct),
            (r'\bEnd\b', Name.Constant),
            (r'\bDel\b', Name.Decorator),
            (r'\bIns\b', Number.Integer.Long),
            (r'\bWin\b', Name.Builtin.Pseudo),
            (r'\bF1?[1-9]\b', Name.Function),

            (r'(?!\b(Alt|Shft|Spc|Ctrl|Ret|BkSpc|Tab|CpsLk|NmLk|ScrlLk|PgUp|PgDwn|Home|End|Del|Ins|Win|F5)\b)', Text),

        ]
    }

Maybe I should use a different lexer for the job?

Edit 1

So

r"(.+?)(?:$|\b(?=(Alt|Shft|Spc|Ctrl|Ret|BkSpc|Tab|CpsLk|NmLk|ScrlLk|PgUp|PgDwn|Home|End|Del|Ins|Win|F[12]?[1-9])\b))"

is the excluding regexp I've been looking for.

Now I'm trying to make # a comment character, so that everything after it on the same line is a comment. I've tried:

r"(.+?)(?:$|#.*$|\b(?=(Alt|Shft|Spc|Ctrl|Ret|BkSpc|Tab|CpsLk|NmLk|ScrlLk|PgUp|PgDwn|Home|End|Del|Ins|Win|F[12]?[1-9])\b))"

and

r"([^#]+?)(?:$|\b(?=(Alt|Shft|Spc|Ctrl|Ret|BkSpc|Tab|CpsLk|NmLk|ScrlLk|PgUp|PgDwn|Home|End|Del|Ins|Win|F[12]?[1-9])\b))"

followed by

 (r'#.*$', Comment),

I've also tried adding a second state:

'comment': [ 
      (r'#.*$', Comment),
],

-- but nothing works.

Edit 2

The complete working Pygments extension (a Python package) is here. You can get it and run

python setup.py build
python setup.py install --user

to register it with Pygments. You can then test it with:

pygmentize -f html -O full -o test.html test.hkp

or specify a language:

pygmentize -f html -O full -l HotKeyPoetry -o test.html test.hkp

Here's a sample test.hkp:

Ctrl-Alt-{Home/End} ⇒ {beginning/end}-of-visual-line
Ctrl-Alt-{b/↓/↑} ⇒ {set/goto next/goto previous} bookmark # I have it in okular and emacs
Alt-{o/O} ⇒ switch-to-buffer{/-other-window}
Ctrl-{o/O} ⇒ find-file{/-other-window}
Ctrl-x o ⇒ ergo-undo-close-buffer # it uses ergoemacs' recently-closed-buffers
Ctrl-Alt-O ⇒ find-alternate-file

(Comments are not really useful for hot keys, but I need them for PyMOL.)

Does the regex have to match at least one character? Perhaps the problem is that the last regex matches an empty string, so no characters are 'consumed', and it never advances. – MRAB Aug 16 '12 at 15:24
Maybe you're right. Actually, I thought that the last regex had to match one of the words specified in ( | ). I'll check whether it matches the empty string. – Adobe Aug 16 '12 at 16:19
Really what I meant was that as the `(?!...)` is a negative lookahead, it'll never consume any characters. – MRAB Aug 16 '12 at 16:23
Re the edit: Your wildcard rule is probably getting run before the comment rule. See my updated answer. – alexis Oct 22 '12 at 13:29

alexis · Answer 1 · 2012-10-25T14:17:19.683

4

1) You misunderstand how the (?! works: It doesn't match text. Your last RE (in the original code block) matches at a position that is not followed by any of the words you list. But it matches zero characters of text, so there's nothing to color and you don't move forward.

What you really meant is this: \b(?!(?:Alt|Shft|etc)\b)\w+\b. (Match any word \w+ between \bs, but not if the first \b is followed by any of the keywords)
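
The difference is easy to check with Python's re module. This is only a minimal sketch: the key list is shortened for illustration, and the names KEYS, empty_rule and word_rule are made up for this example.

```python
import re

# Shortened key list, for illustration only (not the full list from the question).
KEYS = r"(?:Alt|Shft|Ctrl|Tab)"

# Negative lookahead on its own: it succeeds at a position but consumes nothing.
empty_rule = re.compile(r"(?!\b" + KEYS + r"\b)")

# Lookahead followed by \w+: it consumes one whole word that is not a key name.
word_rule = re.compile(r"\b(?!" + KEYS + r"\b)\w+\b")

text = "press Ctrl now"

m = empty_rule.match(text)
print(repr(m.group(0)), m.end())   # '' 0

m = word_rule.match(text)
print(repr(m.group(0)), m.end())   # 'press' 5

# At offset 6 the next word is "Ctrl", so the word rule refuses to match.
print(word_rule.match(text, 6))    # None
```

A scanner that keeps applying the first pattern stays at offset 0 forever, which is consistent with the hang described in the question.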

2) About matching comments: Based on the pygments documentation, your expression (r'#.*$', Comment) ought to work. Or, in the style used in the examples:

(r'#.*\n', Comment),

3) You only need one state, so add the comment rule to the root state. Multiple states are for when you have different syntax in different places, e.g. if you have mixed HTML and PHP, or if you want to highlight the SQL inside a Python string.

4) Your rules need to match everything in your input. Rules are tried in order until one works, so instead of trying to write a rule that does not match keywords, you can put this wildcard as your last rule:

(r'(?s).', Text),

It will advance one character at a time until you get to something your other rules can match. To repeat: Remove your long rule that matches non-keywords, and use the above instead.
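
To see why the one-character fallback is enough, here is a small emulation of the "try the rules in order at the current position" loop, written with plain re. It is not Pygments itself: RULES, scan and merged are hypothetical names, the key list is shortened, and the labels are plain strings rather than Pygments token types. re.MULTILINE is set on the comment rule by hand so that $ matches at the end of each line.

```python
import re

# Hypothetical, shortened rule table: (pattern, label), tried in this order.
RULES = [
    (re.compile(r"\b(?:Alt|Ctrl|Home|End)\b"), "Key"),
    (re.compile(r"#.*$", re.MULTILINE), "Comment"),
    (re.compile(r"(?s)."), "Text"),  # fallback: exactly one character, newline included
]

def scan(text):
    """Yield (label, value) pairs by trying each rule at the current offset."""
    pos = 0
    while pos < len(text):
        for pattern, label in RULES:
            m = pattern.match(text, pos)
            if m:
                break
        # The fallback always matches one character, so m is never None here.
        yield label, m.group(0)
        pos = m.end()

def merged(tokens):
    """Join neighbouring Text tokens so the output is easier to read."""
    out = []
    for label, value in tokens:
        if out and out[-1][0] == label == "Text":
            out[-1] = (label, out[-1][1] + value)
        else:
            out.append((label, value))
    return out

sample = "Ctrl-Alt-b # bookmark\nEnd"
for label, value in merged(scan(sample)):
    print(label, repr(value))
```

Because every rule consumes at least one character, the loop always advances, and joining all the values gives back the input unchanged. The comment rule swallows the rest of its line in one token, so key names inside a comment are not highlighted separately.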

edited Oct 25 '12 at 14:17

answered Aug 16 '12 at 21:05


You're right, `(?!` doesn't match text. `\b(?!(?:Alt|Shft|etc)\b)\w+\b` is very well crafted, but it leaves whitespace and punctuation unmatched. Changing `\w` to `.` breaks the thing. Anyway, I've read your answer several times, have read about non-capturing `(?:`, and I thank you for the answer. – Adobe Aug 17 '12 at 10:29
Yes, of course it leaves out whitespace and punctuation since all your code is token based. Since you've got a program that you like in the other answer, I won't ask what exactly you wanted to match. – alexis Aug 19 '12 at 19:26
But what does `r'(?s).'` mean? `s` is not a charclass. What is `s` here? – Adobe Oct 23 '12 at 08:08
Look it up, guy, [look it up.](http://docs.python.org/dev/library/re.html#regular-expression-syntax) It makes a dot match newlines along with other characters. – alexis Oct 23 '12 at 12:36
Oh. OK then. I thought I saw somewhere that Pygments has `re.DOTALL` on by default. But I'm not sure. – Adobe Oct 23 '12 at 15:23
Not quite, according to [this](http://pygments.org/docs/lexerdevelopment/) it's got `re.MULTILINE` on by default, which affects `^` and `$`. I always mix them up too. – alexis Oct 24 '12 at 14:50
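
The two flags are easy to tell apart with a quick check in plain re (a sketch only; the sample text and variable names are made up). re.MULTILINE changes where ^ and $ can match, while (?s), i.e. re.DOTALL, changes what the dot can match.

```python
import re

text = "Ctrl-x o # first\nAlt-o"

# Without MULTILINE, $ only matches at the very end of the string
# (or just before a final newline), so the comment on line 1 is missed.
no_flag = re.findall(r"#.*$", text)
multiline = re.findall(r"#.*$", text, re.MULTILINE)
print(no_flag)      # []
print(multiline)    # ['# first']

# Without DOTALL, the dot refuses to match a newline.
plain_dot = re.match(r".", "\n")
dotall_dot = re.match(r"(?s).", "\n")
print(plain_dot)                 # None
print(dotall_dot is not None)    # True
```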

score 3 · Accepted Answer · answered Aug 16 '12 at 21:54

Yes, the final regex isn't actually matching any characters. I tried this code:

import re

# Key names shared by both regexes.
KEYS = r"(Alt|Shft|Spc|Ctrl|Ret|BkSpc|Tab|CpsLk|NmLk|ScrlLk|PgUp|PgDwn|Home|End|Del|Ins|Win|F1?[1-9])"

regexes = {
    # Shortest run of text up to the end of input or the start of a key name.
    "text": re.compile(r"(.+?)(?:$|\b(?=" + KEYS + r"\b))"),
    # A key name at the current position.
    "kwd": re.compile(KEYS + r"\b"),
}

def tokenise(state):
    # Alternate between the two modes until the source is used up.
    # Note: this assumes single-line input. "." does not match a newline here,
    # so a leading "\n" would match neither regex and the loop would never end.
    while state["src"]:
        state["tok"] = "text" if state["tok"] == "kwd" else "kwd"
        #print("mode: {0:20} {1!r}".format(state["tok"].capitalize(), state["src"]))

        m = regexes[state["tok"]].match(state["src"])
        if m:
            match = m.group(0)
            state["src"] = state["src"][m.end():]
            yield "TOKEN({0}, {1!r})".format(state["tok"], match)

samples = [
    "A thing that, Tab, is AltCps or 'Win'. F8 is good, as is: F13.",
    "Alt thing that, Tab, is AltCps or 'Win'. F8 is good, as is: F13.",
    "Alt thing that, Tab, is AltCps or 'Win'. F8 is good, as is: F11",
]

for src in samples:
    state = {"src": src, "tok": "text"}
    print(repr(state["src"]))
    print("\n".join(tokenise(state)))
    print()

And it works for the cases I tested; the text regex in your code looks good :)

Wow! `r"(.+?)(?:$|\b(?=(Alt|Shft|Spc|Ctrl|Ret|BkSpc|Tab|CpsLk|NmLk|ScrlLk|PgUp|PgDwn|Home|End|Del|Ins|Win|F[12]?[1-9])\b))"` works. I see you are good at Pygments. (I could read the docs, but couldn't quite get what `state` means. But I understood that in Pygments one has to match all the text.) – Adobe Aug 17 '12 at 10:33
`state` is basically just to keep track of a) what's left to parse and b) which type of token we look for next. This example should always alternate between matching a `text` token and a `kwd` token. – spiralx Aug 21 '12 at 13:29
Do you think there's room for comments? I'm trying to make `#` a comment character. I've tried `r"(.+?)(?:$|#.*$|\b(?=(Alt|Shft|Spc|Ctrl|Ret|BkSpc|Tab|CpsLk|NmLk|ScrlLk|PgUp|PgDwn|Home|End|Del|Ins|Win|F[12]?[1-9])\b))"` and `r"([^#]+?)(?:$|\b(?=(Alt|Shft|Spc|Ctrl|Ret|BkSpc|Tab|CpsLk|NmLk|ScrlLk|PgUp|PgDwn|Home|End|Del|Ins|Win|F[12]?[1-9])\b))"` followed by `(r'#.*$', Comment),`. I've also tried adding a second state, `'comment': [ (r'#.*$', Comment), ],`, but nothing works. – Adobe Oct 15 '12 at 09:56
