Elisp mechanism for converting PCRE regexps to emacs regexps

Question

I admit significant bias toward liking PCRE regexps much better than emacs regexps, if for no other reason than that when I type a '(' I pretty much always want a grouping operator. And, of course, \w and similar are SO much more convenient than the other equivalents.

But it would be crazy to expect to change the internals of emacs, of course. It should be possible, though, to convert from a PCRE expression to an emacs expression, I'd think, and do all the needed conversions so I can write:

(defun my-super-regexp-function ...
   (re-search-forward (pcre-convert "__\\w: \\d+")))

(or similar).

Anyone know of an elisp library that can do this?

Edit: Selecting a response from the answers below...

Wow, I love coming back from 4 days of vacation to find a slew of interesting answers to sort through! I love the work that went into the solutions of both types.

In the end, it looks like either the exec-a-script or the straight-elisp solution would work, but in terms of pure speed and "correctness" the elisp version is certainly the one that people would prefer (myself included).

Such a conversion doesn't seem too hard (though only the features supported by elisp regexps can be supported, not all of PCRE) and I too think it would be useful if capturing parens could be written without backslash. Why don't you start working on such a package? — Tom, Feb 02 '12 at 20:17
So do I. :) Though it may even be worth it to do it only for parens and |, because they are very frequent, so the backslashes are more annoying in these cases, and it may not be hard to do the replacements only for these as a start. — Tom, Feb 02 '12 at 20:55
There was a conversation on the emacs-devel list recently; the conclusion was "it is not worth it" or something like that. — kindahero, Feb 03 '12 at 00:26
Just curious (I don't know the PCREs): what do you mean by "\w and similar are SO much more convenient than the other equivalents"? — Thomas, Feb 03 '12 at 01:40
kindahero: Heh. I don't doubt "it's not the right way" was in an argument somewhere. — Wes Hardaker, Feb 03 '12 at 02:27
PCREs are Perl's regexps, basically. They're much more common these days and are used in almost everything *but* emacs. The most serious difference is the handling of () grouping: /([a-z]+|[0-9]+)/ in everything but emacs has to be rewritten as \([a-z]+\|[0-9]+\) in emacs. But there are other differences; e.g., \w matches any word character, which basically translates to [a-zA-Z0-9] (mumble; I forget exactly). In the end, PCREs are easier to write most of the time and require fewer \ characters unless you're matching a lot of parens. — Wes Hardaker, Feb 03 '12 at 02:33
Maybe you could add a different syntax to `re-builder`. I have no idea how hard this would be. The recent discussion on emacs-devel was about including PCRE itself in emacs, but the maintainers want to use a regex library with good asymptotics. At least that's what I recall. — Ivan Andrus, Feb 06 '12 at 18:56
Well, I've never done a bounty before and it's worth a badge... so, here we go! Anyone want to write a conversion function and publish it for 50 points :-) — Wes Hardaker, Feb 06 '12 at 19:19
@WesHardaker: Actually it's worse. A Perl regexp `([a-z]+|[0-9]+)` would in emacs be `\([a-z]+\|[0-9]+\)`. And obviously we are not even considering the absence of a literal-string syntax (→ `"\\([a-z]+\\|[0-9]+\\)"` here). I was just writing a regexp `"\\(\\w\\w\\)-\\(\\w\\w\\)-\\(\\w\\w\\)"` right before reading here ^^ — kdb, Jun 19 '13 at 14:58

score 29 · Accepted Answer · edited Nov 21 '17 at 22:34

https://github.com/joddie/pcre2el is the up-to-date version of this answer.

pcre2el or rxt (RegeXp Translator or RegeXp Tools) is a utility for working with regular expressions in Emacs, based on a recursive-descent parser for regexp syntax. In addition to converting (a subset of) PCRE syntax into its Emacs equivalent, it can do the following:

convert Emacs syntax to PCRE

convert either syntax to rx, an S-expression based regexp syntax

untangle complex regexps by showing the parse tree in rx form and highlighting the corresponding chunks of code

show the complete list of strings (productions) matching a regexp, provided the list is finite

provide live font-locking of regexp syntax (so far only for Elisp buffers – other modes on the TODO list)
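
For use from Lisp, here is a minimal sketch. It assumes pcre2el is installed and that its rxt-pcre-to-elisp function returns the translated string when called non-interactively (check the package README for your version). my-pcre-re-search-forward is a hypothetical helper name, not part of the package:

```elisp
;; Assumption: the pcre2el package is installed and provides
;; `rxt-pcre-to-elisp'.  NOERROR (the third argument) keeps `require'
;; from signalling when the package is missing.
(require 'pcre2el nil t)

(defun my-pcre-re-search-forward (pcre &optional bound noerror count)
  "Like `re-search-forward', but PCRE is written in PCRE syntax.
Hypothetical helper for illustration; not part of pcre2el."
  (unless (fboundp 'rxt-pcre-to-elisp)
    (error "pcre2el is not available"))
  (re-search-forward (rxt-pcre-to-elisp pcre) bound noerror count))
```

With that in place, the search from the question could be written as (my-pcre-re-search-forward "__\\w: \\d+"). The translation runs on every call, so for a hot loop it is probably better to convert once and keep the resulting string.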

The text of the original answer follows...

Here's a quick and ugly Emacs lisp solution (EDIT: now located more permanently here). It's based mostly on the description in the pcrepattern man page, and works token by token, converting only the following constructions:

parenthesis grouping ( .. )
alternation |
numerical repeats {M,N}
string quoting \Q .. \E
simple character escapes: \a, \c, \e, \f, \n, \r, \t, \x, and \ + octal digits
character classes: \d, \D, \h, \H, \s, \S, \v, \V
\w and \W left as they are (using Emacs' own idea of word and non-word characters)

It doesn't do anything with more complicated PCRE assertions, but it does try to convert escapes inside character classes. In the case of character classes including something like \D, this is done by converting into a non-capturing group with alternation.

It passes the tests I wrote for it, but there are certainly bugs, and the method of scanning token-by-token is probably slow. In other words, no warranty. But perhaps it will do enough of the simpler part of the job for some purposes. Interested parties are invited to improve it ;-)

(eval-when-compile (require 'cl))

(defvar pcre-horizontal-whitespace-chars
  (mapconcat 'char-to-string
             '(#x0009 #x0020 #x00A0 #x1680 #x180E #x2000 #x2001 #x2002 #x2003
                      #x2004 #x2005 #x2006 #x2007 #x2008 #x2009 #x200A #x202F
                      #x205F #x3000)
             ""))

(defvar pcre-vertical-whitespace-chars
  (mapconcat 'char-to-string
             '(#x000A #x000B #x000C #x000D #x0085 #x2028 #x2029) ""))

(defvar pcre-whitespace-chars
  (mapconcat 'char-to-string '(9 10 12 13 32) ""))

(defvar pcre-horizontal-whitespace
  (concat "[" pcre-horizontal-whitespace-chars "]"))

(defvar pcre-non-horizontal-whitespace
  (concat "[^" pcre-horizontal-whitespace-chars "]"))

(defvar pcre-vertical-whitespace
  (concat "[" pcre-vertical-whitespace-chars "]"))

(defvar pcre-non-vertical-whitespace
  (concat "[^" pcre-vertical-whitespace-chars "]"))

(defvar pcre-whitespace (concat "[" pcre-whitespace-chars "]"))

(defvar pcre-non-whitespace (concat "[^" pcre-whitespace-chars "]"))

(eval-when-compile
  (defmacro pcre-token-case (&rest cases)
    "Consume a token at point and evaluate corresponding forms.

CASES is a list of `cond'-like clauses, (REGEXP FORMS
...). Considering CASES in order, if the text at point matches
REGEXP then moves point over the matched string and returns the
value of FORMS. Returns `nil' if none of the CASES matches."
    (declare (debug (&rest (sexp &rest form))))
    `(cond
      ,@(mapcar
         (lambda (case)
           (let ((token (car case))
                 (action (cdr case)))
             `((looking-at ,token)
               (goto-char (match-end 0))
               ,@action)))
         cases)
      (t nil))))

(defun pcre-to-elisp (pcre)
  "Convert PCRE, a regexp in PCRE notation, into Elisp string form."
  (with-temp-buffer
    (insert pcre)
    (goto-char (point-min))
    (let ((capture-count 0) (accum '())
          (case-fold-search nil))
      (while (not (eobp))
        (let ((translated
               (or
                ;; Handle tokens that are treated the same in
                ;; character classes
                (pcre-re-or-class-token-to-elisp)   

                ;; Other tokens
                (pcre-token-case
                 ("|" "\\|")
                 ("(" (incf capture-count) "\\(")
                 (")" "\\)")
                 ("{" "\\{")
                 ("}" "\\}")

                 ;; Character class
                 ("\\[" (pcre-char-class-to-elisp))

                 ;; Backslash + digits => backreference or octal char?
                 ("\\\\\\([0-9]+\\)"
                  (let* ((digits (match-string 1))
                         (dec (string-to-number digits)))
                    ;; from "man pcrepattern": If the number is
                    ;; less than 10, or if there have been at
                    ;; least that many previous capturing left
                    ;; parentheses in the expression, the entire
                    ;; sequence is taken as a back reference.   
                    (cond ((< dec 10) (concat "\\" digits))
                          ((>= capture-count dec)
                           (error "backreference \\%s can't be used in Emacs regexps"
                                  digits))
                          (t
                           ;; from "man pcrepattern": if the
                           ;; decimal number is greater than 9 and
                           ;; there have not been that many
                           ;; capturing subpatterns, PCRE re-reads
                           ;; up to three octal digits following
                           ;; the backslash, and uses them to
                           ;; generate a data character. Any
                           ;; subsequent digits stand for
                           ;; themselves.
                           (goto-char (match-beginning 1))
                           (re-search-forward "[0-7]\\{0,3\\}")
                           (char-to-string (string-to-number (match-string 0) 8))))))

                 ;; Regexp quoting.
                 ("\\\\Q"
                  (let ((beginning (point)))
                    (search-forward "\\E")
                    (regexp-quote (buffer-substring beginning (match-beginning 0)))))

                 ;; Various character classes
                 ("\\\\d" "[0-9]")
                 ("\\\\D" "[^0-9]")
                 ("\\\\h" pcre-horizontal-whitespace)
                 ("\\\\H" pcre-non-horizontal-whitespace)
                 ("\\\\s" pcre-whitespace)
                 ("\\\\S" pcre-non-whitespace)
                 ("\\\\v" pcre-vertical-whitespace)
                 ("\\\\V" pcre-non-vertical-whitespace)

                 ;; Use Emacs' native notion of word characters
                 ("\\\\[Ww]" (match-string 0))

                 ;; Any other escaped character
                 ("\\\\\\(.\\)" (regexp-quote (match-string 1)))

                 ;; Any normal character
                 ("." (match-string 0))))))
          (push translated accum)))
      (apply 'concat (reverse accum)))))

(defun pcre-re-or-class-token-to-elisp ()
  "Consume the PCRE token at point and return its Elisp equivalent.

Handles only tokens which have the same meaning in character
classes as outside them."
  (pcre-token-case
   ("\\\\a" (char-to-string #x07))  ; bell
   ("\\\\c\\(.\\)"                  ; control character
    (char-to-string
     (- (string-to-char (upcase (match-string 1))) 64)))
   ("\\\\e" (char-to-string #x1b))  ; escape
   ("\\\\f" (char-to-string #x0c))  ; formfeed
   ("\\\\n" (char-to-string #x0a))  ; linefeed
   ("\\\\r" (char-to-string #x0d))  ; carriage return
   ("\\\\t" (char-to-string #x09))  ; tab
   ("\\\\x\\([A-Za-z0-9]\\{2\\}\\)"
    (char-to-string (string-to-number (match-string 1) 16)))
   ("\\\\x{\\([A-Za-z0-9]*\\)}"
    (char-to-string (string-to-number (match-string 1) 16)))))

(defun pcre-char-class-to-elisp ()
  "Consume the remaining PCRE character class at point and return its Elisp equivalent.

Point should be after the opening \"[\" when this is called, and
will be just after the closing \"]\" when it returns."
  (let ((accum '("["))
        (pcre-char-class-alternatives '())
        (negated nil))
    (when (looking-at "\\^")
      (setq negated t)
      (push "^" accum)
      (forward-char))
    (when (looking-at "\\]") (push "]" accum) (forward-char))

    (while (not (looking-at "\\]"))
      (let ((translated
             (or
              (pcre-re-or-class-token-to-elisp)
              (pcre-token-case              
               ;; Backslash + digits => always an octal char
               ("\\\\\\([0-7]\\{1,3\\}\\)"    
                (char-to-string (string-to-number (match-string 1) 8)))

               ;; Various character classes. To implement negative char classes,
               ;; we cons them onto the list `pcre-char-class-alternatives' and
               ;; transform the char class into a shy group with alternation
               ("\\\\d" "0-9")
               ("\\\\D" (push (if negated "[0-9]" "[^0-9]")
                              pcre-char-class-alternatives) "")
               ("\\\\h" pcre-horizontal-whitespace-chars)
               ("\\\\H" (push (if negated
                                  pcre-horizontal-whitespace
                                pcre-non-horizontal-whitespace)
                              pcre-char-class-alternatives) "")
               ("\\\\s" pcre-whitespace-chars)
               ("\\\\S" (push (if negated
                                  pcre-whitespace
                                pcre-non-whitespace)
                              pcre-char-class-alternatives) "")
               ("\\\\v" pcre-vertical-whitespace-chars)
               ("\\\\V" (push (if negated
                                  pcre-vertical-whitespace
                                pcre-non-vertical-whitespace)
                              pcre-char-class-alternatives) "")
               ("\\\\w" (push (if negated "\\W" "\\w") 
                              pcre-char-class-alternatives) "")
               ("\\\\W" (push (if negated "\\w" "\\W") 
                              pcre-char-class-alternatives) "")

               ;; Leave POSIX syntax unchanged
               ("\\[:[a-z]*:\\]" (match-string 0))

               ;; Ignore other escapes
               ("\\\\\\(.\\)" (match-string 0))

               ;; Copy everything else
               ("." (match-string 0))))))
        (push translated accum)))
    (push "]" accum)
    (forward-char)
    (let ((class
           (apply 'concat (reverse accum))))
      (when (or (equal class "[]")
                (equal class "[^]"))
        (setq class ""))
      (if (not pcre-char-class-alternatives)
          class
        (concat "\\(?:"
                class "\\|"
                (mapconcat 'identity
                           pcre-char-class-alternatives
                           "\\|")
                "\\)")))))

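To see what the converter produces, a small self-check along the following lines may help. This is a sketch, not the author's own test suite: the expected strings were traced by hand through the code above, and my-pcre-examples and my-pcre-check-examples are hypothetical names. The converter is passed in as an argument so the block stands on its own:

```elisp
(defvar my-pcre-examples
  '(("(a|b)\\d+" . "\\(a\\|b\\)[0-9]+")      ; grouping, alternation, \d
    ("\\Qa.b\\E" . "a\\.b")                  ; \Q .. \E quoting
    ("[\\d_]+"   . "[0-9_]+")                ; \d inside a character class
    ("[a\\D]"    . "\\(?:[a]\\|[^0-9]\\)"))  ; \D becomes a shy group
  "Alist of (PCRE . EXPECTED-ELISP) pairs, traced by hand.")

(defun my-pcre-check-examples (converter)
  "Return the entries of `my-pcre-examples' that CONVERTER gets wrong.
Each failure is a list (PCRE EXPECTED ACTUAL); nil means all passed."
  (let (failures)
    (dolist (example my-pcre-examples)
      (let ((actual (funcall converter (car example))))
        (unless (equal actual (cdr example))
          (push (list (car example) (cdr example) actual) failures))))
    (nreverse failures)))
```

After evaluating the answer's code, (my-pcre-check-examples #'pcre-to-elisp) is expected to return nil; any list it returns instead shows which pattern diverged and how.
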
Fantastic work! This is exactly the sort of thing that I think is the right way to go: a straight-elisp solution posted to github. (Honestly, if no one else owned up to the task, I was going to do it and post it to github myself!). I assume you, and hopefully others, will continue to tinker on it and it will live on! I'm not sure how you could speed up scanning token-by-token though. I've thought about it in the past, and I don't think there is an easy way around it. — Wes Hardaker, Feb 14 '12 at 05:32
@WesHardaker Thanks! I'm not sure how useful it would be to convert many more of PCRE's more complicated features, but I agree it's a hassle to always have to backslash groupings etc. So hopefully the script is useful at least from that point of view. — , Feb 27 '12 at 10:33
FYI, I've hacked on it a bit more to add some documentation to the top as well as a few replacement functions (I need to do more, and will slowly over time). https://gist.github.com/1878156 If you're going to continue maintaining it, I can send you pull requests from stuff I do. [maybe it needs its own full repo though?] — Wes Hardaker, Feb 27 '12 at 14:31
@WesHardaker A full repo sounds like a good idea. I'll set one up in the next day or two. — , Feb 28 '12 at 10:50
Thanks for creating the repo. You should edit your answer to point to it. — Wes Hardaker, Feb 28 '12 at 14:59

score 8 · Answer 2 · answered Feb 11 '12 at 10:00

I made a few minor modifications to a perl script I found on perlmonks (to take values from the command line) and saved it as re_pl2el.pl (given below). Then the following does a decent job of converting PCRE to elisp regexps, at least for the non-exotic cases that I tested.

(defun pcre-to-elre (regex)
  (interactive "MPCRE expression: ")
  (shell-command-to-string (concat "re_pl2el.pl -i -n "
                                   (shell-quote-argument regex))))

(pcre-to-elre "__\\w: \\d+") ;-> "__[[:word:]]: [[:digit:]]+"

It doesn't handle a few "corner" cases like perl's shy {N,M}? constructs, and of course not code execution etc. but it might serve your needs or be a good starting place for such. Since you like PCRE I presume you know enough perl to fix any cases you use often. If not let me know and we can probably fix them.

I would be happier with a script that parsed the regex into an AST and then spit it back out in elisp format (since then it could spit it out in rx format too), but I couldn't find anything doing that and it seemed like a lot of work when I should be working on my thesis. :-) I find it hard to believe that no one has done it though.
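
For readers who have not met rx: it ships with Emacs and builds a regexp string from an S-expression, so no backslashes are needed at all. A minimal sketch of roughly the question's pattern in that notation (my-label-regexp is a hypothetical name; note that rx's word class follows the buffer's syntax table, which is not identical to PCRE's \w):

```elisp
(require 'rx)

;; Roughly the question's PCRE __\w: \d+ written as an rx form.
;; `my-label-regexp' is a hypothetical name used only for this sketch.
(defconst my-label-regexp
  (rx "__" word ": " (one-or-more digit))
  "Regexp string built by `rx' when this form is evaluated.")

;; Returns the index where the match starts, or nil when nothing matches.
(string-match-p my-label-regexp "see __a: 42")
```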

Below is my "improved" version of re_pl2el.pl. -i means don't double escape for strings, and -n means don't print a final newline.

#! /usr/bin/perl
#
# File: re_pl2el.pl
# Modified from http://perlmonks.org/?node_id=796020
#
# Description:
#
use strict;
use warnings;

# version 0.4


# TODO
# * wrap converter to function
# * testsuite

#--- flags
my $flag_interactive; # true => no extra escaping of backslashes
if ( int(@ARGV) >= 1 and $ARGV[0] eq '-i' ) {
    $flag_interactive = 1;
    shift @ARGV;
}

if ( int(@ARGV) >= 1 and $ARGV[0] eq '-n' ) {
    shift @ARGV;
} else {
    $\="\n";
}

if ( int(@ARGV) < 1 ) {
    print "usage: $0 [-i] [-n] REGEX";
    exit;
}

my $RE='\w*(a|b|c)\d\(';
$RE='\d{2,3}';
$RE='"(.*?)"';
$RE="\0".'\"\t(.*?)"';
$RE=$ARGV[0];

# print "Perlcode:\t $RE";

#--- encode all \0 chars as escape sequence
$RE=~s#\0#\\0#g;

#--- substitute pairs of backslashes with \0
$RE=~s#\\\\#\0#g;

#--- hide escape sequences of \t,\n,... with
#    corresponding ascii code
my %ascii=(
       t =>"\t",
       n=> "\n"
      );
my $kascii=join "|",keys %ascii;

$RE=~s#\\($kascii)#$ascii{$1}#g;


#---  normalize needless escaping
# e.g.  from /\"/ to /"/, since it's no difference in perl
# but might confuse elisp

$RE=~s#\\"#"#g;

#--- toggle escaping of 'backslash constructs'
my $bsc='(){}|';
$RE=~s#[$bsc]#\\$&#g;  # escape them once
$RE=~s#\\\\##g;        # and erase double-escaping



#--- replace character classes
my %charclass=(
        w => 'word' ,   # TODO: emacs22 already knows \w ???
        d => 'digit',
        s => 'space'
       );

my $kc=join "|",keys %charclass;
$RE=~s#\\($kc)#[[:$charclass{$1}:]]#g;



#--- unhide pairs of backslashes
$RE=~s#\0#\\\\#g;

#--- escaping for elisp string
unless ($flag_interactive){
  $RE=~s#\\#\\\\#g; # ... backslashes
  $RE=~s#"#\\"#g;   # ... quotes
}

#--- unhide escape sequences of \t,\n,...
my %rascii= reverse %ascii;
my $vascii=join "|",keys %rascii;
$RE=~s#($vascii)#\\$rascii{$1}#g;

# print "Elispcode:\t $RE";
print "$RE";
#TODO whats the elisp syntax for \0 ???
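
The "toggle escaping" step of the script can also be done without leaving Emacs. A minimal sketch, assuming only the five operators in the script's $bsc variable need swapping; my-pcre-toggle-grouping is a hypothetical name, and like the script it does not treat character classes specially, so a ( inside [...] would be escaped incorrectly:

```elisp
(defun my-pcre-toggle-grouping (pcre)
  "Swap the escaped and unescaped forms of ( ) { } | in PCRE.
Hypothetical helper: an unescaped operator gains a backslash, an
escaped one loses it, and every other escape pair is copied as is."
  (let ((operators '(?\( ?\) ?{ ?} ?|))
        (len (length pcre))
        (i 0)
        (out '()))
    (while (< i len)
      (let ((ch (aref pcre i)))
        (cond
         ;; Backslash plus one more character: consume them as a pair,
         ;; so an escaped backslash never affects what follows it.
         ((and (eq ch ?\\) (< (1+ i) len))
          (let ((next (aref pcre (1+ i))))
            (push (if (memq next operators)
                      (string next)      ; PCRE \( is a literal ( in Emacs
                    (string ch next))
                  out))
          (setq i (+ i 2)))
         ;; Bare operator: Emacs needs the backslash form.
         ((memq ch operators)
          (push (string ?\\ ch) out)
          (setq i (1+ i)))
         (t
          (push (string ch) out)
          (setq i (1+ i))))))
    (apply #'concat (nreverse out))))
```

For example, (my-pcre-toggle-grouping "(a|b)\\(") yields "\\(a\\|b\\)(". Walking the string pair by pair plays the same role as the script's trick of hiding doubled backslashes behind \0 before substituting.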

Very cool script, and certainly handy to have around. I'd have taken this solution as the bounty-winner if there wasn't another option, but because it needs to invoke an external script, I don't think it's the best choice if for no other reason than executing the external script is simply slower (and more memory consuming too, for that matter). Nicely done, though! I'll +1 the answer anyway. — Wes Hardaker, Feb 14 '12 at 05:29

score 2 · Answer 3 · answered Feb 07 '12 at 17:04

The closest previous work on this has been extensions to M-x re-builder; see

http://www.emacswiki.org/emacs/ReBuilder

or the work of Ye Wenbin on PDE.

http://cpansearch.perl.org/src/YEWENBIN/Emacs-PDE-0.2.16/lisp/doc/pde.html

ashawley

Although they're cool tools, they don't do what I'm asking for. They only work with emacs regexps. – Wes Hardaker Feb 07 '12 at 21:19
Ah, somehow I missed that that's what PDE did on my original reading of that message. I think it's workable, but I certainly think that running another perl process isn't necessarily the best approach. But I always love a good hack. At least until a cleaner solution comes along. – Wes Hardaker Feb 14 '12 at 05:33

score 0 · Answer 4 · answered Jan 21 '17 at 04:23

Possibly relevant is visual-regexp-steroids, which extends query-replace to use a live preview and allows you to use different regexp backends, including PCRE.

Resigned June 2023
