0

I'm writing a ST3 snippet that inserts a \subsection{} with a label. The label is created by converting the header text to conform with the LaTeX standards for labels using a (rather lengthy) regular expression:

${1/(?:([ \t_]+)?|\b)(?:([ÅÄÆÁÀÃ])?|\b)(?:([åäæâàáã])?|\b)(?:([ÉÈÊË])?|\b)(?:([éèëê])?|\b)(?:([ÌÌÎÏ])?|\b)(?:([íìïî])?|\b)(?:([Ñ])?|\b)(?:([ñ])?|\b)(?:([ÖØÓÒÔÖÕ])?|\b)(?:([öøóòôõ])?|\b)(?:([ÜÛÚÙ])?|\b)(?:([üûúù])?|\b)/(?1:-)(?2:A)(?3:a)(?4:E)(?5:e)(?6:I)(?7:i)(?8:N)(?9:n)(?10O)(?11:o)(?12:U)(?13:u)/g}

Actually, I would like for it to be even longer. But if I add the extra groups that I would like, then ST3 crashes when I execute the snippet.

${1/(?:([ \t_]+)?|\b)(?:([ÅÄÆÁÀÃ])?|\b)(?:([åäæâàáã])?|\b)(?:([Ç])?|\b)(?:([ç])?|\b)(?:([ÉÈÊË])?|\b)(?:([éèëê])?|\b)(?:([ÌÌÎÏ])?|\b)(?:([íìïî])?|\b)(?:([Ñ])?|\b)(?:([ñ])?|\b)(?:([ÖØÓÒÔÖÕ])?|\b)(?:([öøóòôõ])?|\b)(?:([ÜÛÚÙ])?|\b)(?:([üûúù])?|\b)(?:([Ý])?|\b)(?:([ÿý])?|\b)/(?1:-)(?2:A)(?3:a)(?4:C)(?5:c)(?6:E)(?7:e)(?8:I)(?9:i)(?10:O)(?11:o)(?12:N)(?13:n)(?14:U)(?15:u)(?16:Y)(?17:y)/g}

Is there any more efficient way of doing this? Preferably one that won't cause ST3 to crash ;)

Edit: Here are some example strings:

Flygande bæckasiner søka hwila på mjuka tuvor
Åke Staël hade en överflödig idé 

And the results (with the current, working regex):

Flygande-backasiner-soka-hwila-pa-mjuka-tuvor
Ake-Stael-hade-en-overflodig-ide

But I would like to also replace the characters (ÇçÝÿý) with their unaccented counterparts (CcYyy) so that e.g.

Comment ça va

becomes

Comment-ca-va
Community
  • 1
  • 1
Fredrik P
  • 682
  • 1
  • 8
  • 21
  • `(?:([ÅÄÆÁÀÃ])?|\b)` is an always true pattern because the first alternative can be empty, remove the useless `?` quantifier. Give several example strings and expected results. – Casimir et Hippolyte Mar 16 '15 at 21:07

1 Answers1

1

I don't know this syntax, but I suspect that the problem comes from the too many optional groups combined with a lot of alternatives that cause a too complex processing.

So you can try to design your pattern like this, and you can add other groups of letters in the same way (take a look at the unicode table to find character ranges):

${1/([ \t_]+)|([À-Å])|([à-å])|([È-Ë])|([è-ë])|([Ì-Ï])|([ì-ï])|([Ò-ÖØ])|([ò-öø])|([Ù-Ü])|([ù-ü])|(Æ)|(æ)|(Œ)|(œ)|(Ñ)|(ñ)/(?1:-)(?2:A)(?3:a)(?4:E)(?5:e)(?6:I)(?7:i)(?8:O)(?9:o)(?10:U)(?11:u)(?12:AE)(?13:ae)(?14:OE)(?15:oe)(?16:N)(?17:n)/g}

if the lookahead feature is available you can improve this pattern to prevent non-accented characters to be tested with each alternatives:

${1/(?=[ \t_À-ÆÈ-ÏÑ-ÖØ-Üà-æè-ïñ-öø-üŒœ])(?:([ \t_]+)|([À-Å])|([à-å])|([È-Ë])|([è-ë])|([Ì-Ï])|([ì-ï])|([Ò-ÖØ])|([ò-öø])|([Ù-Ü])|([ù-ü])|(Æ)|(æ)|(Œ)|(œ)|(Ñ)|(ñ))/(?1:-)(?2:A)(?3:a)(?4:E)(?5:e)(?6:I)(?7:i)(?8:O)(?9:o)(?10:U)(?11:u)(?12:AE)(?13:ae)(?14:OE)(?15:oe)(?16:N)(?17:n)/g}

Note: Æ (Aelig) must be transliterated as AE (the same for Œ => OE)

Casimir et Hippolyte
  • 88,009
  • 5
  • 94
  • 125