
Where can I see the code behind the predefined patterns for regular expressions in R? The documentation says they are related to the locale (the POSIX locale).

   > [[:alpha:]]
   > [:alpha:]

Typing these at the prompt does not print anything. How can I look up the predefined patterns, and the quantifiers that control how many times a pattern should match?

Any help is highly appreciated.

oguz ismail
Sowmya S. Manian
  • you have to use the above regex in R regex functions like `gregexpr`, etc – Avinash Raj Sep 21 '16 at 07:54
  • I know that, as you use `"[[:digit:]]"` inside pattern argument. I want to know how they have created these patterns `[:digit:]`, `[:blank:]`. As we are just using it because it is predefined. Lets say I want to create one predefined pattern lets say `[:Avinash:]`. How should I create it, What class of object it is. etc. – Sowmya S. Manian Sep 21 '16 at 07:57
  • 1
    Short answer: you can't without modifying the source code of the regex interpreter. They are just keywords for the regex interpreter; they'll be replaced by their character class before evaluation. – Tensibai Sep 21 '16 at 07:59
  • 1
    @SowmyaS.Manian No, you can't. Those are predefined POSIX regex classes. – Avinash Raj Sep 21 '16 at 07:59
  • 3
    It doesn't make much sense to do that. If you want to find `"My_pattern"` just put that in the `pattern` argument. – Rich Scriven Sep 21 '16 at 08:00
  • @RichScriven Its just a question. If I want to have it in the same way like `[: XYZ:]` Is there a way to access existing ones. Thats it. – Sowmya S. Manian Sep 21 '16 at 08:01
  • So are they like java code or something? – Sowmya S. Manian Sep 21 '16 at 08:04
  • 2
    @SowmyaS.Manian So again, NO unless you modify [this](https://github.com/wch/r-source/blob/e5b21d0397c607883ff25cca379687b86933d730/src/extra/tre/tre-parse.c) and recompile R. This will only work in your own compiled version of R. So I'm pretty sure this is not what you're after. – Tensibai Sep 21 '16 at 08:07
  • 1
    http://stackoverflow.com/documentation/regex/1757/character-classes/17891/posix-character-classes#t=201609210808578561416 – Rich Scriven Sep 21 '16 at 08:09
  • Ok Thank you guyz. That helps. I'll check on those classes. I hope this question was not wrong to ask in here. Was just curious how they have predefined these classes for regular expressions. – Sowmya S. Manian Sep 21 '16 at 08:09
  • I think you have expressed your idea incorrectly. You seem to want to shorten your long patterns with repeating subpatterns. Just use variables and build the final regex pattern from them. `manian_class <- "[A-Za-z~!@#$%^&*()_.-]"` -> `reg <- paste0(manian_class,"+(?:\\s+",manian_class,"+)*")`. Something like this. – Wiktor Stribiżew Sep 21 '16 at 08:23
  • Can you please edit your question, so that it asks only one question, i.e., how to see what these character classes match? – Roland Sep 21 '16 at 08:47

1 Answer


First we read help("regex"):

[:lower:]
Lower-case letters in the current locale.

The same goes for [:upper:], and [:alpha:] is just the union of the two.
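As a quick check of that union, here is a minimal sketch. The test vector `x` and the variable names are only for illustration, and the check is limited to ASCII characters so that it should not depend on the locale. Note that a class name has to sit inside a bracket expression, i.e. `[[:alpha:]]`, and the whole thing is passed as a string to a regex function.

```r
# A few ASCII test characters: letters, digits, and two non-letters.
x <- c(letters, LETTERS, 0:9, "_", " ")

# Which elements match [:alpha:]?
is_alpha <- grepl("[[:alpha:]]", x)

# Which elements match [:lower:] or [:upper:]?
is_lower_or_upper <- grepl("[[:lower:]]", x) | grepl("[[:upper:]]", x)

# For these characters the two logical vectors agree.
identical(is_alpha, is_lower_or_upper)
#[1] TRUE
```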

Then we can check the current locale's character set:

Sys.getlocale("LC_CTYPE")
#[1] "German_Germany.1252"

l10n_info()
#$MBCS
#[1] FALSE
#
#$`UTF-8`
#[1] FALSE
#
#$`Latin-1`
#[1] TRUE
#
#$codepage
#[1] 1252

Then we can look up that character set (here code page 1252) online, e.g. on Wikipedia.

Then we can try this:

gsub("[^[:alpha:]]", "", rawToChar(as.raw(1:(16^2-1))))
#[1] "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ"
gsub("[^[:cntrl:]]", "", rawToChar(as.raw(1:(16^2-1))))
#[1] "\001\002\003\004\005\006\a\b\t\n\v\f\r\016\017\020\021\022\023\024\025\026\027\030\031\032\033\034\035\036\037\177€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ"
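The output above depends on the locale and encoding of the session. A variant of the same idea that sticks to the ASCII range (codes 1 to 127) tests each character on its own with `grepl`. This is only a sketch: `class_members` is a hypothetical helper written for this example, not something R provides.

```r
# One single-character string per ASCII code 1..127
# (code 0 is skipped because R strings cannot contain an embedded nul).
ascii <- vapply(1:127, function(i) rawToChar(as.raw(i)), character(1))

# Hypothetical helper: the ASCII characters matched by a POSIX class name.
class_members <- function(class) {
  pattern <- paste0("[[:", class, ":]]")  # e.g. "[[:digit:]]"
  ascii[grepl(pattern, ascii)]
}

paste(class_members("digit"), collapse = "")
#[1] "0123456789"
paste(class_members("upper"), collapse = "")
#[1] "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
```

Calling `class_members("punct")` or `class_members("space")` in the same way lists the members of the other classes documented in help("regex").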
Roland
  • Thanks Roland. That helps to start from somewhere digging into those classes. You think can I do something like `str()` on these predefined classes or patterns? Because these are values used for `pattern` argument – Sowmya S. Manian Sep 21 '16 at 09:05
  • 2
    You really seem to be confused. Regular expressions are handled by the regex engine and not by R. You can consider this a different programming language with which R interfaces. R only passes and receives character strings to and from the regex engine. Thus, to R these classes are just character strings. – Roland Sep 21 '16 at 11:00
  • Oh ok. Now this helps. Thank you. I have started working on regular expressions in R. Execution wise everything is fine. Just questions are popping up. – Sowmya S. Manian Sep 21 '16 at 11:41
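
To illustrate the point made in the comments that, to R, these classes are just character strings: a pattern can be stored in a variable and combined with other pieces using `paste0`, including a quantifier such as `+` for "one or more". The variable names below are made up for this sketch.

```r
# The class is stored like any other string.
digit_class <- "[[:digit:]]"
is.character(digit_class)
#[1] TRUE

# Build a larger pattern from it: one or more digits, anchored at both ends.
whole_number <- paste0("^", digit_class, "+$")

grepl(whole_number, c("2016", "21 Sep", ""))
#[1]  TRUE FALSE FALSE
```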