2

I am looking for a regex pattern in Java that matches every character except the letters a to z.

In other words, I want a regex pattern that matches symbols such as

 !"#¤%&/()=?`´\}}][{€$@

Alternatively, I need some way to strip a string down to letters only.

For example, I want to turn the following string:

 "one!#"¤%()=) two}]}[()\ three[{€$"

into:

 "one two three"
tchrist
James Ford

4 Answers

4

The Unicode version would be

\PL

\PL matches every Unicode code point that does not have the property "Letter".

\pL is the counterpart: it matches every Unicode code point that does have the property "Letter".

You may find other properties on regular-expressions.info that fit your needs better.

You can also combine them in character classes, just as you would with the predefined classes, e.g.

[^\pL\pN]

This matches any character that is neither a letter nor a numeric character in Unicode.
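
As a minimal sketch of how these two patterns could be used from Java: the class and method names below are hypothetical and only serve the illustration. Note that the backslashes have to be doubled inside Java string literals.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
final class UnicodeLetterFilter {
    // \PL: any code point that does NOT have the Unicode "Letter" property.
    private static final Pattern NON_LETTER = Pattern.compile("\\PL");

    // [^\pL\pN]: any code point that is neither a letter nor a numeric character.
    private static final Pattern NON_LETTER_OR_NUMBER = Pattern.compile("[^\\pL\\pN]");

    // Removes everything that is not a letter (spaces are removed too).
    static String lettersOnly(String s) {
        return NON_LETTER.matcher(s).replaceAll("");
    }

    // Removes everything that is neither a letter nor a numeric character.
    static String lettersAndNumbersOnly(String s) {
        return NON_LETTER_OR_NUMBER.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        String input = "fa\u00E7ade #1!"; // "façade #1!"
        System.out.println(lettersOnly(input));           // returns "façade"
        System.out.println(lettersAndNumbersOnly(input)); // returns "façade1"
    }
}
```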

stema
  • 1
    Technically, `\pN` includes nondigits. `\p{Nd}` is just the decimal digits. `\pN` also includes `\p{Nl}` for letter numbers like Roman numerals, and `\p{No}` for things like vulgar fractions, superscripts, and subscripts. I upvoted you anyway, because you definitely have the right idea. BTW, it looks like he’s wanting to retain spaces. I don’t know if that means any Unicode whitespace or just a literal space. – tchrist Feb 29 '12 at 15:20
  • @tchrist of course you are correct. The correct term was "numeric character" and I wrote "digit". – stema Feb 29 '12 at 15:23
3

For example, I want to turn the following string:

 "one!#"¤%()=) two}]}[()\ three[{€$"

into:

 "one two three"

The pattern you need must match everything that is neither a letter nor a separator. Otherwise you would end up with "onetwothree" instead of the "one two three" you asked for.

[^\pL\pZ]
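
A short sketch of this pattern applied to the string from the question, assuming the matched characters are simply replaced with the empty string. The class and method names are hypothetical.

```java
// Hypothetical helper class, not part of the original answer.
final class KeepLettersAndSeparators {
    // Deletes every character that is neither a letter (\pL) nor a separator (\pZ).
    // \pZ covers space, line, and paragraph separators; tabs and newlines are
    // control characters, so this pattern would remove them as well.
    static String strip(String s) {
        return s.replaceAll("[^\\pL\\pZ]", "");
    }

    public static void main(String[] args) {
        // The question's example; \u00A4 is the currency sign, \u20AC the euro sign.
        String input = "one!#\"\u00A4%()=) two}]}[()\\ three[{\u20AC$";
        System.out.println(strip(input)); // prints "one two three"
    }
}
```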
Community
tchrist
1

[^a-zA-Z] is a character class that matches every character apart from the letters a to z in lower or upper case.

Richard
1

The simplest form: [^a-z]

It could also be [^a-zA-Z] if uppercase letters should be excluded from the match as well, that is, kept in the string.
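
A minimal sketch of the ASCII-only approach. Adding a literal space to the class is an assumption made here so that the words stay separated. The class and method names are hypothetical. Because the class only knows the ASCII letters, accented letters are removed too.

```java
// Hypothetical helper class, not part of the original answer.
final class AsciiLetterFilter {
    // Deletes everything except ASCII letters and the plain space character.
    // The space inside the class is an assumption added for this sketch.
    static String strip(String s) {
        return s.replaceAll("[^a-zA-Z ]", "");
    }

    public static void main(String[] args) {
        System.out.println(strip("one?! two# three,.\\][")); // prints "one two three"
        // The precomposed c-cedilla (U+00E7) is not in a-z, so it is dropped.
        System.out.println(strip("fa\u00E7ade"));            // prints "faade"
    }
}
```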

huelbois
  • So how would I trim a string, say "one?! two# three,.\][" to "one two three"? – James Ford Feb 29 '12 at 15:10
  • Won’t that turn `façade` into `faade`? – tchrist Feb 29 '12 at 15:13
  • @tchrist I would think so - I guess it depends upon what language you're parsing - if you're parsing good old American English, for instance, you probably won't have to worry about graves, cedillas, umlauts, circumflexes and the like none too much, I reckon. – Code Jockey Feb 29 '12 at 15:35
  • @CodeJockey That’s not true! *Properly written* English certainly includes diacritics! The **Oxford *English* Dictionary** attests such **English** words as: *Allerød, après-ski, Bokmål, brassière, caña, crème, crêpe, désœuvrement, Fabergé, façade, fête, feuilleté, flügelhorn, flügelhorn, Gödelian, jalapeño, Madrileño, Möbius, Mohorovičić discontinuity, moiré, naïve, Niçoise, piñon, plaçage, prêt-à-porter, Provençal, quinceañera, Ragnarök, résumé, Schrödinger’s cat, Shijō, smørrebrød, soirée, tapénade, vicuña, vis-à-vis, Zuñi, α-ketoisovaleric, (α-)lipoic, (β-)nornicotine,* and *ψ-ionone.* – tchrist Feb 29 '12 at 15:42
  • Does the Oxford English Dictionary describe "**good old American** English" - or is it the more proper form of English - I know many manuscripts - especially older ones - include diacritics of all sorts. (some even use `f` instead of `s`???) International mappings for keyboards also include the ability to easily add diacritics for those wanting to properly adorn their typed characters (personally, I just open up 'character map') - most forum posts on American English sites are diacritic-free, however. Again, it depends upon what you're parsing and the _facade_ the writers intended to portray. – Code Jockey Feb 29 '12 at 15:51
  • @tchrist perhaps I should have used something like "vernacular", "bastardized English", or "commonly used spellings" - I challenge you to find a person that adds diacritics to their English-based text messages - though, I'm sure someone out there does it... I often try to add correct punctuation and capitalization to my own text messages, for the sake of clarity. – Code Jockey Feb 29 '12 at 15:57
  • @tchrist let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/8351/discussion-between-code-jockey-and-tchrist) – Code Jockey Feb 29 '12 at 16:04
  • @CodeJockey More Bringhurst: “To the 600‐character globalized Latin alphabet, mathematicians, grammarians, chemists, and even typographers are prone to make additions: arabic numerals, punctuation, technical symbols, letters borrowed from Hebrew, Greek, and Cyrillic, [...] [A]uthors, editors, typographers, and ordinary citizens who just want to be able to spell Dvořák, Miłosz, Mą’ii, or al‐Fārābī, or to quote a line of Sophocles or Pushkin, or the Vedas or the Sutras or the Psalms, or to write φ ≠ π, are beneficiaries of a system this inclusive.” – tchrist Feb 29 '12 at 16:05
  • @tchrist : same problem in French. For a long time, it wasn't technically (printing, computers) possible to use accents (for example) on uppercase letters. Now that it's possible on every computer (but not always very clear how to do it), people tend to think diacritics are not required on uppercases. BTW, on my french-configured Debian squeeze, grep "[a-z]" DOES grep ï also (but not Ï of course). – huelbois Feb 29 '12 at 16:06
  • @huelbois The respective academies of French and Spanish tell you that one must retain diacritics even on uppercase but alas people often omit them anyway. That’s a very fancy Debian configuration if your `[a-z]` actually includes [French à, â, ç, é, è, ê, ë, î, ï, ô, œ, û, ù, ü](http://en.wikipedia.org/wiki/French_orthography#Diacritics). I’d normally just use `\pL` to write programs that aren’t being locale-sensitive, since locale support is very shoddy between vendors, and then if needed specify an exact French `[àâçéèêëîïôœûùü]` elsewhere. I prefer to work in NFD not NFC, hence `\pL\pM*+`. – tchrist Feb 29 '12 at 16:20
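
The NFD approach from the last comment could look roughly like the sketch below. It assumes the input is first decomposed with java.text.Normalizer and that each base letter is then matched together with its combining marks. The class and method names are hypothetical.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not taken from the comment thread.
final class NfdLetters {
    // A letter followed by any number of combining marks (possessive quantifier).
    private static final Pattern LETTER_WITH_MARKS = Pattern.compile("\\pL\\pM*+");

    // Decomposes the input to NFD and returns each letter with its marks.
    static List<String> lettersWithMarks(String s) {
        String nfd = Normalizer.normalize(s, Normalizer.Form.NFD);
        List<String> result = new ArrayList<>();
        Matcher m = LETTER_WITH_MARKS.matcher(nfd);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // "façade!" yields 6 entries; the third is 'c' plus U+0327 (combining cedilla).
        List<String> letters = lettersWithMarks("fa\u00E7ade!");
        System.out.println(letters.size()); // prints 6
    }
}
```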