I am trying to write a regular expression in Java that matches words and hyphenated words. So far I have:

Pattern p1 = Pattern.compile("\\w+(?:-\\w+)",Pattern.CASE_INSENSITIVE);
Pattern p2 = Pattern.compile("[a-zA-Z0-9]+",Pattern.CASE_INSENSITIVE);
Pattern p3 = Pattern.compile("(?<=\\s)[\\w]+-$",Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

This is my test case:

    Programs
    Dsfasdf. Programs Programs Dsfasdf. Dsfasdf. as is wow woah! woah. woah? okay. 
    he said, "hi." aasdfa. wsdfalsdjf. go-to go-
    to
    asdfasdf.. , : ; " ' ( ) ? ! - / \ @ # $ % & ^ ~ `  * [ ] { } + _ 123

Any help would be awesome

My expected result is to match all the words, i.e.:

Programs Dsfasdf Programs Programs Dsfasdf Dsfasdf
as is wow woah woah woah okay he said hi aasdfa
wsdfalsdjf go-to go-to asdfasdf 

The part I'm struggling with is matching a word that is split across two lines as a single word.

i.e.:

go-
to
razlebe
MacAttack

2 Answers

\p{L}+(?:-\n?\p{L}+)*
\   /^\ /^\ /\   /^^^
 \ / | | | |  \ / |||
  |  | | | |   |  ||`- Previous can repeat 0 or more times (group of literal '-', optional new-line and one or more of any letter (upper/lower case))
  |  | | | |   |  |`-- End first non-capture group
  |  | | | |   |  `--- Match one or more of previous (any letter, upper/lower case)
  |  | | | |   `------ Match any letter (upper/lower case)
  |  | | | `---------- Match a single new-line (optional because of `?`)
  |  | | `------------ Literal '-'
  |  | `-------------- Start first non-capture group
  |  `---------------- Match one or more of previous (any letter, upper/lower case)
  `------------------- Match any letter (upper/lower case)
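
As a rough illustration of how the pattern above might be used from Java, here is a minimal sketch. The class name `HyphenWords` and the helper `words` are made up for this example, and removing the line break from each match is an assumption about the desired output (so that `go-` followed by `to` on the next line comes out as `go-to`).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HyphenWords {
    // One or more letters, then zero or more groups of:
    // '-', an optional line feed, and one or more letters.
    private static final Pattern WORD =
            Pattern.compile("\\p{L}+(?:-\\n?\\p{L}+)*");

    // Hypothetical helper: collects every match in the given text.
    static List<String> words(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            // Assumption: a word split across lines should be reported
            // without the line feed, e.g. "go-\nto" becomes "go-to".
            out.add(m.group().replace("\n", ""));
        }
        return out;
    }

    public static void main(String[] args) {
        String sample = "he said, \"hi.\" go-to go-\nto asdfasdf.. 123";
        System.out.println(words(sample));
    }
}
```

Note that `\n?` only covers a bare line feed. If the input may contain Windows line endings, something like `(?:\r?\n)?` in place of `\n?` would probably be needed.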

Is this OK?

ohaal
  • Was that generated automatically? If so, where is this from? – Ditmar Wendt Jun 27 '12 at 22:52
  • What about for words split up between lines? In my test case the go-to is split up between 2 lines. I did have that (basically) as p1 in my question minus the ? at the end. – MacAttack Jun 28 '12 at 15:46
  • Did not realize it should be valid. Updated regex to reflect requested changes. – ohaal Jun 28 '12 at 16:19

I would go with regex:

\p{L}+(?:\-\p{L}+)*

This regex also matches words such as "fiancé" and "À-la-carte", along with other words that contain non-ASCII letters, because \p{L} matches a single code point in the Unicode category "letter".
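
To make the difference from an ASCII-only character class concrete, here is a small sketch. The class name `UnicodeWords` and the helper `isWord` are invented for this example, and the accented letters are written as Unicode escapes so the result does not depend on the source file encoding.

```java
import java.util.regex.Pattern;

public class UnicodeWords {
    // The pattern from this answer: letters, optionally joined by hyphens.
    private static final Pattern WORD =
            Pattern.compile("\\p{L}+(?:\\-\\p{L}+)*");

    // Hypothetical helper: true only if the whole string is one word.
    static boolean isWord(String s) {
        return WORD.matcher(s).matches();
    }

    public static void main(String[] args) {
        String fiance = "fianc\u00e9";       // "fiancé" with a precomposed é
        String aLaCarte = "\u00c0-la-carte"; // "À-la-carte"

        System.out.println(isWord(fiance));    // true
        System.out.println(isWord(aLaCarte));  // true
        System.out.println(isWord("go-"));     // false: trailing hyphen
        System.out.println(isWord("abc123"));  // false: digits are not letters

        // For comparison, an ASCII-only class rejects the accented word.
        System.out.println(Pattern.matches("[a-zA-Z]+", fiance)); // false
    }
}
```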

Ωmega