
I have to separate a line of text into words and am confused about which regex to use. I have looked everywhere for a regex that matches a word and found ones similar to the post below, but I want it in Java (Java treats \ as an escape character in ordinary string literals).

Regex to match words and those with an apostrophe

I have tried the regex from each answer and am unsure how to structure a regex for Java (I assumed all regexes were the same). If I replace \ with \\ in the regexes I see, they don't work.

I have also tried looking it up myself and have come to this page: http://www.regular-expressions.info/reference.html

But I cannot wrap my head around regex advanced techniques.

I am using String.split(regex string here) to separate my string. For example, given the following: "I like to eat but I don't like to eat everyone's food, or they'll starve." I want to match:

I
like
to
eat
but
I
don't
like
to
eat
everyone's
food
or
they'll
starve

I also don't want to match '' or '''' or ' ' or '.'' or other permutations. My delimiter conditions should be similar to: [match any word character][also match an apostrophe if it is preceded by a word character and then match word characters after it if there are any]

What I have is just a simple regex that matches words, [\w], but I am unsure how to use lookahead or lookbehind to match the apostrophe and then the remaining word characters.
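For reference, a minimal sketch of how a regex backslash is written in a Java string literal. The class name `EscapeDemo` and the constant `WORD` are made up for illustration; only `Pattern.matches` is a real library call.

```java
import java.util.regex.Pattern;

class EscapeDemo {
    // The regex \w+ as a Java string literal: the backslash is doubled
    // because \ is also the escape character of the string literal itself.
    static final String WORD = "\\w+";

    public static void main(String[] args) {
        // The string holds three characters: backslash, 'w', '+'.
        System.out.println(WORD + " has length " + WORD.length());
        // Pattern.matches requires the whole input to match the regex.
        System.out.println(Pattern.matches(WORD, "don"));
        // An apostrophe is not a word character, so \w+ alone rejects this.
        System.out.println(Pattern.matches(WORD, "don't"));
    }
}
```

The doubling only applies to literals in source code; a regex read from a file or typed by a user is used as-is.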

Richard Duerr
  • duplicate: http://stackoverflow.com/questions/2596893/regex-to-match-words-and-those-with-an-apostrophe – Alex Nov 29 '12 at 18:56
  • Why not split on whitespace? `yourString.split("\\s+")`; – jlordo Nov 29 '12 at 18:58
  • @Alex Not really, he's asking the same question but for a different language. (Python 3.x vs Java) which would have different answers. – Nick Nov 29 '12 at 18:58
  • @Nick It involves the regex not really Java itself, no ? – Alex Nov 29 '12 at 18:59
  • @Alex No, he stated he tried all the answers in that question, but he's asking how to get Java regex to match words with the apostrophes. Python and Java implement regex differently, so there will be slight differences in how the expressions are written. If someone's not familiar with both languages, translating the regexes between the two might not be straightforward. For example, not all languages support lookbehinds, so the expressions would differ from one to another. – Nick Nov 29 '12 at 19:02
  • If I split on whitespace it doesn't weed out nonsense things such as .. or /?' or the like. I want to use this regex in Java, so I would assume a Java-friendly regex would be a good answer. I realize my question is very similar to the other one, but this is for Java, and I did not want to derail his question by asking one of my own in the comments. – Richard Duerr Nov 29 '12 at 19:16
  • @Nick yes you are right, regex can be implemented differently in Java and Python. I added an answer with a regular expression extracted from the page linked above. It appears to be working as OP wants. – Alex Nov 29 '12 at 19:28
  • The regex "\\w+('\\w+)*'?" seems to cut out all the words and leave only the punctuation and such. I gave it the sentence "Hello, World! Don't eat someone's sandwhich. Peoples'." and it gave (separated by -): ""-", "-"! "-" "-" "-" "-". "-"." This seems to pick out delimiters correctly, but how do I get it to return the words? – Richard Duerr Nov 29 '12 at 20:22

2 Answers


Using answer from WhirlWind on the page stated in my comment you can do the following:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

String candidate = "I \n"+
    "like \n"+
    "to "+
    "eat "+
    "but "+
    "I "+
    "don't "+
    "like "+
    "to "+
    "eat "+
    "everyone's "+
    "food "+
    "''  ''''  '.' ' "+
    "or "+
    "they'll "+
    "starv'e'";

// Alternatives are tried in order: 'word, word'word, word', then a plain word.
String regex = "('\\w+)|(\\w+'\\w+)|(\\w+')|(\\w+)";
Matcher matcher = Pattern.compile(regex).matcher(candidate);
while (matcher.find()) {
  System.out.println("> matched: `" + matcher.group() + "`");
}

It will print:

> matched: `I`
> matched: `like`
> matched: `to`
> matched: `eat`
> matched: `but`
> matched: `I`
> matched: `don't`
> matched: `like`
> matched: `to`
> matched: `eat`
> matched: `everyone's`
> matched: `food`
> matched: `or`
> matched: `they'll`
> matched: `starv'e`

You can find a running example here: http://ideone.com/pVOmSK
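Because `String.split` treats its pattern as a delimiter, a word-matching regex like the one above cannot be passed to it directly. A minimal sketch of collecting the matches into a list instead, assuming the simpler pattern `\w+('\w+)*'?` quoted in the question's comments (it does not cover words with a leading apostrophe); the class name `WordExtractor` and the method `words` are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordExtractor {
    // A word: word characters, then any number of apostrophe + word-character
    // groups, then an optional trailing apostrophe.
    private static final Pattern WORD = Pattern.compile("\\w+(?:'\\w+)*'?");

    // Returns every match of WORD in the text, in order of appearance.
    static List<String> words(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello, World! Don't eat someone's sandwich. Peoples'."));
    }
}
```

Note that `\w` also matches digits and underscores, so tokens such as `23andme` or `foo_bar` count as words here.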

Alex
  • It doesn't seem to work in Java's String.split(String s) method. I get empty strings and some other delimiters. Here is a screencap of my code and result from BlueJ: http://i1186.photobucket.com/albums/z379/Richard_Duerr/regexProb.png – Richard Duerr Nov 29 '12 at 22:26
  • I have tried to "invert" those conditions, as split is looking for a delimiter, so I want a delimiter to NOT be any number of word characters followed or preceded by an apostrophe, where the apostrophe is optional. – Richard Duerr Nov 29 '12 at 22:45
  • I have figured that this regex is very close: "[^a-zA-Z0-9']+" which works for every case except where an apostrophe is after a series of alphanumerics. – Richard Duerr Nov 30 '12 at 00:21
  • If you want to find words that also contain apostrophes, you cannot simply split on a simple delimiter. `[^a-zA-Z0-9']+` means that it will split on any run of characters that are not alphanumerics or apostrophes, but it won't split something that has multiple apostrophes in it. If that is fine with you, then go with it. – Alex Nov 30 '12 at 05:47
  • That will break words like "T-Mobile" or "U.K." into two. Here's a regexp that handles that: `"Hey y'all, use T-Mobile & 23andme.com in the U.K.! Thanks.".match(/[\w'-.]+\w|[\w'-]+\s*/g)` – Dan Dascalescu Apr 17 '14 at 08:49

The following regex seems to cover your sample string correctly, but it doesn't cover your scenario for the apostrophe.

[\s,.?!"]+

Java Code:

String input = "I like to eat but I don't like to eat everyone's food, or they'll starve.";
String[] inputWords = input.split("[\\s,.?!\"]+");

If I understand correctly, the apostrophe should be left alone as long as it is after a word character. This next regex should cover the above plus the special case for the apostrophe.

(?<!\w)'|[\s,.?"!][\s,.?"'!]*

Java Code:

String input = "I like to eat but I don't like to eat everyone's food, or they'll starve.";
String[] inputWords = input.split("(?<!\\w)'|[\\s,.?\"!][\\s,.?\"'!]*");

If I run the second regex on the string "Hey there! Don't eat 'the mystery meat'.", I get the following words in my string array:

Hey
there
Don't
eat
the
mystery
meat'
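The trailing apostrophe on `meat'` survives because the delimiter only consumes apostrophes that are not preceded by a word character. If a closing quote should be dropped as well, one possible variant is sketched below. It assumes that an apostrophe not followed by a word character is also a delimiter, which means possessives such as `Peoples'` lose their apostrophe too. The class name `ApostropheSplit` and the method `split` are hypothetical.

```java
import java.util.Arrays;

class ApostropheSplit {
    // A delimiter is a run of: any character that is neither a word character
    // nor an apostrophe, an apostrophe not preceded by a word character,
    // or an apostrophe not followed by a word character.
    private static final String DELIMITER = "(?:[^\\w']|(?<!\\w)'|'(?!\\w))+";

    static String[] split(String input) {
        // String.split drops trailing empty strings, but an input that starts
        // with a delimiter still yields a leading empty string.
        return input.split(DELIMITER);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(split("Hey there! Don't eat 'the mystery meat'.")));
    }
}
```

An apostrophe inside a word, as in `Don't`, has word characters on both sides, so it matches none of the three alternatives and stays in place.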
Francis Gagnon
  • That will break words like "T-Mobile" or "U.K." into two. Here's a regexp that handles that: `"Hey y'all, use T-Mobile & 23andme.com in the U.K.! Thanks.".match(/[\w'-.]+\w|[\w'-]+\s*/g)` – Dan Dascalescu Apr 17 '14 at 08:50