RegEx to match acronyms

Question

I am trying to write a regular expression that will match values such as U.S., D.C., U.S.A., etc.

Here is what I have so far -

\b([a-zA-Z]\.){2,}+

Note how this expression matches but does not include the last letter in the acronym.

Can anyone help explain what I am missing here?

SOLUTION

I'm posting the solution here in case this helps anyone.

\b(?:[a-zA-Z]\.){2,}

It seems as if a non-capturing group is required here.

How is that different from what I already have? I have added the word boundary restriction along with a restriction that says the acronym must have at least two letters. — Randy, Jan 29 '16 at 02:49
@nu11p01n73r, see my screenshot above. I think it's well illustrated. — Randy, Jan 29 '16 at 02:52
Yeah, in that screenshot the last character is in a different colour because it is captured in group 1. That doesn't mean the string isn't matched: the whole string is matched, and the last letter (say `C`, with its dot) is also stored in group 1. – nu11p01n73R, Jan 29 '16 at 02:53
It's matched but it's recognized as a separate match which is not what I want. — Randy, Jan 29 '16 at 02:54
Okay, then a non-capturing group will prevent it. Also, you can always get the entire match from group 0, e.g. `D.C.` will be in group 0. – nu11p01n73R, Jan 29 '16 at 02:57
If you have a solution, you should post it in an answer and accept it, not post it in your question — Brendan Abel, Jan 29 '16 at 18:15

Leromul · Accepted Answer · 2016-01-29T03:46:01.223

9

Try (?:[a-zA-Z]\.){2,}

?: (non-capturing group) is there because you want to omit capturing the last iteration of the repeated group.

For example, without ?:, 'U.S.A.' will yield a group match 'A.', which you are not interested in.
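
A minimal sketch of the difference, assuming Python's `re` module (the question does not say which engine is in use, and the sample sentence is made up). It shows that the whole acronym is matched either way, and that only the content of group 1 changes:

```python
import re

# Hypothetical sample text, not from the question.
text = "I flew from the U.S.A. to D.C. today"

capturing = re.compile(r"\b([a-zA-Z]\.){2,}")
non_capturing = re.compile(r"\b(?:[a-zA-Z]\.){2,}")

# With one capturing group, findall returns that group's content,
# which is only the LAST repetition of the repeated group.
captured_only = capturing.findall(text)

# With no capturing group, findall returns the whole match.
whole_matches = non_capturing.findall(text)

# Group 0 is always the entire match, even in the capturing version.
first = capturing.search(text)
first_whole, first_group = first.group(0), first.group(1)

print(captured_only)             # ['A.', 'C.']
print(whole_matches)             # ['U.S.A.', 'D.C.']
print(first_whole, first_group)  # U.S.A. A.
```

Tools that highlight groups in a different colour, or APIs that return groups instead of the full match, can make the capturing version look as if it drops the last letter.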

edited Jan 29 '16 at 03:46

answered Jan 29 '16 at 02:49

Leromul

Leromul, your answer is correct but for completeness I think it would be helpful to myself as well as others to provide an explanation for why this works versus why my solution doesn't. Thanks! – Randy Jan 29 '16 at 02:59

Ramfjord · Answer 2 · 2016-01-29T03:32:00.090

2

None of these proposed solutions do what yours does - make sure that there are at least 2 letters in the acronym. Also, yours works on http://rubular.com/ . This is probably some issue with the regex implementation - to be fair, all of the matches that you got were valid acronyms. To fix this, you could either:

Make sure there's a space or the end of the string right after your expression ((?=\s|$) in Ruby at least)
Surround your regex with ^ and $ to make sure it matches the whole string. You'd have to split the input on spaces first and test each token to get matches with this, though.
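
The second option can be sketched roughly as follows, assuming Python's `re` module. `fullmatch` stands in for the `^`...`$` anchors, and the helper name `find_acronyms` and the sample sentence are made up for illustration:

```python
import re

# Same core pattern as in the question, without the word boundary,
# because each token is tested as a whole.
ACRONYM = re.compile(r"(?:[a-zA-Z]\.){2,}")

def find_acronyms(text):
    """Split on whitespace and keep tokens that are acronyms in full."""
    return [tok for tok in text.split() if ACRONYM.fullmatch(tok)]

tokens = find_acronyms("The U.S. capital is Washington D.C. not x.y.z or D.C.,")
print(tokens)  # ['U.S.', 'D.C.']
```

Note that the final `D.C.,` is rejected because of its trailing comma, so tokens would need their punctuation stripped first if that case matters.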

I prefer the former solution - to do this you'd have:

\b([a-zA-Z]\.){2,}(?=\s|$)

Edit: I've realized this doesn't actually work with other punctuation in the string, and a couple of other edge cases. This is super ugly, but I think it should be good enough:

(?<=\s|^)((?:[a-zA-Z]\.){2,})(?=[[:punct:]]?(?:\s|$))

This assumes that you've got this [[:punct:]] character class, and allows for 0-1 punctuation marks after an acronym that won't be captured. I've also fixed it up so that there's a single capture group that gets the whole acronym. Check out validation at http://rubular.com/r/lmr0qERLDh
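
For engines without POSIX bracket classes, here is a rough Python approximation of both lookahead patterns. This is a sketch, not a port: Python's `re` has no `[[:punct:]]` and requires fixed-width lookbehind, so `string.punctuation` and `(?<!\S)` are swapped in as assumptions, and the sample sentence is made up.

```python
import re
import string

# Stand-in for [[:punct:]]: ASCII punctuation only (an assumption).
PUNCT = "[" + re.escape(string.punctuation) + "]"

# First version: the acronym must be followed by whitespace or end of string.
simple = re.compile(r"\b(?:[a-zA-Z]\.){2,}(?=\s|$)")

# Second version: (?<!\S) replaces (?<=\s|^), and at most one
# punctuation mark may sit between the acronym and the whitespace/end.
tolerant = re.compile(
    r"(?<!\S)((?:[a-zA-Z]\.){2,})(?=" + PUNCT + r"?(?:\s|$))"
)

text = "He moved from D.C., to the U.S.A. (not x.y.z) and a.b.c"

simple_hits = simple.findall(text)
tolerant_hits = tolerant.findall(text)  # returns group 1, the acronym

print(simple_hits)    # ['U.S.A.']
print(tolerant_hits)  # ['D.C.', 'U.S.A.']
```

The first pattern misses `D.C.,` because of the comma; the second accepts it while still rejecting `x.y.z` and `a.b.c`, which are followed by a letter.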

Bonus: you now get to make this super confusing to anyone reading it.

edited Jan 29 '16 at 03:32

answered Jan 29 '16 at 02:59

Ramfjord

This is close, but the word may or may not be at the end of the string. Ultimately, I ended up with \b(?:[a-zA-Z]\.){2,} and it's doing what I want. – Randy Jan 29 '16 at 03:01
The lookahead solution I posted matches either a space or the end of a string. But if yours is working for you, go ahead. – Ramfjord Jan 29 '16 at 03:03
Just for the record, I tested this and it's doing the same thing mine is doing. It's not matching the last letter of the acronym. Could this be language dependent? – Randy Jan 29 '16 at 03:11
No, I don't think that could be an implementation issue... I've also just realized that some of your examples are confusingly matching one letter acronyms - e.g. `D.C.` is matching `D.`, which should be disallowed by the `{2,}` that you have. Also, it occurs to me that your regex would probably match some non acronym strings: http://rubular.com/r/jjCrtADKsV . – Ramfjord Jan 29 '16 at 03:17
Looks like you are correct about that. In my case it's okay because all words are already tokenized. – Randy Jan 29 '16 at 03:18

Skaparate · Answer 3 · 2016-01-29T03:01:22.953

1

This should work:

/([a-zA-Z]\.)+/g
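
One thing to watch with this pattern, sketched here in Python as an assumption about the engine (`findall` plays the role of the `g` flag, and the group is made non-capturing so that whole matches are returned): without `\b` and `{2,}` it also matches the last letter of any word that ends a sentence. The sample sentence is made up.

```python
import re

# This answer's pattern, with the group made non-capturing for findall.
loose = re.compile(r"(?:[a-zA-Z]\.)+")

# The pattern the asker settled on, for comparison.
strict = re.compile(r"\b(?:[a-zA-Z]\.){2,}")

text = "He works in the U.S. now."

loose_hits = loose.findall(text)
strict_hits = strict.findall(text)

print(loose_hits)   # ['U.S.', 'w.']
print(strict_hits)  # ['U.S.']
```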

edited Jan 29 '16 at 03:01

answered Jan 29 '16 at 02:50

Skaparate

Thanks for the suggestion but this doesn't actually work. For some reason a non-capturing group is required. – Randy Jan 29 '16 at 02:51
That's weird. Are you using any particular language (Java, C#, etc.)? – Skaparate Jan 29 '16 at 02:52

score 1 · Answer 4 · answered May 04 '22 at 09:50

I have slightly modified the solution above:

\b(?:[a-zA-Z]+\.){2,}

to enable matching acronyms containing more than one letter between the dots, as in 'GHQ.AFP.X.Y' (the match is 'GHQ.AFP.X.'; the trailing 'Y' has no dot after it, so it is not included)
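
A quick comparison of the two variants, assuming Python's `re` module; the variable names are illustrative. It also shows the trade-off raised in the comment below: allowing several letters per segment lets abbreviations such as 'Ph.D.' through.

```python
import re

single = re.compile(r"\b(?:[a-zA-Z]\.){2,}")   # one letter per segment
multi = re.compile(r"\b(?:[a-zA-Z]+\.){2,}")   # one or more letters per segment

samples = ["U.S.A.", "GHQ.AFP.X.Y", "Ph.D."]

single_hits = {s: single.findall(s) for s in samples}
multi_hits = {s: multi.findall(s) for s in samples}

for s in samples:
    print(s, single_hits[s], multi_hits[s])
# U.S.A. ['U.S.A.'] ['U.S.A.']
# GHQ.AFP.X.Y [] ['GHQ.AFP.X.']
# Ph.D. [] ['Ph.D.']
```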

answered May 04 '22 at 09:50

Stanislav Koncebovski

I used \b(?:[a-zA-Z]\.){2,} so it won't capture words like Ph.D. – Brandalf Sep 22 '22 at 20:43
