I'm new to regex. I want to extract address lines from Turkish text, but in Turkish there is no standard way of writing addresses. For instance, district = mahalle.

District can be written in any of the forms below:

"Mah." "Mh." "MAH." "MH" "mh." "mah." or "mahalle"

regex = ((.*)((\b[Mm][Aa]?[Hh].?)(.*)))

This regex finds all forms of district except the last one.

There are two possible forms of district: 1. "mah.", "mh." 2. "mahalle"

How can I match both with the same regex?

Note: I don't want to use an | (or) alternation such as  .... .... | (.*)mahalle(.*)
babeyh
  • Could you clarify what you need? I think you do not want to use `|` because you are not aware of the non-capturing group `(?:...)`? What should be captured and what not? – Wiktor Stribiżew Mar 07 '16 at 15:33
  • This sounds like an XY problem. You have a problem, someone told you that regex were very sexy, and now, you have two problems. – Thomas Ayoub Mar 07 '16 at 15:35
  • I want to capture the full address line (district, street, etc.), but in Turkish you can use either abbreviations or the full word. For example, for "street" I want to capture any line that includes either "street" or "st." – babeyh Mar 07 '16 at 15:37
  • Could you give us some *real* input you'll have to deal with (match **with** context) – Thomas Ayoub Mar 07 '16 at 16:01
  • It is not recommended to use REGEX for street addresses of any kind because they tend to have irregular patterns and REGEX relies on regular patterns. https://smartystreets.com/articles/regular-expressions-for-street-addresses – camiblanch Mar 07 '16 at 17:15
  • Forget Turkish. I want to capture both "abc." and "abcde" with the same regex. Is that possible or not? – babeyh Mar 07 '16 at 17:26
  • [Yes it is.](https://regex101.com/r/hO6kB2/1) – Thomas Ayoub Mar 07 '16 at 17:43
  • [`(?i)\bma?h\.?(?:alle\b)?`](https://regex101.com/r/bB4iQ6/1)? But this can also match `mah.alle`... – Wiktor Stribiżew Mar 07 '16 at 18:35
  • Thanks, everyone :) `(.*)([Mm][Aa]?[Hh]\.?)(.*)` solves the problem :) – babeyh Mar 12 '16 at 13:06

1 Answer

Since there aren't many options to begin with, you can use the OR operator to avoid complexity. Take a look at how Stanford NLP does it with US states: ABSTATE = Ala|Ariz|[A]z|[A]rk|Calif|Colo|Conn|Ct|Dak|[D]el|Fla|Ga|[I]ll|Ind|Kans?|Ky|[L]a|[M]ass|Md|Mich|Minn|[M]iss|Mo|Mont|Neb|Nev|Okla|[O]re|[P]a|Penn|Tenn|[T]ex|Va|Vt|[W]ash|Wisc?|Wyo

So, taking our example: Mah\.|Mh\.|MAH\.|MH|mh\.|mah\.|mahalle (the dots are escaped so that they match a literal period rather than any character). You can of course simplify this by using the case-insensitive flag to cover Mah./MAH./mah.
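
As a rough illustration of the alternation idea, here is a minimal Python sketch. Python is an assumption, since the question does not name a language, and the `DISTRICT` pattern, the `find_district_lines` helper, and the sample addresses are all hypothetical. The `(?!\w)` lookahead is an extra guard added here, not part of the pattern above, so that longer words which merely start with "mah" are skipped.

```python
import re

# Full word first, then the abbreviations "mah"/"mh" with an optional literal dot.
# re.IGNORECASE covers Mah./MAH./mah. without listing each casing separately.
# (?!\w) is an added guard: the match must not run into more letters.
DISTRICT = re.compile(r"\b(?:mahalle|ma?h\.?)(?!\w)", re.IGNORECASE)


def find_district_lines(lines):
    """Return the lines that contain any spelling of the district marker."""
    return [line for line in lines if DISTRICT.search(line)]


# Hypothetical sample input, made up for illustration only.
samples = [
    "Yeni Mah. Gul Sok. No 5",
    "Yeni mahalle Gul Sok. No 5",
    "Yeni MH Gul Sok. No 5",
    "Mahkeme binasi",  # starts with "mah" but is not a district marker
]

for line in find_district_lines(samples):
    print(line)
```

The non-capturing group `(?:...)` keeps the alternation in one place, so the rest of the address pattern can be written around it without extra capture groups.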

Victor G.