1

I have a list of addresses that are generally of the following type:

1000 Currie AV Apt: Minneapolis MN 55403

1843 Polk ST NE Apt: b

1801 3 AV S Apt: 203 Minneapolis MN 55404

2900 Thomas AV S Apt: 1618 MPLS MN 55416

8409 Elliott AV S Apt: Bloomington MN 55420

I am new to regular expressions.

I would like to replace Apt: and all the text until the first capital letter with a blank.

Right now the code that I am trying is the following:

generate address_home = regexr(address_home1, "(Apt:).*?([A-Z])", " ")
WCB

5 Answers

1

Regex:

Apt:[^A-Z\n]*

Replace the matched characters with a single space.
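As a rough illustration of what this pattern matches, here is a minimal Python sketch (Python's `re` module stands in for Stata's engine, which may treat `\n` inside brackets differently). The whitespace clean-up at the end is an added assumption, not part of the pattern above; it collapses the double space left by the replacement.

```python
import re

addresses = [
    "1000 Currie AV Apt: Minneapolis MN 55403",
    "1843 Polk ST NE Apt: b",
    "1801 3 AV S Apt: 203 Minneapolis MN 55404",
]

# "Apt:" followed by any run of characters that are neither
# uppercase letters nor newlines (a negated character class).
pattern = re.compile(r"Apt:[^A-Z\n]*")

def strip_apt(address: str) -> str:
    replaced = pattern.sub(" ", address)
    # Assumption: collapse repeated spaces and trim the ends,
    # since the replacement can leave two spaces in a row.
    return " ".join(replaced.split())

cleaned = [strip_apt(a) for a in addresses]
for line in cleaned:
    print(line)
```

Because the match stops at the first capital letter, the apartment number goes with the label, and an address with no city after the label loses everything after `Apt:`.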

I think your code would be,

gen address_home = regexr(address_home1, "Apt:[^A-Z\n]*", " ")

OR

gen address_home = regexr(address_home1, "Apt:[^A-Z\\n]*", " ")

I am not sure whether the backslash needs to be escaped a second time.

Avinash Raj
  • Thanks so much! This worked. I am not sure why the regex above did not work. I tried the code suggested and it did not work. This formulation however did. Thanks again! – WCB Dec 10 '14 at 19:29
  • @WCB [accept if this works.](http://stackoverflow.com/help/accepted-answer) Some languages don't support lookaheads or lookbehinds, but most support a negated character class `[^...]` – Avinash Raj Dec 10 '14 at 19:30
1

A regular expression is always useful to know, but the OP may not need one here. In this particular case, a combination of the functions strpos() and substr() will mostly do the trick.

For example:

clear

input str50 adr
"1000 Currie AV Apt: Minneapolis MN 55403"
"1843 Polk ST NE Apt: b"
"1801 3 AV S Apt: 203 Minneapolis MN 55404"
"2900 Thomas AV S Apt: 1618 MPLS MN 55416"
"8409 Elliott AV S Apt: Bloomington MN 55420"
end

* keep the text before " Apt:" and append the text after the colon
generate adr2 = substr(adr, 1, strpos(adr, ":") - 5) + ///
                substr(adr, strpos(adr, ":") + 1, .)

list

   +--------------------------------------------------------------------------------------+
   |                                         adr                                     adr2 |
   |--------------------------------------------------------------------------------------|
1. |    1000 Currie AV Apt: Minneapolis MN 55403      1000 Currie AV Minneapolis MN 55403 |
2. |                      1843 Polk ST NE Apt: b                        1843 Polk ST NE b |
3. |   1801 3 AV S Apt: 203 Minneapolis MN 55404     1801 3 AV S 203 Minneapolis MN 55404 |
4. |    2900 Thomas AV S Apt: 1618 MPLS MN 55416      2900 Thomas AV S 1618 MPLS MN 55416 |
5. | 8409 Elliott AV S Apt: Bloomington MN 55420   8409 Elliott AV S Bloomington MN 55420 |
   +--------------------------------------------------------------------------------------+

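The idea is to use the : as a reference point in order to eliminate the substring Apt: from each address, since its length is always constant. Keeping the first strpos(adr, ":") - 5 characters stops just before the space that precedes Apt:, and the second substr() resumes right after the colon.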
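For readers who want to check the position arithmetic outside Stata, here is a small Python sketch of the same idea. The function name `drop_apt_label` is hypothetical, and the guard for a missing colon is an added assumption. Python's `str.find` is 0-based while Stata's strpos() is 1-based, so the offset is 4 here rather than 5.

```python
def drop_apt_label(address: str) -> str:
    """Remove ' Apt:' by slicing around the colon (hypothetical helper)."""
    colon = address.find(":")  # 0-based index, -1 if absent
    if colon == -1:
        # Assumption: leave addresses without a colon untouched.
        return address
    # "Apt" occupies the 3 characters before the colon and a space
    # precedes it, so the kept prefix ends 4 positions before the colon.
    return address[: colon - 4] + address[colon + 1 :]

examples = [
    "1000 Currie AV Apt: Minneapolis MN 55403",
    "1843 Polk ST NE Apt: b",
]
results = [drop_apt_label(a) for a in examples]
for line in results:
    print(line)
```

Like the Stata version, this keeps the apartment number and only drops the label.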


EDIT:

@Nick Cox provides a similar but even more succinct solution:

generate adr3 = subinstr(adr, "Apt: ", "", .)

This simply replaces every instance of "Apt: " (including the trailing space) with an empty string.

0

Try this substitution:

s/Apt:.*?(?=[A-Z])//g

This works in languages that use Perl or PCRE regular expressions.

  • s/// is the basic substitution skeleton
  • Apt: matches the literal text
  • .*? matches anything, as few characters as possible (non-greedy)
  • (?=[A-Z]) is a lookahead: it requires an uppercase letter to follow but leaves that letter out of the match
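The pieces listed above can be tried in any engine with Perl-style lookaheads. Below is a minimal sketch in Python, whose `re` module supports both the non-greedy `.*?` and `(?=...)`; the function name `strip_apt` is hypothetical. `re.sub` replaces every match, which mirrors the `g` flag.

```python
import re

# "Apt:", then as little as possible, up to (but not including)
# the next uppercase letter.
pattern = re.compile(r"Apt:.*?(?=[A-Z])")

def strip_apt(address: str) -> str:
    # Replace each match with the empty string, like s///g.
    return pattern.sub("", address)

samples = [
    "1000 Currie AV Apt: Minneapolis MN 55403",
    "1801 3 AV S Apt: 203 Minneapolis MN 55404",
    "1843 Polk ST NE Apt: b",
]
stripped = [strip_apt(s) for s in samples]
for line in stripped:
    print(line)
```

An address with no uppercase letter after `Apt:` is left unchanged, because the lookahead can never succeed there.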
Gilles Quénot
0

I think your regex should be something like this:

.*(Apt:.*?)([A-Z]).* 

And your code like this:

regexr(address_home1, ".*(Apt:.*?)([A-Z]).*", " ")
davidahines
0

Stata's regex is not very sophisticated and I'm no regex expert, but this gets you close:

clear
set more off

*----- example data set -----

input ///
str30 adr
"1000 Currie AV Apt: Minneapolis MN 55403"
"1843 Polk ST NE Apt: b"
"1801 3 AV S Apt: 203 Minneapolis MN 55404"
"2900 Thomas AV S Apt: 1618 MPLS MN 55416"
"8409 Elliott AV S Apt: Bloomington MN 55420"
end

list

*----- what you want -----

gen adr2 = itrim(regexr(adr, "(Apt: *)([a-z0-9]*)", ""))

list

Resulting in:

. list

     +------------------------------------------------------------+
     |                            adr                        adr2 |
     |------------------------------------------------------------|
  1. | 1000 Currie AV Apt: Minneapoli   1000 Currie AV Minneapoli |
  2. |         1843 Polk ST NE Apt: b            1843 Polk ST NE  |
  3. | 1801 3 AV S Apt: 203 Minneapol       1801 3 AV S Minneapol |
  4. | 2900 Thomas AV S Apt: 1618 MPL        2900 Thomas AV S MPL |
  5. | 8409 Elliott AV S Apt: Bloomin   8409 Elliott AV S Bloomin |
     +------------------------------------------------------------+

If needed, you can use further string functions like trim(). See help string functions.

Roberto Ferrer