1

Assuming a postcode has the form A0A 0AA or A0 0AA, where A is any letter and 0 is any digit, I have written the following sed script to search a web page for postcodes.

s/\(([[:alnum:]]\{2,4\})\) \(([[:alnum:]]\{3\})\)/\1 \2/p

The idea is to store the first part (A0A) in the first group and the second part (0AA) in the second group, then print whatever is found. However, running this currently does not find any postcodes.

Any ideas? Thanks.

Martin Ellis
BradStevenson
  • As a general tip, I'd recommend that you start building more complex regexes by constructing and testing the individual parts, verifying they work and then putting the whole thing together. In this case, that would mean trying to match **A0-or-A0A**, then **0AA**, then putting them together. – itsbruce Nov 08 '12 at 14:47
  • Odd question. Your profile says UK, but the format you give doesn't adequately describe UK postcodes. – Martin Ellis Nov 08 '12 at 14:51

3 Answers

2

I realise you're asking about a subset of valid postcodes, but I hope this solution for UK postcodes will help. I'd approach the problem like this:

UK postcodes take one of the following formats:

  • A9 9AA
  • A99 9AA
  • AA9 9AA
  • AA99 9AA
  • A9A 9AA
  • AA9A 9AA

A regex for the last part is easy: [0-9][A-Z]{2}

The first part is trickier. I'd split the problem into two:

  • The first four patterns above can be matched using [A-Z]{1,2}[0-9]{1,2}, i.e. one or two letters followed by one or two digits;
  • The last two patterns can be matched using [A-Z]{1,2}[0-9][A-Z], i.e. one or two letters, then a digit and a letter.
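Each piece can be checked on its own before combining them. The following is only a sketch: the `matches` helper and the variable names are made up for illustration, and `grep -Ex` is used so that a pattern has to match the whole test string.

```shell
#!/bin/sh
# Hypothetical helper (not from the answer): succeeds when the whole
# string $2 matches the extended regex $1. grep -x anchors to the full line.
LC_ALL=C; export LC_ALL   # keep [A-Z] limited to ASCII capitals

matches() {
    printf '%s\n' "$2" | grep -Eqx "$1"
}

inward='[0-9][A-Z]{2}'                  # last part, e.g. 9AA
outward_digits='[A-Z]{1,2}[0-9]{1,2}'   # first four formats
outward_letter='[A-Z]{1,2}[0-9][A-Z]'   # last two formats

# Classify each sample first part by the sub-pattern that accepts it.
for s in A9 A99 AA9 AA99 A9A AA9A; do
    if matches "$outward_digits" "$s"; then
        echo "$s: letters then digits"
    elif matches "$outward_letter" "$s"; then
        echo "$s: letters, digit, letter"
    else
        echo "$s: no match"
    fi
done

if matches "$inward" 9AA; then echo "9AA: inward part"; fi
```

Testing the parts this way makes it easier to see which sub-pattern is at fault when the combined expression misbehaves.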

Putting it all together:

sed -rn 's/.*\b(([A-Z]{1,2}[0-9]{1,2}|[A-Z]{1,2}[0-9][A-Z]) [0-9][A-Z]{2})\b.*/\1/p'
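
The greedy leading `.*` is worth watching: without a word-boundary anchor it can swallow the first letter of a two-letter area, so SW1A 1AA would come back as W1A 1AA. Below is a minimal sketch of a trial run, assuming GNU sed (both `-r` and `\b` are GNU extensions). The function name and the sample text are made up for illustration.

```shell
#!/bin/sh
# Assumes GNU sed: -r enables extended regexes, \b is a word boundary.
# extract_postcode is a hypothetical wrapper name, not part of the answer.
extract_postcode() {
    sed -rn 's/.*\b(([A-Z]{1,2}[0-9]{1,2}|[A-Z]{1,2}[0-9][A-Z]) [0-9][A-Z]{2})\b.*/\1/p'
}

# Two sample lines: only the first contains a postcode.
printf '%s\n' 'Write to us at SW1A 1AA, London.' 'no postcode on this line' |
    extract_postcode
# prints: SW1A 1AA
```

Because the whole line is replaced by the single captured group, this prints at most one postcode per input line (the last one on the line).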
Martin Ellis
0

It's hard to find something right with your regex.

  1. What are the inner, unescaped parentheses there for? Because they are unescaped, they are literally matched. They serve no purpose, in any case.
  2. Why are you trying to match two [:alnum:] blocks when your actual pattern requires [:alpha:] in some places and [:digit:] in others?
  3. Why {2,4}? You want two or three, not two, three or four. What you actually want is either letter-number-letter or letter-number.
  4. Because you don't specify word boundaries, even if you fix your regex, the first pattern will match A0 at the end of a word and the second pattern will match 0AA at the beginning of the word.
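
Point 1 is easy to confirm from the command line. This is only a quick sketch: the variable name is made up for illustration, and the expression is the one from the question, unchanged. In a basic regular expression, `\(` opens a group while a bare `(` stands for a literal parenthesis, so the script only matches text that really contains brackets.

```shell
#!/bin/sh
# The expression from the question, unchanged. 'original' is just an
# illustrative variable name.
original='s/\(([[:alnum:]]\{2,4\})\) \(([[:alnum:]]\{3\})\)/\1 \2/p'

echo 'A0A 0AA'     | sed -n "$original"   # prints nothing
echo '(A0A) (0AA)' | sed -n "$original"   # prints: (A0A) (0AA)
```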

You need to, at minimum

  1. Drop the inner parentheses
  2. Change the {2,4} to {2,3}
  3. Add word boundary matches at the beginning and end of the regex

However, this will still not properly satisfy your requirements. It will match invalid patterns. What you really need to do is

  1. Drop the inner parentheses
  2. Change the first pattern to match either [:alpha:][:digit:] or [:alpha:][:digit:][:alpha:] (there are two ways to do this).
  3. Change the second pattern to match [:digit:][:alpha:][:alpha:]
  4. Add word boundary matches at the beginning and end of the regex.

I didn't give a concrete example of how to do this because you asked for "any ideas". I'm assuming you want to try and fix this yourself given the right pointers.

itsbruce
  • Glad you picked up on the fact that I wanted to try to work it out for myself as much as possible; I find that's the best way to learn. Following these pointers, I ended up with s/\(.*\) \([[:alpha:]]\{1,2\}\)\([[:digit:]]\{1,2\}[[:alpha:]]\{,1\}\)[[:space:]]\([[:digit:]]\)\([[:alpha:]]\{2\}\)\(.*\)/\2\3 \4\5 Thanks. – BradStevenson Nov 08 '12 at 15:58
0

It looks like you have some problems with your brackets. The following works for me:

$ sed -n 's/.*\b\([[:alnum:]]\{2,3\}\) \([[:alnum:]]\{3\}\)\b.*/\1 \2/p' <<< "here is a postcode: A0A 0AA. some more text"
A0A 0AA
dogbane
  • You've also fixed the {2,4} problem that I highlighted. You should point that out in your answer, or the OP might not notice and still be stuck. You haven't fixed the problem that the regex will generate many false matches, but then that's not the problem we were asked to fix, so that's fair. – itsbruce Nov 08 '12 at 14:51