
I've built a regex that follows this format:

4!a2!a2!c[3!c]

which translates to:

  • 4 alphabetic characters, followed by
  • 2 alphabetic characters, followed by
  • 2 characters, followed by
  • 3 optional characters

This is the standard format for a SWIFT BIC code, e.g. HSBCGB2LXXX.
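As a rough sketch, the notation can be turned into a pattern piece by piece. This assumes the usual SWIFT reading of the format codes (`a` is an upper-case letter, `c` is an upper-case letter or digit, `n!` means exactly n characters, square brackets mark an optional part). The names `BIC_PATTERN` and `is_bic` are made up for illustration, and Python is used only to make the example runnable; the pattern itself should carry over to other regex flavours.

```python
import re

# Assumed meaning of the SWIFT format codes:
#   'a' = upper-case letter, 'c' = upper-case letter or digit.
BIC_PATTERN = re.compile(
    r"[A-Z]{4}"          # 4!a   four letters
    r"[A-Z]{2}"          # 2!a   two letters
    r"[A-Z0-9]{2}"       # 2!c   two letters or digits
    r"(?:[A-Z0-9]{3})?"  # [3!c] optional group of three letters or digits
)


def is_bic(text: str) -> bool:
    """Return True when the whole string has the 4!a2!a2!c[3!c] shape."""
    return BIC_PATTERN.fullmatch(text) is not None
```

Under these assumptions both the 8-character form (`HSBCGB2L`) and the 11-character form (`HSBCGB2LXXX`) are accepted, while mixed-case input such as `HsBfGB4LXXX` is rejected.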

My regex to pull this out of a string is:

(?<=:32[^:]:)(([a-zA-Z]{4}[a-zA-Z]{2})[0-9][a-zA-Z]{1}[X]{3})

This targets a specific tag (32) and works. However, I'm not sure it's the cleanest approach, and it fails if there are any characters before the H.

The string being matched against is:

:32B:HsBfGB4LXXXHELLO

That returns HSBCGB4LXXX, but this:

:32B:2HsBfGB4LXXXHELLO

returns nothing.

EDIT

For clarity: I have a string containing multiple lines, with each tagged line starting with a colon, a two-digit number, an optional letter, and a colon (e.g. :58A:). I want to specify which line to start matching in and return a BIC from anywhere in that line.

EDIT Some more example data to help:

:20:ABCDERF  Z
:23B:CRED
:32A:140310AUD2120,
:33B:AUD2120,
:50K:/111222333                        
Mr Bank of Dad              
Dads house
England            
:52D:/DBEL02010987654321
address 1 
address 2
:53B:/HSBCGB2LXXX
:57A://AU124040
AREFERENCE
:59:/44556677
A line which HSBCGB2LXXX contains a BIC
:70:Another line of data
:71A:Even more

OK, so I need to pass in the tag (53 or 59) as a variable and return only the BIC, HSBCGB2LXXX!
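One possible way to express that requirement is in two steps: first isolate the field for the requested tag (including any continuation lines), then search that field for a BIC-shaped token. The sketch below is only an illustration under assumptions: `find_bic` and `BIC` are hypothetical names, the BIC shape is taken as 6 upper-case letters, 2 upper-case alphanumerics, and an optional 3-character suffix, and Python stands in for whichever language ends up hosting the regex.

```python
import re
from typing import Optional

# Assumed BIC shape: 6 letters, 2 alphanumerics, optional 3-character suffix.
BIC = r"[A-Z]{6}[A-Z0-9]{2}(?:[A-Z0-9]{3})?"


def find_bic(message: str, tag: str) -> Optional[str]:
    """Return the first BIC-shaped token in the field that starts with :<tag>[letter]:."""
    # Capture the field body: everything up to the next line that opens a new tag,
    # or the end of the message.
    field = re.search(
        rf"^:{re.escape(tag)}[A-Z]?:(.*?)(?=^:\d{{2}}[A-Z]?:|\Z)",
        message,
        flags=re.MULTILINE | re.DOTALL,
    )
    if field is None:
        return None
    # The lookarounds stop the pattern matching inside a longer alphanumeric run.
    bic = re.search(rf"(?<![A-Za-z0-9]){BIC}(?![A-Za-z0-9])", field.group(1))
    return bic.group(0) if bic else None
```

With the sample message above, `find_bic(message, "53")` and `find_bic(message, "59")` should both return `HSBCGB2LXXX`, and a tag whose field holds no such token returns `None`. The pattern checks shape only, so any other 8- or 11-character upper-case token in the chosen field would match as well.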

CSharpNewBee
  • It is unclear why you are using a colon and lookbehind here. – anubhava Mar 12 '14 at 12:13
  • Your second string has a number after the colon, but only letters are allowed. – Barmar Mar 12 '14 at 12:16
  • The match from the first input `HSBCGB4LXXX` can't be found in the input `:32B:HsBfGB4LXXXHELLO`. Please correct the typo(s) in your question. – Bohemian Mar 12 '14 at 12:18
  • @Barmar I was trying to cater for any character/number before the H – CSharpNewBee Mar 12 '14 at 12:18
  • @anubhava, I was working off an existing pattern I have, and was trying to adapt it. Perhaps this isn't the best approach, which is why I posted. – CSharpNewBee Mar 12 '14 at 12:19
  • @Bohemian it's not a typo. I was trying to show that anything could appear before the letter H – CSharpNewBee Mar 12 '14 at 12:20
  • Where is `H` in your original specification? And why do you only allow 3 `X` at the end, when the specification says `3 optional character`? Doesn't that mean any character? – Barmar Mar 12 '14 at 12:20
  • @CSharpNewBee, in your example the string being matched is :32B:HsBfGB4LXXXHELLO, while the return is HSBCGB4LXXX. This is definitely a typo. – elias Mar 12 '14 at 12:27
  • What language are you using? When you say "53 or 59", do you mean your calling code specifies two alternate values that can be used to match? If so, why not just call twice with different values and check which one finds a match? – Bohemian Mar 12 '14 at 21:23
  • Single value @Bohemian, so on each iteration of my main source file, I'd pass in, say, 59, which should find the BIC in any part of that string, then say 70 or whatever in the next run – CSharpNewBee Mar 12 '14 at 22:29
  • Please tell us which language or tool you are using! Eg java, perl, bash, python, .net, c#, ruby, whatever. A proper answer, and to an extent even the regex, depends a lot on the language. – Bohemian Mar 12 '14 at 23:39

2 Answers


Your regex can be simplified, and corrected to allow a character before the H, to:

:32[^:]:.?([a-zA-Z]{6}\d[a-zA-Z]XXX)

The changes made were:

  • Dropped the lookbehind and made it part of the match instead
  • Inserted .? to allow one optional character
  • ([a-zA-Z]{4}[a-zA-Z]{2}) ==> [a-zA-Z]{6} (4+2=6)
  • [0-9] ==> \d (\d means "any digit")
  • [X]{3} ==> XXX (easier to read and fewer characters)

Group 1 of the match contains your target.
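As a small illustration of reading group 1, here is the regex above applied to the two inputs from the question. Python is used purely to keep the example runnable, and `extract` is a hypothetical helper name; in .NET the same value would be read from `match.Groups[1].Value`.

```python
import re

# The regex from this answer; group 1 holds the code without the tag prefix
# or the optional leading character.
PATTERN = re.compile(r":32[^:]:.?([a-zA-Z]{6}\d[a-zA-Z]XXX)")


def extract(line: str):
    """Return group 1 of the first match, or None when nothing matches."""
    match = PATTERN.search(line)
    return match.group(1) if match else None
```

Both `:32B:HsBfGB4LXXXHELLO` and `:32B:2HsBfGB4LXXXHELLO` yield `HsBfGB4LXXX` here, because `.?` absorbs at most one stray character. Two or more characters before the code still produce no match.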

Bohemian
  • this includes the character before the H – CSharpNewBee Mar 12 '14 at 12:20
  • @CSharpNewBee try it now - use group 1 of the match – Bohemian Mar 12 '14 at 12:24
  • smashing, can this be achieved without using Groups? – CSharpNewBee Mar 12 '14 at 12:26
  • also, if I place anything before the H, then it returns nothing – CSharpNewBee Mar 12 '14 at 12:26
  • yes, but let me ask, you've added an edit to your question - is the "32" constant, or can it be any two digits? And, should this only match when at the start of a line? Also, what tool/language are you using to execute your regex? Finally, with regex, it really helps if you show several input examples and what the expected target is, and some examples of non-matching input. – Bohemian Mar 12 '14 at 12:27
  • what about the other questions I asked? Can you provide more info? – Bohemian Mar 12 '14 at 13:16
  • Sorry Bohemian, got side tracked. So, the BIC could be anywhere in a string, it doesn't necessarily have to be at the start of the line. I am placing all this in C#. I am using Rubular to process your examples – CSharpNewBee Mar 12 '14 at 13:22
  • See above changes Bohemian, sorry for the delay – CSharpNewBee Mar 12 '14 at 15:43

I'm not quite sure I understand your question completely, as your regular expression does not fully match what you described above it. For example, you mention 3 optional characters, but the regex uses 3 mandatory Xs.

However, the actual regular expression can be further cleaned:

  • instead of [a-zA-Z]{4}[a-zA-Z]{2}, you can simply use [a-zA-Z]{6}, and the grouping parentheses around this might be unnecessary;
  • the {1} can be left out without any change in the result;
  • the X does not need surrounding brackets.

All in all (?<=:32[^:]:)([a-zA-Z]{6}[0-9][a-zA-Z]X{3}) is shorter and matches in the very same cases.
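A quick way to check that claim against the two inputs from the question is sketched below. Python is used only for illustration, and the names `ORIGINAL`, `SIMPLIFIED`, and `first_match` are made up for the example.

```python
import re

# The regex from the question and the shortened form from this answer.
ORIGINAL = re.compile(r"(?<=:32[^:]:)(([a-zA-Z]{4}[a-zA-Z]{2})[0-9][a-zA-Z]{1}[X]{3})")
SIMPLIFIED = re.compile(r"(?<=:32[^:]:)([a-zA-Z]{6}[0-9][a-zA-Z]X{3})")


def first_match(pattern, text):
    """Return the full text of the first match, or None when nothing matches."""
    m = pattern.search(text)
    return m.group(0) if m else None


SAMPLES = [":32B:HsBfGB4LXXXHELLO", ":32B:2HsBfGB4LXXXHELLO"]
# One (original, simplified) result pair per sample input.
RESULTS = [(first_match(ORIGINAL, s), first_match(SIMPLIFIED, s)) for s in SAMPLES]
```

Both patterns agree on each sample: they find `HsBfGB4LXXX` in the first string and nothing in the second, since the lookbehind still requires the code to start immediately after the tag.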

If you give a better description of the domain, further improvements are probably possible.

elias
  • thanks, your edit clears up the part before the colon, but there is still some confusion about the 3 end characters being optional, and about the previous two characters being arbitrary alphanumeric or a number followed by an alphabetic – elias Mar 12 '14 at 12:35
  • so what I can come up with now is: :\d\d\w?:([a-zA-Z]{6}[0-9][a-zA-Z]X{3}) – elias Mar 12 '14 at 12:36
  • also, as usual with regular expressions, it would help a lot to know what it should NOT match. Depending on the full domain, it might be that the whole part before the second colon can be left out. – elias Mar 12 '14 at 12:42