
I'm trying to parse the result of a whois query. I'm interested in retrieving the route, descr and origin fields as shown below:

route:          129.45.67.8/91
descr:          FOO-BAR
descr:          Information 2
origin:         AS5462
notify:         foo@bar.net
mnt-by:         AS5462-MNT
remarks:        For abuse notifications please file an online case @ http://www.foo.com/bar
changed:        foo@bar.net 20000101
source:         RIPE
remarks:        ****************************
remarks:        * THIS OBJECT IS MODIFIED
remarks:        * Please note that all data that is generally regarded as personal
remarks:        * data has been removed from this object.
remarks:        * To view the original object, please query the RIPE Database at:
remarks:        * http://www.foo.net/bar
remarks:        ****************************

route:          123.45.67.8/91
descr:          FOO-BAR
origin:         AS3269
mnt-by:         BAR-BAZ
changed:        foo@bar.net 20000101
source:         RIPE
remarks:        ****************************
remarks:        * THIS OBJECT IS MODIFIED
remarks:        * Please note that all data that is generally regarded as personal
remarks:        * data has been removed from this object.
remarks:        * To view the original object, please query the RIPE Database at:
remarks:        * http://www.ripe.net/whois
remarks:        ****************************

To do so I use the following code and regex:

import re

search = "FOO-BAR"

# FILE is the path to the saved whois output
with open(FILE, "r") as f:
    content = f.read()
    r = re.compile(r'route:\s+(.*)\ndescr:\s+(.*' + search + r'.*).*\norigin:\s+(.*)', re.IGNORECASE)
    res = r.findall(content)
    print(res)

It works as expected when a result contains only one descr field, but it skips results that contain multiple descr fields.

I get the following result in this case:

[('123.45.67.8/91', 'FOO-BAR', 'AS3269')]

The expected result contains the route field, the first descr field (when there are several descr lines) and the origin field:

[('129.45.67.8/91', 'FOO-BAR', 'AS5462'), ('123.45.67.8/91', 'FOO-BAR', 'AS3269')]

What would be the correct regex to parse results that contain either one or several descr lines?

Moustache
  • What's wrong with https://code.google.com/p/pywhois/ ? – OldTinfoil Aug 21 '13 at 11:58
  • A regex seems a bit overkill for this task. What about line.startswith() and some counters? – lucasg Aug 21 '13 at 11:59
  • @IntrepidBrit My understanding is that pywhois produces parsed WHOIS data from a given domain name, whereas in this particular case the WHOIS data are produced from a free-text search. – Moustache Aug 21 '13 at 12:08
  • @georgesl I might look at line.startswith(); I didn't know about it. However, I'd like to have a solution using regex as well. – Moustache Aug 21 '13 at 12:09
  • I completely rewrote the question; hopefully it is on-topic now, thanks. – Moustache Aug 21 '13 at 14:09
  • Split string by route or \n\n and then use your regex for each substring. – Mehraban Aug 22 '13 at 07:20
  • @Moustache - that's okay. Just wanted to double check you weren't needlessly re-inventing the wheel :) – OldTinfoil Aug 22 '13 at 10:52
  • @Moustache - You're correct in your pywhois understanding, but I was just making you aware of its existence. You might have been able to modify your program "upstream" :) – OldTinfoil Sep 02 '13 at 12:13
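
The non-regex route suggested in the comments (split the text into records on blank lines, then inspect each line by its field name) could look roughly like the sketch below. It is only an illustration: `parse_routes` and `SAMPLE` are hypothetical names, the sample is a trimmed copy of the data in the question, and it assumes records are always separated by an empty line. It uses `str.partition` to split each line at the first colon rather than testing prefixes one by one.

```python
# Trimmed copy of the whois output from the question (assumption: records
# are separated by exactly one blank line).
SAMPLE = """route:          129.45.67.8/91
descr:          FOO-BAR
descr:          Information 2
origin:         AS5462
notify:         foo@bar.net

route:          123.45.67.8/91
descr:          FOO-BAR
origin:         AS3269
mnt-by:         BAR-BAZ
"""


def parse_routes(text, search):
    """Return (route, descr, origin) for each record with a descr containing search."""
    results = []
    for block in text.split("\n\n"):
        route = origin = None
        descrs = []
        for line in block.splitlines():
            # Split "key:   value" at the first colon only, so URLs in values survive.
            key, sep, value = line.partition(":")
            if not sep:
                continue
            value = value.strip()
            if key == "route":
                route = value
            elif key == "descr":
                descrs.append(value)
            elif key == "origin":
                origin = value
        # Keep the first descr line that contains the search term (case-insensitive).
        match = next((d for d in descrs if search.lower() in d.lower()), None)
        if route and origin and match:
            results.append((route, match, origin))
    return results


routes = parse_routes(SAMPLE, "FOO-BAR")
print(routes)
```

Because each record is handled on its own, the number of descr lines no longer matters, at the cost of a few more lines of code than a single pattern.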

1 Answer


I've come quite close to what you asked for:

import re

search = "FOO-BAR"
term = re.escape(search)  # escape the search term so it is matched literally

with open('whois', "r") as f:
    content = f.read()
    # Adjacent raw string literals are joined into a single pattern.
    r = re.compile(
            r'route:\s+(.*)\n'                      # route value
            r'(descr:\s+(?!' + term + r').*\n)*'    # 0-n descr lines without the search term
            r'descr:\s+(' + term + r')\n'           # the descr line holding the search term
            r'(descr:\s+(?!' + term + r').*\n)*'    # 0-n further descr lines without it
            r'origin:\s+(.*)',                      # origin value
            re.IGNORECASE)
    res = r.findall(content)
    print(res)

The result:

[('129.45.67.8/91', '', 'FOO-BAR', 'descr:          Information 2\n', 'AS5462'),
 ('123.45.67.8/91', '', 'FOO-BAR', '', 'AS3269')]

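With a bit of cleanup (dropping the second and fourth items of each tuple, which come from the repeated descr groups), you get the result you expect.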
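
One possible way to do that cleanup, sketched here as an assumption rather than as part of the original answer, is to make the two repeated descr groups non-capturing with `(?:...)`, so that `findall` returns three-item tuples directly. The inline sample is a trimmed copy of the data in the question, and `term` and `pattern` are illustrative names.

```python
import re

# Trimmed copy of the whois output from the question.
content = """route:          129.45.67.8/91
descr:          FOO-BAR
descr:          Information 2
origin:         AS5462
notify:         foo@bar.net

route:          123.45.67.8/91
descr:          FOO-BAR
origin:         AS3269
mnt-by:         BAR-BAZ
"""

search = "FOO-BAR"
term = re.escape(search)  # match the search term literally

pattern = re.compile(
    r'route:\s+(.*)\n'                        # group 1: route value
    r'(?:descr:\s+(?!' + term + r').*\n)*'    # other descr lines, not captured
    r'descr:\s+(' + term + r')\n'             # group 2: descr line with the search term
    r'(?:descr:\s+(?!' + term + r').*\n)*'    # other descr lines, not captured
    r'origin:\s+(.*)',                        # group 3: origin value
    re.IGNORECASE)

res = pattern.findall(content)
print(res)
# [('129.45.67.8/91', 'FOO-BAR', 'AS5462'), ('123.45.67.8/91', 'FOO-BAR', 'AS3269')]
```

The alternative is to keep the original capturing groups and pick items 0, 2 and 4 from each tuple after the fact; the non-capturing form just avoids that extra step.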

lucasg