Parsing structured text in Ruby

Question

There are several questions on SO about parsing structured text in Ruby, but none of them apply to my case.

I'm the author of the Ruby Whois library. The library includes several parsers to parse a WHOIS response and extract the properties from the content.

So far, I used two approaches:

Regular expressions for base parsers (e.g. whois.aero)
StringScanner for advanced parsers (e.g. whois.nic.it)

Regular expressions are not efficient because if I need to extract 15 properties, I need to scan the same response at least 15 times.
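To make the cost concrete, here is a minimal sketch of the one-pattern-per-property style. The response text, the field names, and the `extract_properties` helper are all hypothetical and only for illustration; they are not taken from the Whois library.

```ruby
# Hypothetical WHOIS-like response; the field names are invented for illustration.
RESPONSE = <<~TEXT
  Domain Name: example.aero
  Registrar: Example Registrar
  Created On: 2001-01-01
TEXT

# One pattern per property (hypothetical names).
PATTERNS = {
  domain:     /^Domain Name:\s*(.+)$/,
  registrar:  /^Registrar:\s*(.+)$/,
  created_on: /^Created On:\s*(.+)$/
}.freeze

# Each lookup starts a fresh scan from the beginning of the text,
# so N properties mean at least N scans over the same response.
def extract_properties(text)
  PATTERNS.each_with_object({}) do |(name, pattern), result|
    match = text.match(pattern)
    result[name] = match && match[1].strip
  end
end

PROPERTIES = extract_properties(RESPONSE)
```

With 15 properties the `PATTERNS` hash would hold 15 entries, and `extract_properties` would walk the response 15 times.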

StringScanner is a nice library, but creating an efficient scanner is not that simple.
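For comparison, a single-pass scan with StringScanner might look like the sketch below. It assumes a simple `key: value` line layout, and the sample text and the `scan_properties` helper are hypothetical. Real responses with sections or multi-line values would need more branches, which is where the difficulty starts.

```ruby
require "strscan"

# Hypothetical response; real WHOIS layouts vary from server to server.
SAMPLE = <<~TEXT
  % hypothetical comment line
  Domain: example.it
  Status: ok

  Created: 2001-01-01 10:00:00
TEXT

# Walk the text once, collecting every "key: value" line and
# skipping any line that does not fit that shape.
def scan_properties(text)
  scanner = StringScanner.new(text)
  properties = {}
  until scanner.eos?
    if scanner.scan(/([^:\n]+):[ \t]*([^\n]*)\n?/)
      properties[scanner[1].strip] = scanner[2].strip
    else
      scanner.skip(/[^\n]*\n?/) # not a "key: value" line; move to the next one
    end
  end
  properties
end

SCANNED = scan_properties(SAMPLE)
```

The scanner position only moves forward, so the whole response is read once regardless of how many properties are collected.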

I was wondering whether there are other Ruby tools you would suggest for implementing a WHOIS record parser. I was reading about Treetop, but because WHOIS records lack a specification, I believe Treetop is not the right solution.

Any suggestion?

I was doing some work on parsing whois results several years ago, and the lack of standards made me nuts. I think all you can really do is create a specific page parser for each source. It's really an ugly solution but there's no consistency in the data order or format that would allow some sort of templating or pattern matching. — the Tin Man, Feb 17 '11 at 05:44

Charlie Martin · Accepted Answer · 2011-02-17T03:12:49.953

5

The obvious one is Ragel. WHOIS records are pretty simple and have a limited set of key terms and such, so it should be straightforward. And Ragel parsers have proven very efficient.

Update: as promised.

Okay, so why use Ragel? Basically, anything that can be described as a finite state machine can be described in Ragel, which then generates code for a highly efficient parser. Such a parser is usually much faster than a generalized regular expression engine, simply because the generated code only has to implement the one machine you described rather than a general-purpose matcher.

Now, you could take this further, for example by using the ABNF Generator here. Then, your description to start with could be as simple as something like

WHOIS ::= RECORD*
RECORD ::= FIELDNAME ':' FIELDVALUE
FIELDVALUE ::= NAMESTRING | IPADDRESS | DOMAINNAME

(I make no claim that's particularly ABNF syntax, just a rough BNF.) The point is that you describe the parser in a more or less intuitive form, and let the generator make the exciting code part.
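To show what that description boils down to, here is a hand-written Ruby state machine for the `RECORD` rule. This is only a sketch: it is not Ragel syntax and not Ragel output, the `parse_records` name is made up, and it does not try to tell NAMESTRING, IPADDRESS, and DOMAINNAME apart. A generator would produce the equivalent of this loop from the grammar.

```ruby
# Hand-written state machine for the rough rule
#   RECORD ::= FIELDNAME ':' FIELDVALUE
# Illustration only; this is not what Ragel emits.
def parse_records(text)
  records = []
  state = :fieldname
  name = +""
  value = +""

  text.each_char do |char|
    case state
    when :fieldname
      if char == ":"
        state = :fieldvalue      # separator seen, start reading the value
      elsif char == "\n"
        name = +""               # line without ':' is not a RECORD; drop it
      else
        name << char
      end
    when :fieldvalue
      if char == "\n"
        records << [name.strip, value.strip]
        name = +""
        value = +""
        state = :fieldname
      else
        value << char            # later ':' characters belong to the value
      end
    end
  end

  # Flush a final record that is not newline-terminated.
  records << [name.strip, value.strip] if state == :fieldvalue
  records
end
```

For example, `parse_records("domain: example.com\nnot a record\nip: 192.0.2.1")` yields two name/value pairs and ignores the middle line, reading each character exactly once.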

edited Feb 17 '11 at 03:12

answered Feb 16 '11 at 22:46

Charlie Martin



Might be helpful to explain why this is the obvious choice. I agree, but might not be so obvious to others. Explain what he will want to do and why this is the best tool. – Jordan Dea-Mattson Feb 16 '11 at 22:51
Yeah, but I'm at the day job. I'll extend this tonight. – Charlie Martin Feb 17 '11 at 00:16
Does Ragel require the input string to follow a strict specification? WHOIS responses are not programming languages and they don't have a well-defined specification. I often have to guess, and to ignore entire portions of the text. Here's an example http://goo.gl/AJ8dQ – Simone Carletti Feb 17 '11 at 08:32
Well, define "strict specification". It does, but there's nothing I see there that can't be defined strictly in that sense, i.e., by a finite state machine. -- Look, you're already doing the parsing with regular expressions; a regular expression is just a specification of a finite state machine. Using Ragel would just let you describe a specific FSM and implement it efficiently. – Charlie Martin Feb 17 '11 at 16:03
