Parse a document with XPath

Question

I need to parse a document with a structure that I've never seen before. It looks like this:

<cat:707>
<begad:00216057>
<zip:48650>
<addr:2100 N. HURON RD, PINCONNING, MI USA>
COUNTRY 10 Mi. N. of Midland, 3 bedroom, 2 baths, appliances furnished, 300x500 finished pole barn on 5 acres,  $750/mo. + utilities, 989-965-1118.
<endad>


<cat:710>
<begad:00216094>
<zip:48640>
<addr:1100 S HOMER RD, MIDLAND, MI USA>
IMMEDIATE Occupancy, extra clean, small 2 bedroom by nature center. Pet maybe/extra $400 deposit/references 839-4552
<endad>

How would I parse something like this in PHP to get the info after the colons (i.e., the 707 in the first cat) and the text before <endad>?

Something someone made up? Where did you get it? Any documentation on its structure? It looks like a line-oriented structure. — Francis Avila, Mar 01 '13 at 20:29
@BrianReeves Since it's not XML - you may want to consider adding an appropriate language tag so that others may be able to offer suggestions... — Jon Clements, Mar 01 '13 at 20:32
Ya it might be something someone made up. It's for classified ads. I'm sure the vendor made it up therefore making everybody's job harder. It's not line oriented. I just formatted it that way to be readable. Everything is all on one line. Thanks. — Brian Reeves, Mar 01 '13 at 20:32

score 1 · Accepted Answer · edited Mar 01 '13 at 21:15

This looks like something someone made up, but you can probably figure it out easily enough.

Here's some Python that seems to work. From here you can convert to XML and parse with XPath if you want.

import re

parse_re = (r"""
<(?P<key>\w+):(?P<value>[^>]+)>  # <key:value>
| (?<=>)\s*(?P<description>.*?)\s+<endad> #description
""", re.VERBOSE)

adparser = re.compile(*parse_re)

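def getrecords(text):
    """Yield one dict per ad; a record is complete once its description is seen."""
    record = {}
    for match in adparser.finditer(text):
        if match.group('key'):
            # A <key:value> tag: store it on the record being built.
            record[match.group('key')] = match.group('value')
        elif match.group('description'):
            # The free text before <endad> closes the current record.
            record['description'] = match.group('description')
            yield record
            record = {}

# `text` must hold the raw ad data as a string.
print(list(getrecords(text)))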
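
One caveat: the comments say the real data is all on one line. If the tags there are separated by spaces, the lazy description group in the pattern above appears able to start right after the first tag and swallow the remaining tags. As a hedged alternative (a different technique, not the regex above), here is a minimal sketch that splits on <endad> first and then pulls the tags out of each chunk. It assumes every ad ends with <endad> and that the ad text never contains something shaped like <key:value>. The names TAG_RE, parse_ads, and sample are made up for illustration, and the descriptions in sample are shortened from the question.

```python
import re

# Matches one <key:value> tag, e.g. <zip:48650>.
TAG_RE = re.compile(r"<(\w+):([^>]+)>")

def parse_ads(text):
    """Return one dict per ad; assumes every ad is terminated by <endad>."""
    records = []
    for chunk in text.split("<endad>"):
        if not chunk.strip():
            continue  # skip the empty remainder after the final <endad>
        # findall returns (key, value) tuples, which dict() accepts directly.
        record = dict(TAG_RE.findall(chunk))
        # Whatever is left once the tags are removed is the ad text.
        record["description"] = TAG_RE.sub("", chunk).strip()
        records.append(record)
    return records

# Single-line sample with spaces between tags (descriptions shortened).
sample = (
    "<cat:707> <begad:00216057> <zip:48650> "
    "<addr:2100 N. HURON RD, PINCONNING, MI USA> "
    "COUNTRY 10 Mi. N. of Midland, 3 bedroom, 2 baths <endad> "
    "<cat:710> <begad:00216094> <zip:48640> "
    "<addr:1100 S HOMER RD, MIDLAND, MI USA> "
    "IMMEDIATE Occupancy, extra clean <endad>"
)

ads = parse_ads(sample)
print(ads)
```

Because the split happens before any tag matching, this version should behave the same whether the ads arrive on one line or many.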

I see you updated your question to specify you're using PHP. The same regular expression seems to work with PHP's PCRE functions (preg_*) as well:

$parse_re = '/
<(?P<key>\w+):(?P<value>[^>]+)>  # <key:value>
| (?<=>)\s*(?P<description>.*?)\s+<endad> #description
/x';

preg_match_all($parse_re, $input, $matches, PREG_SET_ORDER);

var_export($matches);
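
For the "convert to XML and parse with XPath" step mentioned earlier, here is a minimal Python sketch using the standard library's xml.etree.ElementTree, which supports a limited XPath subset. It assumes the records are already dicts shaped like the Python output above and that every key is a valid XML element name. The element names ads and ad and the helper records_to_xml are made up for illustration, and the descriptions are shortened from the question.

```python
import xml.etree.ElementTree as ET

# Records shaped like the parser output for the sample in the question
# (descriptions shortened here).
records = [
    {"cat": "707", "begad": "00216057", "zip": "48650",
     "addr": "2100 N. HURON RD, PINCONNING, MI USA",
     "description": "COUNTRY 10 Mi. N. of Midland, 3 bedroom, 2 baths"},
    {"cat": "710", "begad": "00216094", "zip": "48640",
     "addr": "1100 S HOMER RD, MIDLAND, MI USA",
     "description": "IMMEDIATE Occupancy, extra clean, small 2 bedroom"},
]

def records_to_xml(records):
    """Build an <ads> root with one <ad> child per record (hypothetical names)."""
    root = ET.Element("ads")
    for record in records:
        ad = ET.SubElement(root, "ad")
        for key, value in record.items():
            # Each key becomes a child element; assumes keys are valid XML names.
            ET.SubElement(ad, key).text = value
    return root

root = records_to_xml(records)

# ElementTree's XPath subset: select ads whose <zip> child equals 48640.
midland = root.findall("./ad[zip='48640']")
# Collect the category of every ad.
cats = [ad.findtext("cat") for ad in root.findall("./ad")]

print(ET.tostring(root, encoding="unicode"))
print([ad.findtext("addr") for ad in midland], cats)
```

ElementTree only covers part of XPath; for full XPath 1.0 expressions a library such as lxml would be needed, and in PHP the equivalent would be DOMDocument with DOMXPath.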

Thanks! I'm going to give this a try. — Brian Reeves, Mar 04 '13 at 14:26
