I'm using urllib2 and sre in Python to parse data from aprs.fi so that I can use weather data in some real-time high-altitude balloon code I'm working on. The parsing code is pretty simple:

import urllib2
import sre

APRStracking = urllib2.urlopen( "http://api.aprs.fi/api/get?name=KD8REX&what=loc&apikey=42457.M4AFa3hdkXG31&format=xml" )

APRSxml = APRStracking.read()

latitude = sre.findall( '<la.*>(.*)</la.*>', APRSxml )
print latitude

The data I'm trying to parse is XML that looks like this:

<xml>
   <command>get</command>
   <result>ok</result>
   <what>loc</what>
   <found>1</found>
   <entries>
      <entry>
         <name>KD8REX</name>
         <type>l</type>
         <time>1339339410</time>
         <lasttime>1339339410</lasttime>
         <lat>41.95550</lat>
         <lng>-83.65567</lng>
         <altitude>2204.62</altitude>
         <course>15</course>
         <speed>15</speed>
         <symbol>/O</symbol>
         <srccall>KD8REX</srccall>
         <dstcall>APT311</dstcall>
         <status>UofM H.A.S. - Go Blue!</status>
         <status_lasttime>1339339600</status_lasttime>
         <path>WIDE1-1,WIDE3-3,qAR,W8SGZ</path>
      </entry>
   </entries>
</xml>

I'm not terribly familiar with Python, but my understanding of sre.findall() is that it searches APRSxml for every substring that matches the regular expression and appends whatever the parentheses capture to the list latitude. So, in this example, the two tags that should match are <lasttime> and <lat>. However, when I run this code it outputs only the <lat> value, not the <lasttime> value. Frankly, that's all I really need for my code to work, but out of curiosity, I'd appreciate it if anyone could tell me why it isn't behaving as expected. Thanks.

BrenBarn
  • It produces the expected output for me. Also, why are you using `sre` instead of `re`? – BrenBarn Jun 26 '12 at 03:39
  • I actually just realized I'm using a deprecated module haha... But regardless after changing to re I'm still only getting the lat value. – ricewhite Jun 26 '12 at 03:45
  • You should check that the value of APRSxml is what you think it is (by `print`ing it, maybe). As I said, when I run your code `findall` returns two values. – BrenBarn Jun 26 '12 at 03:47
  • I confirmed after printing that <lasttime> and <lat> are both there. And strangely enough, I changed the regular expression to '(,*)', thinking that there was no way lat would match it, while lasttime should. But printing latitude still gave me the lat value. This seems to contradict everything I know about regular expressions. – ricewhite Jun 26 '12 at 03:52
  • I first tried the regex copy/pasting the data from the question and had the same results as @BrenBarn. But running the code including the retrieval of the data actually gives the results ricewhite reports. – Junuxx Jun 26 '12 at 04:05
  • Okay, so that means the problem has nothing to do with the regular expression, but with the data returned by urllib. – BrenBarn Jun 26 '12 at 04:11
  • I see the problem. The problem is that the file does not contain the data you pasted. It contains the same XML with no whitespace. This is obviously vital when parsing using regular expressions. The lesson is, paste the data you are using, not the data you wish you were using. – BrenBarn Jun 26 '12 at 04:12
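
The whitespace difference described in the comments can be reproduced without a network call. This is only a minimal sketch, assuming the API returns the XML on a single line; both sample strings are abbreviated, hand-made stand-ins for the real response, and the code is written to run on Python 2 or 3:

```python
import re

# Abbreviated stand-ins for the response (assumed samples, not fetched from aprs.fi).
pretty = "<entry>\n  <lasttime>1339339410</lasttime>\n  <lat>41.95550</lat>\n</entry>\n"
compact = "<entry><lasttime>1339339410</lasttime><lat>41.95550</lat></entry>"

pattern = '<la.*>(.*)</la.*>'

# '.' does not match a newline by default, so each line is matched on its own.
pretty_result = re.findall(pattern, pretty)

# With everything on one line, the greedy '.*' runs across both elements.
compact_result = re.findall(pattern, compact)

print(pretty_result)   # ['1339339410', '41.95550']
print(compact_result)  # ['41.95550']
```

The same pattern gives two results on the indented text and one on the single-line text, which matches what was reported above.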

4 Answers

Looking at the format parameter in your URL, I noticed that you specify format=xml. I changed it to json and look at that, you get JSON!

{
  "command":"get",
  "result":"ok",
  "what":"loc",
  "found":1,
  "entries":[
    {
      "name":"KD8REX",
      "type":"l",
      "time":"1339339410",
      "lasttime":"1339339410",
      "lat":"41.95550",
      "lng":"-83.65567",
      "altitude":"2204.62",
      "course":"15",
      "speed":"15",
      "symbol":"\/O",
      "srccall":"KD8REX",
      "dstcall":"APT311",
      "status":"UofM H.A.S. - Go Blue!",
      "status_lasttime":"1339339600",
      "path":"WIDE1-1,WIDE3-3,qAR,W8SGZ"
    }
  ]
}

It's easy to parse. Easier than XML:

import urllib2, json

url = 'http://api.aprs.fi/api/get?name=KD8REX&what=loc&apikey=42457.M4AFa3hdkXG31&format=json'
data = json.loads(urllib2.urlopen(url).read())

for entry in data['entries']:
  print 'Latitude:', entry['lat']

It's really easy to work with. data is just a Python dictionary.
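
One detail worth noting from the sample above: the numeric fields arrive as strings. A small offline sketch of handling that, using an abbreviated copy of the JSON shown earlier rather than a live request (the variable names are just for illustration), written to run on Python 2 or 3:

```python
import json

# Abbreviated copy of the JSON shown above (sample data, not a live request).
raw = ('{"result":"ok","found":1,"entries":[{"name":"KD8REX",'
       '"lat":"41.95550","lng":"-83.65567","altitude":"2204.62"}]}')

data = json.loads(raw)
entry = data['entries'][0]

# The values are strings in the sample, so convert before doing arithmetic.
lat = float(entry['lat'])
lng = float(entry['lng'])
altitude = float(entry['altitude'])

print(lat)
print(lng)
print(altitude)
```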

Blender

You need to change the greedy stars to lazy matches (*?).

>>> re.findall('<la.*?>(.*?)</la.*?>', APRSxml )
['1339339410', '41.95550']

What currently happens is that <la.*> matches everything from the first <la to the last occurrence of > that still allows the rest of the expression to find a match. So, <la.*> matches

<lasttime>1339339410</lasttime><lat>

This explains why the lasttime value isn't reported: the whole string yields a single match, and its capture group contains only the lat value.
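
To see exactly what each part consumes, the opening-tag portion can be put in its own capture group. This is a sketch on an abbreviated one-line sample (an assumed stand-in for the real response), runnable on Python 2 or 3:

```python
import re

# Abbreviated one-line stand-in for the response (assumed sample data).
xml = "<lasttime>1339339410</lasttime><lat>41.95550</lat><lng>-83.65567</lng>"

# Group 1 shows what the opening-tag part matched, group 2 the captured value.
greedy = re.search('(<la.*>)(.*)</la.*>', xml)
lazy = re.search('(<la.*?>)(.*?)</la.*?>', xml)

print(greedy.group(1))  # <lasttime>1339339410</lasttime><lat>
print(greedy.group(2))  # 41.95550
print(lazy.group(1))    # <lasttime>
print(lazy.group(2))    # 1339339410

# With lazy quantifiers, findall picks up both elements.
lazy_all = re.findall('<la.*?>(.*?)</la.*?>', xml)
print(lazy_all)         # ['1339339410', '41.95550']
```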

Junuxx

Try this non-greedy version:

>>> latitude = re.findall('<la.*?>(.*?)</la.*?>', APRSxml)
>>> print latitude
['1339339410', '41.95550']

But if you want "latitude" why not just do this?

latitude = re.findall('<lat>(.*?)<', APRSxml)
mhawke
  • Ahhhhh, that did the trick, thank you! I know just matching <lat> would be sufficient for what I needed, but I'm not too familiar with Python, and I figure a good way to learn is just to fool around with the language. This was just a little quirk that I couldn't figure out. – ricewhite Jun 26 '12 at 04:19
  • If you are looking for a better solution than parsing XML with regular expressions (which is definitely **not** recommended), take a look at what [Blender discovered](http://stackoverflow.com/a/11200728/21945) for you. – mhawke Jun 26 '12 at 04:32

Python includes an easy-to-use XML parser that is ideally suited to this task:

>>> import urllib2
>>> from xml.etree.ElementTree import parse
>>> APRStracking = urllib2.urlopen("http://api.aprs.fi/api/get?name=KD8REX&what=loc&apikey=42457.M4AFa3hdkXG31&format=xml")
>>> tree = parse(APRStracking)
>>> tree.find('entries/entry/lat').text
'41.95550'
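
The same approach can be tried offline. The following is a sketch only: the XML string is an abbreviated, hand-made stand-in for the real response, and `positions` is a name chosen for illustration. It uses `fromstring` instead of `parse` because the data is already in a string, and it runs on Python 2 or 3:

```python
import xml.etree.ElementTree as ET

# Abbreviated one-line stand-in for the response (assumed sample data).
raw = ("<xml><result>ok</result><entries><entry>"
       "<name>KD8REX</name><lasttime>1339339410</lasttime>"
       "<lat>41.95550</lat><lng>-83.65567</lng>"
       "</entry></entries></xml>")

root = ET.fromstring(raw)  # root is the <xml> element

positions = []
for entry in root.findall('entries/entry'):
    # findtext returns the child element's text, or None if the child is missing.
    positions.append((entry.findtext('name'),
                      float(entry.findtext('lat')),
                      float(entry.findtext('lng'))))

print(positions)
```

Because the parser works on the element tree rather than the raw text, it does not matter whether the response is indented or on one line.
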
Raymond Hettinger