I'm trying to extract latitudes and longitudes from this piece of HTML (there are two latitude/longitude pairs here, and I need it to work for any number of coordinates):

<script type="text/javascript"> 
[...]
truvo.data['map']= [{"lat":50.469585,"lon":4.487113,"id":"fr_BE_YP_PAID_16758523_0000_2840991_8600_20139917392","number":"1","display":"1","customerid":"16758523","addressid":"2840991","part":"base","type":"paid"},{"lat":50.721645,"lon":4.6253505,"id":"fr_BE_YP_PAID_12075596_0000_2315340_8600_20139200640","number":"2","display":"2","customerid":"12075596","addressid":"2315340","part":"base","type":"paid"}]
;
</script>   

I tried several methods:

how to access latitude and longtitude in a script with beautifulsoup?

How to scrape latitude longitude in beautiful soup

and every other kind of Stack Overflow suggestion, but nothing works.

If I use a regex pattern, would this one be correct?

'("lat"|"lon"):(-?\d{1,3}\.\d+)'

Does someone have an idea?

Thanks a lot,

Marie

MarieC

1 Answer

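You are almost there: your pattern already works with re.findall as long as data is a string. Keep the optional -? so that negative coordinates also match. Note that the first group captures the key without the colon:

>>> re.findall(r'("lat"|"lon"):(-?\d{1,3}\.\d+)', data)
[('"lat"', '50.469585'),
 ('"lon"', '4.487113'),
 ('"lat"', '50.721645'),
 ('"lon"', '4.6253505')]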

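Or you can also try this looser pattern (which already worked for you); here the first group includes the colon, and a leading minus sign is not captured: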

>>> re.findall(r'(?is)("lat":|"lon":)([0-9.]+)',data)
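
If you want numeric pairs rather than strings, the matches can be converted and grouped. This is only a minimal sketch: the data string is a shortened copy of the script from the question, extract_coords is a hypothetical helper name, and it assumes every "lat" is immediately followed by its "lon", as in the sample.

```python
import re

# Shortened copy of the script content from the question (assumed to be a str).
data = ('truvo.data[\'map\']= [{"lat":50.469585,"lon":4.487113,"id":"a"},'
        '{"lat":50.721645,"lon":4.6253505,"id":"b"}]')

# Matches the key "lat" or "lon", then a decimal number with an optional minus sign.
PATTERN = re.compile(r'"(lat|lon)":(-?\d{1,3}\.\d+)')


def extract_coords(text):
    """Return a list of (lat, lon) float tuples.

    Assumes each "lat" is directly followed by its matching "lon".
    """
    values = [float(num) for _key, num in PATTERN.findall(text)]
    # Pair consecutive values: (lat1, lon1), (lat2, lon2), ...
    return list(zip(values[0::2], values[1::2]))


coords = extract_coords(data)
print(coords)
```

Because the pairing relies on the order of the keys, it would need adjusting if the page ever emitted "lon" before "lat".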
akash karothiya
  • Thanks a lot, it works on a string built from the script, but how can I extract the code from the HTML so that it is in string format? I usually do soup.find_all('script'), so the current format is bs4.element.Tag – MarieC May 12 '17 at 09:12
  • Use `str(soup.select('script'))` – akash karothiya May 12 '17 at 09:16
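
Since the value assigned to truvo.data['map'] in the question is valid JSON, another option (a different technique from the regex in the answer) is to locate the assignment and let the standard json module parse the array. A minimal sketch, assuming the script text is already a string, for example after the str(...) conversion suggested in the comment above, and that the page keeps the truvo.data['map']= [...] form; script_text and parse_map_data are hypothetical names:

```python
import json
import re

# Shortened copy of the script from the question, already converted to a str.
script_text = """<script type="text/javascript">
truvo.data['map']= [{"lat":50.469585,"lon":4.487113,"number":"1"},{"lat":50.721645,"lon":4.6253505,"number":"2"}]
;
</script>"""


def parse_map_data(text):
    """Return (lat, lon) tuples from the truvo.data['map'] assignment, if present."""
    # Find the assignment and consume the whitespace before the opening "[".
    match = re.search(r"truvo\.data\['map'\]\s*=\s*", text)
    if match is None:
        return []
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever follows it (here the ";" and the closing tag).
    records, _end = json.JSONDecoder().raw_decode(text, match.end())
    return [(rec["lat"], rec["lon"]) for rec in records]


pairs = parse_map_data(script_text)
print(pairs)
```

This reads the values by key, so it does not depend on the order of "lat" and "lon" or on the number of entries.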