I'm trying to extract some prices from the IKEA website, but the price format is pretty messy (whitespace, carriage returns, a stray comma). This is what I extracted:

        39,90 €
                            ,

I used Scrapy to do this, and so far no problem, except that I would like to get rid of everything that is not the price (and the euro symbol)!

I tried to use this regex (in Python 2.7):

re(\S[0-9]+([ ,]?[ ])([0-9]{2}?)u"\u20AC")

I'm new to programming and only learned what a regular expression is this afternoon, but I tried a massive number of possibilities without getting any better result than:

SyntaxError: unexpected character after line continuation character

If someone could take a few minutes to look at what I did and tell me where I went wrong, that would be great!

Cheers everyone

Bobafotz
  • Show the `repr()` of the data you extracted, to more clearly see the sequence of characters in the data. – Mark Tolonen Dec 05 '15 at 18:56
  • Here is the result : [u'\\r\\n\\t\\t\\t39, 90 \\u20ac\\r\\n\\t\\t\\t\\t\\t\\t\\t\\t', u'\\r\\n\\t\\t\\t\\t\\t\\t\\t\\r\\n\\t\\t\\t\\t\\t\\t'] – Bobafotz Dec 05 '15 at 20:24
  • Instead of removing all the unneeded characters, why not capture only the prices you are interested in? Are they consistent, i.e. always in the form of `12.34 €`? – Code Different Dec 05 '15 at 20:27
  • Yep they are, always the same form and unit. How could I do that ? – Bobafotz Dec 05 '15 at 20:29

1 Answer

What type of strings are you trying to match: unicode or byte strings?

Assuming you are working with unicode strings, your match could look like this:

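#!/usr/bin/python
import re

# Sample input: a unicode string with the price surrounded by whitespace.
s = u"""        39,90 \u20AC
                  """
# Capture the euros, the cents and the euro sign, skipping non-digits in between.
match = re.match(ur'\D*(\d+)\D*(\d{0,2})\D*(\u20AC)', s, re.UNICODE)
print match.groups()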

output:

(u'39', u'90', u'\u20ac')

The u prefix in front of a string literal marks it as a unicode string.

Regex explained:

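  1. \D* - any non-digit character, zero or more times
  2. (\d+) - one or more digits
  3. \D* - zero or more non-digits again (here, the comma)
  4. (\d{0,2}) - zero to two digits
  5. \D* - zero or more non-digits again (here, the space before the euro sign)
  6. (\u20AC) - the unicode euro sign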

We use \D and \d together with the re.UNICODE flag so that anything Unicode classifies as a digit (or non-digit) is matched, not just ASCII 0-9.
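
Following up on the suggestion in the comments to capture only the price, here is a minimal sketch of that idea. It assumes every price has the form "digits, comma, two digits, euro sign" with optional whitespace in between; `PRICE_RE` and `parse_price` are names made up for this example, and the sample strings are only modelled on the `repr()` output posted above. It avoids the `ur''` prefix so that it should run on both Python 2.7 and Python 3.

```python
import re
from decimal import Decimal

# Hypothetical names for illustration. The pattern assumes
# "euros , cents <euro sign>" with optional whitespace around the comma
# and before the euro sign.
PRICE_RE = re.compile(u"(\\d+)\\s*,\\s*(\\d{2})\\s*\u20ac", re.UNICODE)


def parse_price(text):
    """Return the first price found in text as a Decimal, or None."""
    match = PRICE_RE.search(text)
    if match is None:
        return None
    euros, cents = match.groups()
    return Decimal(euros + "." + cents)


# Sample data modelled on the repr() output from the comments.
scraped = [
    u"\r\n\t\t\t39, 90 \u20ac\r\n\t\t\t\t\t\t\t\t",
    u"\r\n\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t",
]
prices = [parse_price(item) for item in scraped]
print(prices)
```

The second scraped string contains no price, so it comes back as `None` and can be filtered out. Using `search` rather than `match` means the leading `\D*` is not needed.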

If you use byte strings instead, I assume they are UTF-8 encoded. Then:

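import re

# Sample input: the same price as a UTF-8 encoded byte string.
s = b"""        39,90 \xE2\x82\xAC
                  """

# Same pattern, with the euro sign written as its three UTF-8 bytes.
match = re.match(r'\D*(\d+)\D*(\d{0,2})\D*(\xE2\x82\xAC)', s)
print match.groups()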

output:

('39', '90', '\xe2\x82\xac')

"\xe2\x82\xac" is the three-byte sequence e2 82 ac, which is the UTF-8 encoding of the euro sign.
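
A quick way to check that for yourself, written so that it should run on both Python 2.7 and Python 3 (the variable names are only for illustration):

```python
# The UTF-8 encoding of the euro sign (U+20AC) is the three bytes E2 82 AC.
euro = u"\u20ac"
encoded = euro.encode("utf-8")
assert encoded == b"\xe2\x82\xac"
assert len(encoded) == 3

# Decoding those bytes gives the single unicode character back.
assert encoded.decode("utf-8") == euro
```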

A good practice is the so-called "Unicode sandwich":

  1. Decode bytes to unicode on input
  2. Work only with unicode
  3. Encode unicode to bytes on output
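
The three steps above can be sketched roughly like this. The input bytes are a made-up sample, and the output format `39,90 €` is just one possible choice; the code should run on both Python 2.7 and Python 3.

```python
import re

# Hypothetical input: UTF-8 bytes as they might arrive from a file or socket.
raw = b"\r\n\t\t\t39,90 \xe2\x82\xac\r\n"

# 1. Decode bytes to unicode on input.
text = raw.decode("utf-8")

# 2. Work only with unicode.
match = re.search(u"(\\d+)\\D*(\\d{0,2})\\D*(\u20ac)", text, re.UNICODE)
euros, cents, symbol = match.groups()
cleaned = u"{0},{1} {2}".format(euros, cents, symbol)

# 3. Encode unicode to bytes on output.
out = cleaned.encode("utf-8")
```

The `u'...'` prefixes in the `repr()` output from the comments suggest that step 1 has already happened by the time you see the scraped data, so in that case only steps 2 and 3 are left to you.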
Žilvinas Rudžionis
  • That's great, the Unicode version works fine. There are still some commas left over, but removing them will be good regex practice for me. Thank you! – Bobafotz Dec 05 '15 at 21:04