1

I need a selector to scrape the value after the euro symbol (\u20ac).

<Selector xpath='//*[@class="col-sm-4"]/text()' data=u'\r\n\t\t            \u20ac 30.000,00'>

I tried dozens of variations that I found here on Stack Overflow and elsewhere, but I can't get it to work.

Sites like https://regexr.com/ show me that something like this:

response.xpath('//*[@class="col-sm-4"]/text()').re('(\u20ac).\d*.\d*.\d*')

should work, but it doesn't.

EDIT: Here is an example link to the data that I would like to scrape: https://www.firmenabc.at/manfred-jungwirth-montagen_MoKY

I would appreciate any help!

Michael

  • Hope my answer helped you. If it did, please mark the answer as correct :) – alexisdevarennes Nov 11 '17 at 19:33
  • Can you provide a chunk of elements to work with? You have given neither the link nor enough material to work on. It's hard to answer without testing it in practice. – SIM Nov 12 '17 at 11:14
  • You're right, Shahin. I added a sample link to the question, also here: https://www.firmenabc.at/manfred-jungwirth-montagen_MoKY – Michael Nov 12 '17 at 11:53
  • Thanks for updating your question to make it clearer. However, I can't find any amount on that page close to 30.000,00; rather, what I can see is € 150,000.00. Please point me to the location of that amount first. Thanks. – SIM Nov 12 '17 at 15:53
  • This firm was just an example, as was the one with € 30.000,00, so a hint for scraping the € amount of this firm is appreciated, as it would be for every other firm on firmenabc.at. Regards – Michael Nov 12 '17 at 19:16

2 Answers

1

Here is the regex you are looking for. If you want to match the text \u20ac literally (backslash included), you need to escape the backslash with another \. The following variant, \u20ac|\\u20ac, matches both € and the literal \u20ac:

(\u20ac|\\u20ac)\s+.\d*.\d*.\d*

A \s+ was also missing. \s matches a single whitespace character, and \s+ matches one or more of them (notice there is whitespace between \u20ac and the value, 30.000,00).

Notice, though, that this will capture only the symbol. A capture group is delimited by a pair of parentheses (), i.e. (ANYTHING BETWEEN THESE WILL BE CAPTURED).
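As a minimal sketch of that capture-group behaviour, here is a check with Python's standard re module against the sample string from the question. The variable names are illustrative only, and the second pattern is just one possible way to move the parentheses onto the amount:

```python
import re

EURO = "\u20ac"
# Sample text copied from the Selector output in the question.
text = "\r\n\t\t            " + EURO + " 30.000,00"

# Parentheses around the symbol: findall returns only the symbol.
symbol_only = re.findall("(" + EURO + r")\s+.\d*.\d*.\d*", text)

# Parentheses around the digits and separators: findall returns the amount.
amount_only = re.findall(EURO + r"\s+([\d.,]+)", text)

print(symbol_only)
print(amount_only)
```

When a pattern contains exactly one group, re.findall returns what that group matched rather than the whole match, so where the parentheses sit decides what comes back.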

So I believe what you want is:

\u20ac|\\u20ac\s+(\d*.*) - Here, we're surrounding \d*.* with (), therefore capturing that value instead of the symbol.

Repeating .\d* is redundant: you already indicated that you want to match every occurrence of it by writing \d and suffixing it with a *.
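One thing to watch with the \u20ac|\\u20ac\s+(\d*.*) pattern: | has the lowest precedence in a regex, so without grouping it splits the whole pattern in two and the capture group belongs only to the right-hand branch. The sketch below, which uses plain Python re rather than Scrapy and a non-capturing group (?:...) as one possible fix, shows the difference on the sample string from the question:

```python
import re

EURO = "\u20ac"
text = "\r\n\t\t            " + EURO + " 30.000,00"

# Ungrouped: the left branch is just the symbol, so it matches on its own
# and the capture group (part of the right branch) never takes part.
ungrouped = re.search(EURO + r"|\\u20ac\s+(\d*.*)", text)

# Grouped: (?:...) limits the alternation to the two spellings of the symbol,
# so the whitespace and the capture group apply to both.
grouped = re.search("(?:" + EURO + r"|\\u20ac)\s+(\d*.*)", text)

print(ungrouped.group(1))
print(grouped.group(1))
```

With the ungrouped pattern the match succeeds but group 1 is empty, which may explain why this variant appears not to work.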

Lastly, I recommend you play around with regex using https://www.regex101.com. It's a great tool and will save you a lot of headaches.

alexisdevarennes
  • I learned a lot from your answer, thanks for that! Unfortunately, "\u20ac|\\u20ac\s+(\d*.*)" and variations of it aren't working. Maybe it has something to do with the whitespace before the € symbol? (\r\n\t\t \u20ac...) – Michael Nov 11 '17 at 20:00
  • Try this ``\s+(\d*.*)`` – alexisdevarennes Nov 11 '17 at 20:42
0

Try this:

response.xpath('//*[@class="col-sm-4"]/text()').re(u'\u20ac\s*(\d+[\d\.,]+)')
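The same pattern can be checked outside Scrapy with plain re. The sketch below also turns the captured text into a number; the helper name parse_euro_amount is hypothetical, and it assumes the format shown in the question (dot as thousands separator, comma as decimal separator), which may not hold for every page:

```python
import re
from decimal import Decimal

# Same pattern as in the .re() call above.
AMOUNT_RE = re.compile("\u20ac" + r"\s*(\d+[\d.,]+)")

def parse_euro_amount(text):
    """Return the amount after the euro sign as a Decimal, or None if absent."""
    match = AMOUNT_RE.search(text)
    if match is None:
        return None
    # Assumption: "." groups thousands and "," marks decimals (e.g. 30.000,00).
    normalized = match.group(1).replace(".", "").replace(",", ".")
    return Decimal(normalized)

sample = "\r\n\t\t            \u20ac 30.000,00"
print(parse_euro_amount(sample))
```

In a spider, the string returned by .re() or .re_first() could be normalized in the same way.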
Wilfredo