Scraping Value after Euro Symbol (Scrapy-Python)

Question

I need a selector to scrape the value after the euro symbol (\u20ac).

<Selector xpath='//*[@class="col-sm-4"]/text()' data=u'\r\n\t\t            \u20ac 30.000,00'>

I tried dozens of variations that I found here on Stack Overflow and elsewhere, but I can't get it to work.

Sites like https://regexr.com/ suggest that something like this:

response.xpath('//*[@class="col-sm-4"]/text()').re('(\u20ac).\d*.\d*.\d*')

should work, but it doesn't.

EDIT: Here is an example link to data that I would like to scrape: https://www.firmenabc.at/manfred-jungwirth-montagen_MoKY

Would appreciate help!

Michael

Hope my answer helped you. If it did, please mark the answer as correct :) — alexisdevarennes, Nov 11 '17 at 19:33
Can you provide a chunk of the elements to work with? You reveal neither the link nor sufficient resources to work on it. It's hard to answer without testing it in practice. — SIM, Nov 12 '17 at 11:14
You're right, Shahin. I added a sample link, also here: https://www.firmenabc.at/manfred-jungwirth-montagen_MoKY — Michael, Nov 12 '17 at 11:53
Thanks for updating your question to bring clarity. However, I can't find any amount on that page that is close to 30.000,00; rather, what I can see is € 150,000.00. Point me to the location of that amount first. Thanks. — SIM, Nov 12 '17 at 15:53
This firm was just an example, as was the one with € 30.000,00. So a hint for scraping the € amount of this firm is appreciated, as it would apply to every other firm on firmenabc.at. Regards — Michael, Nov 12 '17 at 19:16

alexisdevarennes · Answer 1 · 2017-11-11T19:27:44.963

Here is the regex you are looking for. If you want to match the literal text \u20ac (rather than the € character), you need to escape the backslash with another \. The variant \u20ac|\\u20ac matches both € and the literal \u20ac:

(\u20ac|\\u20ac)\s+.\d*.\d*.\d*

Also missing was a \s+. \s matches a single whitespace character, and \s+ matches one or more of them (notice there is whitespace between \u20ac and the value, 30.000,00).

Notice, though, that this will capture only the € symbol: capture groups are delimited by parentheses (), i.e. (ANYTHING BETWEEN THESE WILL BE CAPTURED).
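To illustrate the point about capture groups, here is a minimal sketch using the standard re module on the sample text node from the question. It assumes Python 3, where the re module itself understands \u20ac inside a raw string pattern; the variable names are only for illustration.

```python
import re

# Text node from the question: leading whitespace, euro sign, amount.
text = '\r\n\t\t            \u20ac 30.000,00'

# Group around the symbol: findall returns only what the group captured.
symbol_only = re.findall(r'(\u20ac)\s+[\d.,]+', text)

# Group around the amount: the symbol is matched but not returned.
amount_only = re.findall(r'\u20ac\s+([\d.,]+)', text)

print(symbol_only)  # only the euro sign
print(amount_only)  # only the amount
```

As far as I know, Scrapy's .re() returns captured groups in the same way when the pattern contains any, so moving the parentheses changes what you get back.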

So I believe what you want is:

\u20ac|\\u20ac\s+(\d*.*) - Here, we're surrounding \d*.* with (), therefore capturing that value instead of the € symbol.

Repeating .\d* is redundant: the trailing .* already matches everything that follows on the line, so the extra repetitions add nothing.

Lastly, I recommend you play around with regex using https://www.regex101.com. It's a great tool and will save you a lot of headaches.
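One caveat about the pattern \u20ac|\\u20ac\s+(\d*.*): the | operator has the lowest precedence in a regex, so without grouping it splits the whole pattern into two alternatives, and the first one (the bare euro sign) matches without ever filling the capture group. The sketch below shows this with the standard re module under Python 3; wrapping the alternation in a non-capturing group (?:...) is my suggested adjustment, not part of the original pattern.

```python
import re

text = '\r\n\t\t            \u20ac 30.000,00'

# Ungrouped: "|" separates  \u20ac  from  \\u20ac\s+(\d*.*)
ungrouped = re.compile(r'\u20ac|\\u20ac\s+(\d*.*)')
# The first alternative matches the euro sign alone, so the
# capture group never participates and comes back empty.
print(ungrouped.findall(text))

# Grouped: the alternation covers only the symbol, and the
# whitespace and amount are required after either variant.
grouped = re.compile(r'(?:\u20ac|\\u20ac)\s+(\d*.*)')
print(grouped.findall(text))
```

With the ungrouped pattern the result is a list holding one empty string; with the grouped pattern it holds the amount.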

I learned a lot from your answer, thanks for that! Unfortunately "\u20ac|\\u20ac\s+(\d*.*)" and its variations aren't working. Maybe it has something to do with the whitespace before the € symbol? (\r\n\t\t \u20ac...) — Michael, Nov 11 '17 at 20:00

score 0 · Accepted Answer · answered Nov 13 '17 at 15:52

Try this:

response.xpath('//*[@class="col-sm-4"]/text()').re(u'\u20ac\s*(\d+[\d\.,]+)')
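To check the pattern without running a spider, here is a small sketch that applies it with the standard re module to the sample text from the question and then converts the result to a number. It assumes Python 3 and assumes that amounts use '.' as the thousands separator and ',' as the decimal mark, as in the sample; parse_euro_amount is a hypothetical helper name, not part of Scrapy.

```python
import re
from decimal import Decimal

# Same pattern as in the .re() call above.
EURO_AMOUNT = re.compile(r'\u20ac\s*(\d+[\d\.,]+)')


def parse_euro_amount(text):
    """Hypothetical helper: amount after the euro sign, or None."""
    match = EURO_AMOUNT.search(text)
    if match is None:
        return None
    raw = match.group(1)
    # Assumption: '.' groups thousands and ',' marks the decimals.
    return Decimal(raw.replace('.', '').replace(',', '.'))


sample = '\r\n\t\t            \u20ac 30.000,00'
amount = parse_euro_amount(sample)
print(amount)
```

Note that \d+[\d\.,]+ needs at least two characters, so a single-digit amount such as "€ 5" would not match; if that case can occur, \d[\d.,]* may be a safer choice.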

answered Nov 13 '17 at 15:52 by Wilfredo
