
I'm using Scrapy together with Python 2.7 to do some tasks, but I'm dealing with an issue with Spanish characters like accents and ñ. The problem appears when I run a selector like:

response.xpath("//*[contains(@id, 'content')]").css('table').css('tr').css('a')

which returns, for example, the following line:

u'<a href="/C\xf3digo/7">/C\xf3digo/7</a>'

I need the content of the href to go to the next page, but the format is incorrect and Scrapy cannot make the request.
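My spider looks roughly like this (a simplified sketch; the spider name and start URL are placeholders, not my real ones):

    # -*- coding: utf-8 -*-
    import scrapy

    class MySpider(scrapy.Spider):
        name = 'myspider'  # placeholder name
        start_urls = ['https://www.somepage.com']  # placeholder URL

        def parse(self, response):
            # extract the links from the content table
            links = response.xpath("//*[contains(@id, 'content')]").css('table').css('tr').css('a').extract()
            for url in links:
                # follow each link
                yield scrapy.Request(url=url, callback=self.parse)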

I included `# -*- coding: utf-8 -*-` at the beginning of the file and I tried to use `.decode('utf-8')`, but it didn't work. Has anyone had this problem and knows how to solve it? I would be really grateful for your help.

Regards.

g4s0l1n
  • Is there an error message? – MaximTitarenko Oct 29 '17 at 11:36
  • @MaximTitarenko No, it's part of the content of the Scrapy response – g4s0l1n Oct 29 '17 at 11:37
  • Do you use Python 2.7? – MaximTitarenko Oct 29 '17 at 11:40
  • @MaximTitarenko Yes. I also tried with Python 3.6 but I get the same result – g4s0l1n Oct 29 '17 at 11:41
  • How are you telling scrapy to use that as a new request? Don't see any reason it shouldn't handle it... – Jon Clements Oct 29 '17 at 11:43
  • @Jon Clements `yield scrapy.Request(url=url, callback=self.parse)` – g4s0l1n Oct 29 '17 at 11:45
  • Sure - but exactly how are you getting `url` - your example seems to show the entire anchor element - you are just extracting out the href attribute content, right? – Jon Clements Oct 29 '17 at 11:47
  • The url is like **url='https://www.somepage.com'** and I'm scraping the whole page until I locate the href that satisfies my needs. But the whole page is full of characters like _\xf3_, _\xfa_ ... (see the snippet after this thread) – g4s0l1n Oct 29 '17 at 11:50
  • @g4s0l1n Please can you include in your question your actual *XPath*? That comment doesn't actually seem to relate to my comment at all :) – Jon Clements Oct 29 '17 at 11:51
  • @Jon Clements I edited the post, now you can see the xpath – g4s0l1n Oct 29 '17 at 11:53
  • Right... never mind I'm not sure why all the separate `.css` calls are going on there... you want to return the href element to follow... so if you change that to be `.css('a::attr(href)')` you should get the actual url to follow and it should work – Jon Clements Oct 29 '17 at 11:55
  • @Jon Clements This is not the problem, I know how to extract the href. The problem is the encoding, the whole page is encoded – g4s0l1n Oct 29 '17 at 11:59
  • Well - bearing in mind there's no code showing that - it's not exactly an unreasonable assumption given what example you've provided... What error do you get when trying to follow the page - if you're getting the anchors and hrefs correctly, the encoding shouldn't be an issue... – Jon Clements Oct 29 '17 at 12:06
  • Duplicate of: https://stackoverflow.com/questions/9181214/scrapy-text-encoding Another suggestion I want to give: pages don't have to be encoded in utf-8. Take a look at the page you are requesting and see if it contains a `<meta charset=...>` declaration. If not, it might not be possible to directly retrieve utf-8. You can anyway use the standard HTMLParser in Python 2.7 to handle the HTML entities if you cannot find a solution. – Hielke Walinga Oct 29 '17 at 12:11
  • @Jon Clements Sorry, I included this `css('a::attr(href)')` and now it works. But I don't understand well what the problem was – g4s0l1n Oct 29 '17 at 12:34
  • Looks like you weren't passing the extracted hrefs to follow but the actual anchor element text itself... Hard to say as your question still hasn't been edited with information from the comments though... – Jon Clements Oct 29 '17 at 12:43
  • Please just include minimal but sufficient code to reproduce the question. You may not realise but it is evident from the comments that you are not including enough information: https://stackoverflow.com/help/how-to-ask – de1 Oct 29 '17 at 13:42
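
As an aside to the encoding discussion in the comments: `\xf3` is simply Python's escaped representation of the character ó, so the extracted strings were never corrupted. A quick check in a Python 2.7 shell (assuming a UTF-8 terminal):

    >>> href = u'/C\xf3digo/7'   # what css('a::attr(href)') returns
    >>> print(href)              # \xf3 is just the escape for the character ó
    /Código/7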

1 Answer


Thanks to @Jon Clements I fixed it. The problem was that I was not passing the extracted hrefs to follow, but the whole anchor elements. The solution is:

response.xpath("//*[contains(@id, 'content')]").css('table').css('tr').css('a::attr(href)')
g4s0l1n
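
For completeness, the whole callback might then look something like this (a sketch; the callback name is assumed, and `response.urljoin()` is used to turn the relative hrefs into absolute URLs before following them):

    def parse(self, response):
        hrefs = response.xpath("//*[contains(@id, 'content')]").css('table').css('tr').css('a::attr(href)').extract()
        for href in hrefs:
            # each href is now a plain path such as u'/C\xf3digo/7';
            # response.urljoin() resolves it against the current page URL
            yield scrapy.Request(response.urljoin(href), callback=self.parse)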