
I'm using Scrapy together with Python 2.7 to do some tasks, but I'm dealing with an issue with Spanish characters like accents and ñ. The problem appears when I run a selector like:

response.xpath("//*[contains(@id, 'content')]").css('table').css('tr').css('a')

which returns, for example, the following line:

u'<a href="/C\xf3digo/7">/C\xf3digo/7</a>'

I need the content of the href to go to the next page, but the format is incorrect and Scrapy cannot make the request.
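My spider looks roughly like this (a simplified sketch; the spider name and start URL are placeholders, not my real ones):

    # -*- coding: utf-8 -*-
    import scrapy

    class MySpider(scrapy.Spider):
        name = 'myspider'  # placeholder name
        start_urls = ['https://www.somepage.com']  # placeholder URL

        def parse(self, response):
            # extract the links from the content table
            links = response.xpath("//*[contains(@id, 'content')]").css('table').css('tr').css('a').extract()
            for url in links:
                # follow each link
                yield scrapy.Request(url=url, callback=self.parse)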

I included `# -*- coding: utf-8 -*-` at the beginning of the file and I tried to use `.decode('utf-8')`, but it didn't work. Has anyone had this problem and knows how to solve it? I would be really grateful for your help.

Regards.

g4s0l1n
  • Is there an error message? – MaximTitarenko Oct 29 '17 at 11:36
  • @MaximTitarenko No, it's part of the content of the Scrapy response – g4s0l1n Oct 29 '17 at 11:37
  • Do you use Python 2.7? – MaximTitarenko Oct 29 '17 at 11:40
  • @MaximTitarenko Yes. I also tried with Python 3.6 but I get the same result – g4s0l1n Oct 29 '17 at 11:41
  • How are you telling scrapy to use that as a new request? Don't see any reason it shouldn't handle it... – Jon Clements Oct 29 '17 at 11:43
  • @Jon Clements `yield scrapy.Request(url=url, callback=self.parse)` – g4s0l1n Oct 29 '17 at 11:45
  • Sure - but exactly how are you getting `url` - your example seems to show the entire anchor element - you are just extracting out the href attribute content, right? – Jon Clements Oct 29 '17 at 11:47
  • The url is like **url='https://www.somepage.com'** and I'm scraping the whole page until I locate the href that satisfies my needs. But the whole page is full of characters like _\xf3_, _\xfa_ ... (see the snippet after this thread) – g4s0l1n Oct 29 '17 at 11:50
  • @g4s0l1n Please can you include in your question your actual *XPath*? That comment doesn't actually seem to relate to my comment at all :) – Jon Clements Oct 29 '17 at 11:51
  • @Jon Clements I edited the post, now you can see the xpath – g4s0l1n Oct 29 '17 at 11:53
  • Right... never mind I'm not sure why all the separate `.css` calls are going on there... you want to return the href element to follow... so if you change that to be `.css('a::attr(href)')` you should get the actual url to follow and it should work – Jon Clements Oct 29 '17 at 11:55
  • @Jon Clements This is not the problem, I know how to extract the href. The problem is the encoding, the whole page is encoded – g4s0l1n Oct 29 '17 at 11:59
  • Well - bearing in mind there's no code showing that - it's not exactly an unreasonable assumption given what example you've provided... What error do you get when trying to follow the page - if you're getting the anchors and hrefs correctly, the encoding shouldn't be an issue... – Jon Clements Oct 29 '17 at 12:06
  • Duplicate of: https://stackoverflow.com/questions/9181214/scrapy-text-encoding Another suggestion I want to give: pages don't have to be encoded in utf-8. Take a look at the page you are requesting and see if it contains a `<meta charset=...>` declaration. If not, it might not be possible to directly retrieve utf-8. You can anyway use the standard HTMLParser in Python 2.7 to handle the HTML entities if you cannot find a solution. – Hielke Walinga Oct 29 '17 at 12:11
  • @Jon Clements Sorry, I included this `css('a::attr(href)')` and now it works. But I don't understand well what the problem was – g4s0l1n Oct 29 '17 at 12:34
  • Looks like you weren't passing the extracted hrefs to follow but the actual anchor element text itself... Hard to say as your question still hasn't been edited with information from the comments though... – Jon Clements Oct 29 '17 at 12:43
  • Please just include minimal but sufficient code to reproduce the question. You may not realise but it is evident from the comments that you are not including enough information: https://stackoverflow.com/help/how-to-ask – de1 Oct 29 '17 at 13:42
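
As an aside to the encoding discussion in the comments: `\xf3` is simply Python's escaped representation of the character ó, so the extracted strings were never corrupted. A quick check in a Python 2.7 shell (assuming a UTF-8 terminal):

    >>> href = u'/C\xf3digo/7'   # what css('a::attr(href)') returns
    >>> print(href)              # \xf3 is just the escape for the character ó
    /Código/7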

1 Answer


Thanks to @Jon Clements I fixed it. The problem was that I was not passing the extracted hrefs to follow, but the whole anchor elements. The solution is:

response.xpath("//*[contains(@id, 'content')]").css('table').css('tr').css('a::attr(href)')
g4s0l1n
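
For completeness, the whole callback might then look something like this (a sketch; the callback name is assumed, and `response.urljoin()` is used to turn the relative hrefs into absolute URLs before following them):

    def parse(self, response):
        hrefs = response.xpath("//*[contains(@id, 'content')]").css('table').css('tr').css('a::attr(href)').extract()
        for href in hrefs:
            # each href is now a plain path such as u'/C\xf3digo/7';
            # response.urljoin() resolves it against the current page URL
            yield scrapy.Request(response.urljoin(href), callback=self.parse)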