
I'm having a nightmare with data scraped with Scrapy. Currently I encode it using UTF-8, e.g. `detail_content.select('p/text()[1]').extract()[0].encode('utf-8')`, save it into a JSON file, and then the captured text is displayed again by Django and a mobile app.

In the JSON file the non-ASCII characters get escaped as unicode sequences: 'blah blah \u00a34,000 blah'

Now my problem is that when I try to display the text in a Django template or the mobile app, the literal escape sequence is displayed: \u00a3 instead of £.

Should I not be storing escaped unicode in JSON? Would it be better to store ASCII in the JSON file using the JSON escaping? If so, how do you go about doing this with Scrapy?
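For reference, here's a minimal illustration (plain Python's `json` module, no Scrapy involved) of the two representations I mean:

```python
import json

data = {u'price': u'\u00a34,000'}  # contains a pound sign

# Default: json.dumps escapes non-ASCII as \uXXXX, so output is pure ASCII
escaped = json.dumps(data)
print(escaped)  # {"price": "\u00a34,000"}

# ensure_ascii=False keeps the character literal (encode to UTF-8 when writing)
literal = json.dumps(data, ensure_ascii=False)

# Both forms are valid JSON and decode back to the same data
assert json.loads(escaped) == json.loads(literal) == data
```

Note that any compliant JSON parser turns \u00a3 back into £ on loading, so if the raw escape sequence shows up in the browser, the JSON is presumably being treated as plain text somewhere rather than parsed.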

Scrapy code:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from scrapy.item import Item, Field
import datetime
import unicodedata
import re

class Spider(BaseSpider):
    #spider stuff

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        rows = hxs.select('//ul[@class = "category3"]/li')
        for row in rows:
            item = Item()
            if len(row.select('div[2]/a/text()').extract()) > 0:
                item['header'] = str(row.select('div[2]/a/text()')
                                    .extract()[0].encode('utf-8'))
            else:
                item['header'] = ''
            if len(row.select('div[2]/a/text()').extract()) > 0:
                item['_id'] = str(row.select('div[2]/a/text()')
                                    .extract()[0].encode('utf-8'))
            else:
                item['_id'] = ''
            item['_id'] = self.slugify(item['_id'])[0:20]
            item_url = row.select('div[2]/a/@href').extract()
            today = datetime.datetime.now().isoformat()
            item['dateAdded'] = str(today)
            yield Request(item_url[0], meta={'item' : item},
                             callback=self.parse_item)

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        detail_content = hxs.select('//*[@id="content-area"]')
        item = response.request.meta['item']   
        item['description'] = str(detail_content.select('p/text()[1]')
                                                        .extract()[0])
        item['itemUrl'] = str(detail_content.select('//a[@title="Blah"]/@href')
                                                                 .extract()[0])
        item['image_urls'] = (detail_content.select('//img[@width="418"]/../@href')
                                            .extract())
        print item
        return item
KingFu
  • Have you tried without `encode('utf-8')`? Other question, what's the output of: `detail_content.select('p/text()[1]').extract()[0]`. I mean, is it `u'blah blah'` or just `'blah blah'` – Paulo Bu Jun 20 '13 at 13:17
  • Also, how are you outputting the json in the templates? – Paulo Bu Jun 20 '13 at 13:18
  • Yeah, tried without `encode('utf-8')`; I receive errors like: `exceptions.UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 127: ordinal not in range(128)`. I'm outputting the text in the Django templates in the regular way, i.e. `{{ item.description }}`; autoescaping on or off makes no difference – KingFu Jun 20 '13 at 13:20
  • Can you post more background of your code. Normally, if you're encoding in `utf-8` the string \u00a3 shouldn't appear in the `.json` file. – Paulo Bu Jun 20 '13 at 13:27

1 Answer


OK, this I find very odd:

item['header'] = str(row.select('div[2]/a/text()')
                     .extract()[0].encode('utf-8'))

It is not correct to do `str(<some_value>.encode('utf-8'))`. In Python 2 the `str()` wrapper is redundant at best, since `encode('utf-8')` already returns a byte string. Worse, where you call `str()` directly on a unicode object (as in `parse_item`), Python implicitly encodes it as ASCII, which raises an error as soon as a character falls outside the 0-127 range.

Now, I strongly believe you're getting the characters from Scrapy already as unicode.

I receive errors like: exceptions.UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 127: ordinal not in range(128)

So, my suggestion is to change the code to this:

item['header'] = (row.select('div[2]/a/text()')
                  .extract()[0].encode('utf-8'))

Just remove the `str()` call. This takes the unicode received from Scrapy and turns it into UTF-8. Once it is in UTF-8, be careful with string operations. Normally this conversion from unicode to a specific encoding should be done just before writing to disk.
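To make the boundary explicit, here is the pattern in plain Python (nothing Scrapy-specific):

```python
# Keep text as a unicode object internally; encode only at the output boundary.
text = u'blah blah \u00a34,000 blah'   # what Scrapy hands you: a unicode object

raw = text.encode('utf-8')             # bytes, produced just before writing
assert raw.decode('utf-8') == text     # the round trip is lossless
```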

Note that you have this kind of code in two places. Modify them both.

UPDATE: Take a look at this, might be helpful: scrapy text encoding
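The approach in that link boils down to a custom item pipeline along these lines (the class name and output filename here are illustrative, not from your project). It writes the JSON itself with `ensure_ascii=False`, so characters like £ are stored literally instead of as \uXXXX escapes:

```python
import codecs
import json

class JsonWithEncodingPipeline(object):
    """Sketch of a pipeline that writes items as UTF-8 JSON lines."""

    def open_spider(self, spider):
        # codecs.open handles the unicode -> utf-8 encoding on write
        self.file = codecs.open('items.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # ensure_ascii=False keeps non-ASCII characters literal in the file
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + u'\n')
        return item

    def close_spider(self, spider):
        self.file.close()
```

Enable it via `ITEM_PIPELINES` in your settings and drop the `-o items.json -t json` flags, since the pipeline now does the writing.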

Hope this helps!

Paulo Bu
  • You're right about the str() it's not required. Not sure why I was using it so thank you! However I've still got the same problem. There is no unicode escape characters in the original HTML. Scrapy looks like it's taking the HTML and converts it into UTF-8 before anything else happens – KingFu Jun 20 '13 at 14:19
  • Sorry, I didn't understand at first. So there's no unicode escaped char in the original HTML. What Scrapy may be doing is converting it to unicode but not `utf-8` – Paulo Bu Jun 20 '13 at 14:30
  • Do you still get \u00a3 in your .json file? – Paulo Bu Jun 20 '13 at 14:35
  • Just checked, yes it's unicode not utf-8, sorry, I only assumed it was utf-8 as I've got that in my code. Yes, still get \u00a3. All the non-standard characters get escaped \uXXXX – KingFu Jun 20 '13 at 14:36
  • Can you post how do you write the `.json` file? – Paulo Bu Jun 20 '13 at 14:38
  • `scrapy crawl myspider -o items.json -t json` – KingFu Jun 20 '13 at 14:39
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/32084/discussion-between-paulo-bu-and-kingfu) – Paulo Bu Jun 20 '13 at 14:43
  • the link you posted has the solution, creating custom pipeline – KingFu Jun 20 '13 at 15:09