I am trying to scrape the following site: www.firstcry.com. The website uses AJAX (in the form of XHR requests) to display its search results.

If you look at my code, the jsonresponse variable contains the JSON output of the website. When I try to print it, it contains many \ (backslashes).

Just below the jsonresponse variable, I have commented out several lines. Those were my attempts (made after reading several similar questions here on Stack Overflow) to remove all the backslashes, as well as the u' prefixes that were also present.

But after all those tries, I am still unable to remove ALL the backslashes and u' prefixes.

If I don't remove them, I am not able to access jsonresponse by its keys, so it is essential for me to remove ALL of them.

Please help me resolve this issue. It would be better if you could provide code for my particular case, rather than a general answer!

My code is here:

from twisted.internet import reactor
from scrapy.crawler import CrawlerProcess, CrawlerRunner
import scrapy
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings
from scrapy.settings import Settings
import datetime
from multiprocessing import Process, Queue
import os
from scrapy.http import Request
from scrapy import signals
from scrapy.xlib.pydispatch import dispatcher
from scrapy.signalmanager import SignalManager
import json , simplejson , ujson

#query=raw_input("Enter a product to search for= ")
query='bag'
query1=query.replace(" ", "+")  


class DmozItem(scrapy.Item):

    productname = scrapy.Field()
    product_link = scrapy.Field()
    current_price = scrapy.Field()
    mrp = scrapy.Field()
    offer = scrapy.Field()
    imageurl = scrapy.Field()
    outofstock_status = scrapy.Field()

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["http://www.firstcry.com"]


    def start_requests(self):

        task_urls = [
        ]
        i=1
        for i in range(1,2):
            temp = "http://www.firstcry.com/svcs/search.svc/GetSearchPagingProducts_new?PageNo=" + str(i) + "&PageSize=20&SortExpression=Relevance&SubCatId=&BrandId=&Price=&OUTOFSTOCK=&DISCOUNT=&Q=" + query1 + "&rating="
            task_urls.append(temp)
            i=i+1

        start_urls = (task_urls)
        p=len(task_urls)
        return [ Request(url = start_url) for start_url in start_urls ]


    def parse(self, response):
        print response

        items = []
        jsonresponse = dict(ujson.loads(response.body_as_unicode()))
#       jsonresponse = jsonresponse.replace("\\","")
#       jsonresponse = jsonresponse.decode('string_escape')
#       jsonresponse = ("%r" % json.loads(response.body_as_unicode()))
#       d= jsonresponse.json()
        #jsonresponse = jsonresponse.strip("/")
#       print jsonresponse
#       print d
#       print json.dumps("%r" % jsonresponse, indent=4, sort_keys=True)
#       a = simplejson.dumps(simplejson.loads(response.body_as_unicode()).replace("u\'","\'"), indent=4, sort_keys=True)
        #a= json.dumps(json.JSONDecoder().decode(jsonresponse))
        #a = ujson.dumps((ujson.loads(response.body_as_unicode())) , indent=4 )
        a=json.dumps(jsonresponse, indent=4)
        a=a.decode('string_escape')
        a=(a.decode('string_escape'))
#       a.gsub('\\', '')
        #a = a.strip('/')
        #print (jsonresponse)
        print a
        #print "%r" % a
#       print "%r" % json.loads(response.body_as_unicode())

        p=(jsonresponse["hits"])["hit"]
#       print p
#       raw_input()
        for x in p:
            item = DmozItem()
            item['productname'] = str(x['title'])
            item['product_link'] = "http://www.yepme.com/Deals1.aspx?CampId="+str(x["uniqueId"])
            item['current_price']='Rs. ' + str(x["price"])

            try:            
                p=x["marketprice"]
                item['mrp'] = 'Rs. ' + str(p)

            except:
                item['mrp'] = item['current_price']

            try:            
                item['offer'] = str(x["promotionalMsg"])
            except:
                item['offer'] = str('No additional offer available')

            item['imageurl'] = "http://staticaky.yepme.com/newcampaign/"+str(x["uniqueId"])[:-1]+"/"+str(x["smallimage"])
            item['outofstock_status'] = str('In Stock')
            items.append(item)

        print (items)

spider1 = DmozSpider()
settings = Settings()
settings.set("PROJECT", "dmoz")
settings.set("CONCURRENT_REQUESTS" , 100)
#)
settings.set( "DEPTH_PRIORITY" , 1)
settings.set("SCHEDULER_DISK_QUEUE" , "scrapy.squeues.PickleFifoDiskQueue")
settings.set( "SCHEDULER_MEMORY_QUEUE" , "scrapy.squeues.FifoMemoryQueue")
crawler = CrawlerProcess(settings)
crawler.crawl(spider1)
crawler.start()
– Ashutosh Saboo

1 Answer

No need to get all complicated. Instead of using ujson and response.body_as_unicode() and then casting that into a dict, just use regular json and response.body:

$ scrapy shell "http://www.firstcry.com/svcs/search.svc/GetSearchPagingProducts_new?PageNo=1&PageSize=20&SortExpression=Relevence&SubCatId=&BrandId=&Price=&OUTOFSTOCK=&DISCOUNT=&Q=bag&rating="
...
>>> jsonresponse = json.loads(response.body)
>>> jsonresponse.keys()
[u'ProductResponse']

This worked just fine for me with your example. Looks like you're a bit deep into the "hacking around for an answer" mode ;)
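Outside the shell, the same single decode replaces all the string_escape attempts. A minimal sketch — the body below is a made-up stand-in with the one top-level key the real response has:

```python
import json

# Decode the raw body once with the stdlib json module instead of
# re-escaping it. "ProductResponse" is the real top-level key; the
# value here is a dummy placeholder.
body = b'{"ProductResponse": "{}"}'

jsonresponse = json.loads(body)    # json.loads accepts bytes on Python 3.6+
print(list(jsonresponse.keys()))   # -> ['ProductResponse']
```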

I'll note that this line...

p=(jsonresponse["hits"])["hit"]

... won't work as written. The only key available in jsonresponse after parsing is "ProductResponse". That key's value is itself a JSON-encoded string (which is where all the escaped backslashes came from), so you decode it a second time:

>>> product_response = json.loads(jsonresponse['ProductResponse'])
>>> product_response['hits']['hit']
[{u'fields': {u'_score': u'56.258633',
    u'bname': u'My Milestones',
    u'brandid': u'450',
...

I think that will give you what you were looking to get in your p variable.
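The double decode can be sketched self-contained. The field names ("ProductResponse", "hits", "hit", "fields") follow the response shape above; the sample values are invented for illustration:

```python
import json

# The outer object embeds the product data as a JSON-encoded *string*,
# which is why the raw body appears full of backslashes.
body = '{"ProductResponse": "{\\"hits\\": {\\"hit\\": [{\\"fields\\": {\\"title\\": \\"bag\\"}}]}}"}'

outer = json.loads(body)                       # first decode: outer object
inner = json.loads(outer['ProductResponse'])   # second decode: embedded string
print(inner['hits']['hit'][0]['fields']['title'])  # -> bag
```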

– JoeLinux
  • Haha! Maybe you said that right, I wanted to reach a solution! Just a simple doubt: the " u' " that exists in all the keys of product_response['hits']['hit'] — won't it cause an issue when I want to access them? I.e., product_response["hits"]["hit"]["fields"] won't work, right? That is the reason why I wanted to remove all the " u' " as well! So, please help me remove those too! – Ashutosh Saboo Jul 09 '15 at 13:15
  • The "u" just indicates that the string is a Unicode string. Those u's are not actually going to show up anywhere. The only part of the string that represents the content is enclosed within the quotes. Is there a specific reason you want to remove the u's? – JoeLinux Jul 09 '15 at 13:24
  • Nope. Actually I thought I couldn't access the keys if the u's were present. But given your explanation, that won't be a problem! So that's great for me! I'll just try it out in my program and then I'll mark your answer! Thank you for all the help @JoeLinux! :) :D – Ashutosh Saboo Jul 09 '15 at 14:33
  • On a side note, since you didn't know about the u's, I'm guessing you're bound to run into a Unicode error at some point while developing Python, so take a look through this presentation: http://farmdev.com/talks/unicode/. That will help you figure out how to properly navigate Unicode strings in Python. I'm not a fan of slides but that link is pretty good. – JoeLinux Jul 09 '15 at 14:39
  • Awesome! Thanks a lot! By the way, could you help me with this question as well - http://stackoverflow.com/questions/31309631/twisted-python-failure-scrapy-issues# . It would be great of you, if you help me with this one too! Thanks a lot! :) – Ashutosh Saboo Jul 09 '15 at 17:46
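As the comments explain, the u prefix is purely how the REPL displays a unicode string; it is not part of the key. A quick check on a small decoded object:

```python
import json

# A plain string literal matches a unicode key directly, so nothing
# needs to be stripped before indexing into the parsed response.
data = json.loads('{"hits": {"hit": []}}')
print(data[u'hits'] == data['hits'])  # -> True
print(u'hits' == 'hits')              # -> True
```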