
I don't know what I am doing wrong. I am trying to extract text and store it in a list. In Firebug/FirePath, when I enter the XPath it shows exactly the correct text, but when I apply it in my spider it returns an empty list. I am trying to scrape www.insider.in/mumbai: the spider should follow all the links and scrape the event title, address, and other information. Here is my new edited code:

from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from scrapy.selector import HtmlXPathSelector

import time
import requests
import csv


class insiderSpider(BaseSpider):
    name = 'insider'
    allowed_domains = ["insider.in"]
    start_urls = ["http://www.insider.in/mumbai/"]

    def parse(self, response):
        driver = webdriver.Firefox()
        print response.url
        driver.get(response.url)
        s = Selector(response)
        #hxs = HtmlXPathSelector(response)
        source_link = []
        temp = []

        title = ""
        Price = ""
        Venue_name = ""
        Venue_address = ""
        description = ""
        event_details = []
        alllinks = s.xpath('//div[@class="bottom-details-right"]//a/@href').extract()
        print alllinks
        length_of_alllinks = len(alllinks)
        for single_event in range(1,length_of_alllinks):
            if "https://insider.in/event" in alllinks[single_event]:
                source_link.append(alllinks[single_event])
                driver.get(alllinks[single_event])
                s = Selector(response)
                #hxs = HtmlXPathSelector(response)
                time.sleep(3)
                title = s.xpath('//div[@class = "cell-title in-headerTitle"]/h1//text()').extract()
                print title

                temp = s.xpath('//div[@class = "cell-caption centered in-header"]//h3//text()').extract()

                print temp
                time.sleep(2)
                a = len(s.xpath('//div[@class = "bold-caption price"]//text()').extract())
                if a > 0:
                    Price = s.xpath('//div[@class = "bold-caption price"]//text()').extract()

                    time.sleep(2)
                else:
                    Price = "RSVP"
                    time.sleep(2)
                print Price
                Venue_name = s.xpath('//div[@class = "address"]//div[@class = "section-title"]//text()').extract()

                print Venue_name
                Venue_address = s.xpath('//div[@class ="address"]//div//text()[preceding-sibling::br]').extract()

                print Venue_address
                description = s.xpath('//div[@class="cell-caption accordion-padding"]//text()').extract()

                print description
                time.sleep(5)
                event_details.append([title, temp, Price, Venue_name, Venue_address, description])
            else:
                print "Other part"

Edited Output:

[u'https://insider.in/weekender-music-festival-2015', u'https://insider.in/event/east-india-comedy-presents-back-benchers#', u'https://insider.in/event/art-of-story-telling', u'https://insider.in/feelings-in-india-with-kanan-gill', u'https://insider.in/event/the-tall-tales-workshop-capture-your-story', u'https://insider.in/halloween-by-the-pier-2015', u'https://insider.in/event/whats-your-story', u'https://insider.in/event/beyond-contemporary-art']
2015-08-03 12:53:29 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:60924/hub/session/f675b909-5515-41d4-a89e-d197c296023d/url {"url": "https://insider.in/event/east-india-comedy-presents-back-benchers#", "sessionId": "f675b909-5515-41d4-a89e-d197c296023d"}
2015-08-03 12:53:29 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request

[]

[]

RSVP

[]

[]

[]
[[[], [], 'RSVP', [], [], []]]

Even the if condition fails and it prints RSVP. I don't understand what I am doing wrong; I have been stuck on this part for three days. Please help.
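For what it's worth, the likely culprit in the code above is that s = Selector(response) is rebuilt from the original Scrapy response object inside the loop, so every XPath query runs against the start page's HTML, no matter which URL driver has navigated to; the page Selenium actually loaded is only reachable through driver.page_source. A minimal stand-in sketch of that difference, using xml.etree.ElementTree in place of Scrapy's Selector, with made-up HTML:

```python
from xml.etree import ElementTree

def extract_title(html):
    # stand-in for s.xpath('//h1//text()'): parse whatever HTML we were given
    root = ElementTree.fromstring(html)
    node = root.find(".//h1")
    return node.text if node is not None else None

# made-up pages standing in for response.body and driver.page_source
start_page = "<html><body><h1>Mumbai listings</h1></body></html>"
event_page = "<html><body><h1>Back Benchers</h1></body></html>"

# re-building the selector from the original response keeps returning
# the start page's content no matter where the browser has navigated...
assert extract_title(start_page) == "Mumbai listings"
# ...while parsing the HTML the browser actually loaded gives the event page
assert extract_title(event_page) == "Back Benchers"
```

In the Selenium version that would mean something like s = Selector(text=driver.page_source) after each driver.get(...), instead of Selector(response).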

Arpit Agarwal

1 Answer


I removed things like webdriver and got a basic Scrapy-only version that works:

import scrapy
import logging
from scrapy.http import Request
from scrapy.selector import Selector

class insiderSpider(scrapy.Spider):
    name = 'insider'
    allowed_domains = ["insider.in"]
    start_urls = ["http://www.insider.in/mumbai/"]
    event_details = list() # Changed: event_details is now member data of the class, so it persists across callbacks

    def parse(self, response):
        source_link = []
        temp = []
        title = ""
        Price = ""
        Venue_name = ""
        Venue_address = ""
        description = ""
        alllinks = response.xpath('//div[@class="bottom-details-right"]//a/@href').extract()
        print alllinks
        for single_event in alllinks:
            if "https://insider.in/event" in single_event:
                yield Request(url = single_event, callback = self.parse_event)
            else:
                print 'Other part'

    def parse_event(self, response):
        title = response.xpath('//div[@class = "cell-title in-headerTitle"]/h1//text()').extract()
        print title
        temp = response.xpath('//div[@class = "cell-caption centered in-header"]//h3//text()').extract()
        print temp
        a = len(response.xpath('//div[@class = "bold-caption price"]//text()').extract())
        if a > 0:
            Price = response.xpath('//div[@class = "bold-caption price"]//text()').extract()
        else:
            Price = "RSVP"
        print Price
        Venue_name = response.xpath('normalize-space(//div[@class = "address"]//div[@class = "section-title"]//text())').extract()

        print Venue_name
        Venue_address = response.xpath('normalize-space(//div[@class ="address"]//div//text()[preceding-sibling::br])').extract()

        print Venue_address
        description = response.xpath('normalize-space(//div[@class="cell-caption accordion-padding"]//text())').extract()

        print description
        self.event_details.append([title, temp, Price, Venue_name, Venue_address, description]) # Note that event_details is used as self.event_details, i.e. as member data
        print self.event_details # here also self.event_details
sudo bangbang
  • Yes, it is working, but at the end, when I print the event details, it only shows me the details of the event page that was executed last. It's a list, so it should show all the events, or correct me if I am wrong. – Arpit Agarwal Aug 03 '15 at 13:14
  • Sorry, that was an awful mistake on my side; I should have implemented event_details as member data of the class. I'll fix it right away. Thanks for pointing it out. – sudo bangbang Aug 03 '15 at 13:23
  • The reason the list didn't get populated with everything the first time is that event_details was defined in the scope of the function parse_event. – sudo bangbang Aug 03 '15 at 13:39
  • I also have to write a second parse function to handle that else "other part". – Arpit Agarwal Aug 03 '15 at 14:01
  • @sarahjones It's beside the point, but why do you use length_of_alllinks = len(alllinks) and for single_event in range(1, length_of_alllinks):? Is it to skip the first element of the extracted links? – sudo bangbang Aug 03 '15 at 14:01
  • Yes, it was to skip the first link. But no worries: in the else part I will visit that link again and extract all the links until they match https://insider.in/event – Arpit Agarwal Aug 03 '15 at 14:04
  • for url in alllinks[1:]: I think this is a better way to do it; it's more pythonic. – sudo bangbang Aug 03 '15 at 14:05
  • Thank you for the insight. I don't have much reputation, so I don't have the privilege to communicate privately. But is there any way to contact you if I have doubts about Python and crawling? – Arpit Agarwal Aug 03 '15 at 14:08
  • The event list is still not getting populated. After every run it just shows me the details of one event only. – Arpit Agarwal Aug 04 '15 at 07:09
  • I've added comments where I have changed the code. Try changing your code accordingly. I've changed the lines where event_details is used, so you can search for them and adjust. – sudo bangbang Aug 04 '15 at 08:06
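The for url in alllinks[1:] suggestion from the comments, next to the index-based loop it replaces; the URLs here are illustrative, not taken from the actual page:

```python
alllinks = ["https://insider.in/mumbai/",   # first extracted link: the listing page itself
            "https://insider.in/event/a",
            "https://insider.in/event/b"]

# index-based loop that skips element 0, as in the question
by_index = [alllinks[i] for i in range(1, len(alllinks))]

# slicing off the first element directly
by_slice = alllinks[1:]

assert by_index == by_slice == ["https://insider.in/event/a", "https://insider.in/event/b"]
```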