
Hello, I am new to programming and Scrapy. While learning Scrapy I tried to scrape some items, but I cannot scrape the items on the next page. Please help me work out how to parse the next-page link URL for this website.

Here is my code:

import scrapy
from scrapy.linkextractors import LinkExtractor 



class BdJobs(scrapy.Spider):
    name = 'jobs'
    allowed_domains = ['jobs.com']
    start_urls = ['http://jobs.com/']
    #rules=( Rule(LinkExtractor(allow()), callback='parse', follow=True))

    def parse(self, response):
        for title in response.xpath('//div[@class="job-title-text"]/a'):
            yield {
                'titles': title.xpath('./text()').extract()[0].strip()
            }

I do not know how to get the nextPageLink. To grab the next URL, here is the Inspect Element screenshot:
https://08733078838609164420.googlegroups.com/attach/58c611bdb536b/bdjobs.png?part=0.1&view=1&vt=ANaJVrEDQr4PODzoOkFRO_fLhL2ZF3x-Mts4XJ8m8qb2RSX1b4n6kv0E-62A2yvw0HkBjrmUOwCrFpMBk_h8UYSWDO6hZXyt-N2brbcYwtltG-A6NiHeaGc

Here is the output:


{"titles": "Senior Software Engineer (.Net)"},
{"titles": "Java programmer"},
{"titles": "VLSI Design Engineer (Japan)"},
{"titles": "Assistant Executive (Computer Lab-Evening programs)"},
{"titles": "IT Officer, Business System Management"},
{"titles": "Executive, IT"},
{"titles": "Officer, IT"},
{"titles": "Laravel PHP Developer"},
{"titles": "Executive - IT (EDISON Footwear)"},
{"titles": "Software Engineer (PHP/ MySQL)"},
{"titles": "Software Engineer [Back End]"},
{"titles": "Full Stack Developer"},
{"titles": "Mobile Application Developer (iOS/ Android)"},
{"titles": "Head of IT Security Operations"},
{"titles": "Database Administrator, Senior Analyst"},
{"titles": "Infrastructure Delivery Senior Analyst, Network Security"},
{"titles": "Head of IT Support Operations"},
{"titles": "Hardware Engineer"},
{"titles": "JavaScript/ Coffee Script Programmer"},
{"titles": "Trainer - Auto CAD"},
{"titles": "ASSISTENT PRODUCTION OFFICER"},
{"titles": "Customer Relationship Executive"},
{"titles": "Head of Sales"},
{"titles": "Sample Master"},
{"titles": "Manager/ AGM (Finance & Accounts)"},
{"titles": "Night Aiditor"},
{"titles": "Officer- Poultry"},
{"titles": "Business Analyst"},
{"titles": "Sr. Executive - Sales & Marketing (Sewing Thread)"},
{"titles": "Civil Engineer"},
{"titles": "Executive Director-HR"},
{"titles": "Sr. Executive (MIS & Internal Audit)"},
{"titles": "Manager, Health & Safety"},
{"titles": "Computer Engineer (Diploma)"},
{"titles": "Sr. Manager/ Manager, Procurement"},
{"titles": "Specialist, Content"},
{"titles": "Manager, Warranty and Maintenance"},
{"titles": "Asst. Manager - Compliance"},
{"titles": "Officer/Sr. Officer/Asst. Manager (Store)"},
{"titles": "Manager, Maintenance (Sewing)"}
Samsul Islam

1 Answer


Do not use start_urls; it is confusing.

Use the start_requests method instead. Scrapy calls it as soon as the spider starts.

import scrapy
from scrapy import Request


class BdJobs(scrapy.Spider):
    name = 'bdjobs'
    allowed_domains = ['bdjobs.com']

    def start_requests(self):
        # Called once when the spider starts; yields the initial requests.
        urls = ['http://jobs.bdjobs.com/', 'http://jobs.bdjobs.com/jobsearch.asp?fcatId=8&icatId=']

        for url in urls:
            yield Request(url, self.parse_detail_page)

    def parse_detail_page(self, response):
        for title in response.xpath('//div[@class="job-title-text"]/a'):
            yield {
                'titles': title.xpath('./text()').extract()[0].strip()
            }

        # TODO: extract the next-page link here and assign it to nextPageLink.
        nextPageLink = None

        # Only follow the link once it has actually been found.
        if nextPageLink:
            yield Request(nextPageLink, self.parse_detail_page)

Note that you still have to scrape the next-page link yourself and store it in nextPageLink.

Umair Ayub
  • Thanks for your answer. But how do I grab nextPageLink from Inspect Element (2)? Please help. Thanks a lot. – Samsul Islam Mar 14 '17 at 09:42
  • @Rana They are using JavaScript to go to the next page. Please share the link of the website you are scraping, then I can help. – Umair Ayub Mar 14 '17 at 09:47
  • This is the link: http://jobs.bdjobs.com/jobsearch.asp?fcatId=8&icatId= and here is the image link: https://08733078838609164420.googlegroups.com/attach/58c611bdb536b/bdjobs.png?part=0.1&view=1&vt=ANaJVrFV2Kp9maNcNVw_pnDzbEWDvQejRqvMBghGrENS6EVynyIj7g5IwbPrpM6DxPh7P6Jjq8lQ4m67258zDY5869_nxjWhhGwdtgbWckHnpgxpZel_4vM – Samsul Islam Mar 14 '17 at 10:06
  • @Rana They are using a POST request to navigate to the next page. You can see the POST URL and form parameters in the Inspect view. You have to reproduce that request in Scrapy. Here is example PHP cURL code for it: http://pastebin.com/utQYicA7 Notice the `pg=15` string; it defines the page number. – Umair Ayub Mar 14 '17 at 10:13
  • Thanks for your help. Could you please point me to some Python-related documentation so that I can understand it? Thanks a lot. – Samsul Islam Mar 16 '17 at 06:55
  • I have solved my problem, thanks Umair. Please see my code: https://github.com/ranafge/scrapy/blob/master/bdjobsajaxfinallydone.py – Samsul Islam Jan 16 '18 at 17:08