
I am new to Scrapy and Python in general, and I am trying to make a scraper that extracts links from a page, edits those links, and then goes through each one of them. I am using Playwright with Scrapy.

This is where I am at, but for some reason it only scrapes the first link.

 def parse(self, response):
        for link in response.css('div.som a::attr(href)'):
            yield response.follow(link.get().replace('docs', 'www').replace('com/', 'com/#'),
                                  cookies={'__utms': '265273107'},
                                  meta=dict(
                                      playwright=True,
                                      playwright_include_page=True,
                                      playwright_page_coroutines=[
                                          PageCoroutine('wait_for_selector', 'span#pple_numbers')]
                                  ),
                                  callback=self.parse_c)

    async def parse_c(self, response):
        yield {
            'text': response.css('div.pple_numb span::text').getall()
        }
nas22663

2 Answers


It would be nice if you could add more details about the data you are trying to get. Therefore, could you add the indicated line to see if it is going through different links?

 def parse(self, response):
        for link in response.css('div.som a::attr(href)'):
            print(link)  # <-- could you add this line to check if it prints all the links?
pedro_bb7
  • Yes, it is capturing all the links just fine with the print, but I am not sure whether it is supposed to be a string or a list; when I print the type it gets me this (a type check is sketched after this thread) – nas22663 Jan 11 '22 at 00:16
  • You could also print response before the for loop to check that the data is coming in, so we get a better idea about the problem. – pedro_bb7 Jan 11 '22 at 00:38
  • Everything is normal until it starts scraping; it goes through the first link (link.get) only, and I tried to put it in a list, same issue. Maybe the callback function is somehow wrong and is supposed to iterate through all the links, but I honestly have no idea; as I said, I am very much a newbie. – nas22663 Jan 11 '22 at 04:58
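
For reference, response.css('div.som a::attr(href)') returns a SelectorList of Selector objects, and calling .get() on any of them returns a plain str URL. A quick way to confirm the types, using the question's selector:

 def parse(self, response):
        links = response.css('div.som a::attr(href)')
        print(type(links))            # SelectorList
        print(type(links[0]))         # Selector
        print(type(links[0].get()))   # str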

According to the documentation, there are two methods for following links:

  1. follow:

Return a Request instance to follow a link url. It accepts the same arguments as Request.__init__ method, but url can be not only an absolute URL, but also a relative URL, a Link object, e.g. the result of Link Extractors, ...

  2. follow_all:

A generator that produces Request instances to follow all links in urls. It accepts the same arguments as the Request’s __init__ method, except that each urls element does not need to be an absolute URL, it can be any of the following: a relative URL, a Link object, e.g. the result of Link Extractors, ...

If you try your code with follow_all instead of follow, it will probably do the trick; a sketch is below.
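
Assuming the same selector, cookies, and Playwright settings as in the question (those details come from the original code, not from the docs), a follow_all version might look like this:

 def parse(self, response):
        # build the full list of rewritten URLs first
        links = [
            link.get().replace('docs', 'www').replace('com/', 'com/#')
            for link in response.css('div.som a::attr(href)')
        ]
        # follow_all yields one Request per URL, all sharing the same settings
        yield from response.follow_all(
            links,
            cookies={'__utms': '265273107'},
            meta=dict(
                playwright=True,
                playwright_include_page=True,
                playwright_page_coroutines=[
                    PageCoroutine('wait_for_selector', 'span#pple_numbers')]
            ),
            callback=self.parse_c)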

pedro_bb7
  • No, that wasn't it, bud. The problem was that I add com/#, so it treats it as a duplicate request or something, so I just added dont_filter=True to the response.follow call and all went OK. Thanks a lot man, really appreciate your help. – nas22663 Jan 12 '22 at 09:11
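
For future readers: the replace('com/', 'com/#') rewrite turns the rest of each URL into a #fragment, and Scrapy's duplicate filter canonicalizes URLs with the fragment stripped, so every rewritten link fingerprints as the same request and only the first one is crawled. A minimal sketch of the fix the asker describes, applied to the original parse method:

 def parse(self, response):
        for link in response.css('div.som a::attr(href)'):
            yield response.follow(
                link.get().replace('docs', 'www').replace('com/', 'com/#'),
                cookies={'__utms': '265273107'},
                meta=dict(
                    playwright=True,
                    playwright_include_page=True,
                    playwright_page_coroutines=[
                        PageCoroutine('wait_for_selector', 'span#pple_numbers')]
                ),
                callback=self.parse_c,
                # the URLs differ only in their fragment, which the dupefilter
                # ignores, so bypass the duplicate check explicitly
                dont_filter=True)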