
I have a Scrapy spider which gets its start_urls from a MySQL database. Each page it scrapes comes back with an unknown number of links: anywhere from zero up to 10 per page. Because that number is unknown, I don't know how best to have the pipeline UPDATE the original database with all of the possible scraped links, so instead I have it dumping the start_url and each scraped link into a new database. However, since I am using a new database, I would like to bring the searchterm column value for each start_url over into the new database as well.

If I could grab the searchterm column for each start_url, I could pipe it into the new database. Alternatively, if someone has a different idea on how to UPDATE the original database with an unknown quantity of scraped links, that would work as well.

Here is the spider.py; I have commented out the offending lines:

import scrapy
import MySQLdb
import MySQLdb.cursors
from scrapy.http.request import Request

from youtubephase2.items import Youtubephase2Item

class youtubephase2(scrapy.Spider):
    name = 'youtubephase2'

    def start_requests(self):
        conn = MySQLdb.connect(user='uname', passwd='password', db='YouTubeScrape', host='localhost', charset="utf8", use_unicode=True)
        cursor = conn.cursor()
        cursor.execute('SELECT * FROM SearchResults;')
        rows = cursor.fetchall()

        for row in rows:
            if row:
                #yield Request(row[0], self.parse, meta=dict(searchterm=row[0]))
                yield Request(row[1], self.parse, meta=dict(start_url=row[1]))
        cursor.close()
        conn.close()

    def parse(self, response):
        for sel in response.xpath('//a[contains(@class, "yt-uix-servicelink")]'):
            item = Youtubephase2Item()
            #item['searchterm'] = response.meta['searchterm']
            item['start_url'] = response.meta['start_url']
            item['affiliateurl'] = sel.xpath('@href').extract_first()
            yield item
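
For reference, the pipeline writing to the new database follows the standard MySQLdb insert pattern. Here is a sketch; the YouTubeScrapeNew database and ScrapedLinks table names are placeholders rather than my real schema:

import MySQLdb

class Youtubephase2Pipeline(object):
    def open_spider(self, spider):
        # Connect once when the spider starts
        self.conn = MySQLdb.connect(user='uname', passwd='password', db='YouTubeScrapeNew', host='localhost', charset='utf8', use_unicode=True)
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        # One row per scraped link, carrying the originating start_url along
        self.cursor.execute(
            'INSERT INTO ScrapedLinks (start_url, affiliateurl) VALUES (%s, %s)',
            (item['start_url'], item['affiliateurl']))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()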
SDailey

1 Answer


I'm not sure if I understand you correctly, but you can carry several values in meta.

Assuming you have this table:

# table1
ID|URLS     | NAME      | ADDRESS    |
0 |foo.com  | foo       | foo 1      |
1 |bar.com  | bar       | bar 1      |

Yield a request for every row, then in the parse callback yield as many items as you want into the new table:

def start_requests(self):
    rows = ...
    for row in rows:
        url = row[1]
        yield Request(url, meta={'row': row})

def parse(self, response):
    links = ...
    row = response.meta['row']
    for link in links:
        item = dict()
        item['urls'] = row[1]
        item['name'] = row[2]
        item['address'] = row[3]
        # add some stuff...
        item['link'] = link
        yield item

Save all of the items to the database and you'll end up with:

# table2
ID|URLS     | NAME      | ADDRESS    | LINK     |
0 |foo.com  | foo       | foo 1      | link1    |
1 |foo.com  | foo       | foo 1      | link2    |
2 |foo.com  | foo       | foo 1      | link3    |
3 |bar.com  | bar       | bar 1      | link1    |
4 |bar.com  | bar       | bar 1      | link2    |
5 |bar.com  | bar       | bar 1      | link3    |
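
As a side note, on newer Scrapy releases (1.7+) cb_kwargs is the preferred way to pass per-request data into a callback instead of meta. A minimal sketch of the same idea:

def start_requests(self):
    rows = ...
    for row in rows:
        # cb_kwargs entries arrive as keyword arguments in the callback
        yield Request(row[1], callback=self.parse, cb_kwargs={'row': row})

def parse(self, response, row):
    links = ...
    for link in links:
        yield {'urls': row[1], 'name': row[2], 'address': row[3], 'link': link}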
Granitosaurus
  • Using your logic I was able to figure out that I could change my meta line from: yield Request(row[1], self.parse, meta=dict(start_url=row[1])) to: yield Request(row[1], self.parse, meta=dict(start_url=row[1], searchterm=row[0])) and get the other field I wanted. Thanks for the help. – SDailey Apr 15 '17 at 02:20