I have a Scrapy spider which gets its start_urls from a MySQL database. When it scrapes each page it comes back with an unknown number of links: anywhere from zero to about ten links per page. Because that number is unknown, I don't know how best to have the pipeline UPDATE the original database with all of the possible scraped links, so instead I have it dumping each start_url and scraped link into a new database. However, since I am using a new database, I would like to bring over the searchterm column value for each start_url into the new database as well.
If I could grab the searchterm column for each start_url, I could pipe it into the new database. Alternatively, if someone has a different idea on how to UPDATE the original database with an unknown quantity of scraped links, that would work as well.
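For context, here is a simplified sketch of the pipeline I am using to dump results into the new database; the database, table, and column names here are placeholders, not my exact schema:

import MySQLdb

class Youtubephase2Pipeline(object):
    # Sketch of my current pipeline: inserts each start_url/affiliateurl
    # pair into the new database. 'NewDatabase' and 'ScrapedLinks' are
    # placeholder names.
    def open_spider(self, spider):
        self.conn = MySQLdb.connect(user='uname', passwd='password',
                                    db='NewDatabase', host='localhost',
                                    charset="utf8", use_unicode=True)
        self.cursor = self.conn.cursor()

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()

    def process_item(self, item, spider):
        self.cursor.execute(
            'INSERT INTO ScrapedLinks (start_url, affiliateurl) VALUES (%s, %s)',
            (item['start_url'], item['affiliateurl']))
        self.conn.commit()
        return item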
Here is the spider.py; I have commented out the offending lines:
import scrapy
import MySQLdb
import MySQLdb.cursors
from scrapy.http.request import Request

from youtubephase2.items import Youtubephase2Item


class youtubephase2(scrapy.Spider):
    name = 'youtubephase2'

    def start_requests(self):
        conn = MySQLdb.connect(user='uname', passwd='password', db='YouTubeScrape',
                               host='localhost', charset="utf8", use_unicode=True)
        cursor = conn.cursor()
        cursor.execute('SELECT * FROM SearchResults;')
        rows = cursor.fetchall()

        for row in rows:
            if row:
                #yield Request(row[0], self.parse, meta=dict(searchterm=row[0]))
                yield Request(row[1], self.parse, meta=dict(start_url=row[1]))
        cursor.close()

    def parse(self, response):
        for sel in response.xpath('//a[contains(@class, "yt-uix-servicelink")]'):
            item = Youtubephase2Item()
            #item['searchterm'] = response.meta['searchterm']
            item['start_url'] = response.meta['start_url']
            item['affiliateurl'] = sel.xpath('@href').extract_first()
            yield item
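To show what I am aiming for, this is roughly how I imagine carrying both values through meta, assuming searchterm is in row[0] and the URL is in row[1] of my SearchResults table (I am not sure this is the right approach, which is why those lines are commented out above):

# In start_requests(), pass both values through meta
# (assuming row[0] is the searchterm and row[1] is the URL):
yield Request(row[1], self.parse,
              meta=dict(searchterm=row[0], start_url=row[1]))

# ...and in parse(), copy both on to the item:
item['searchterm'] = response.meta['searchterm']
item['start_url'] = response.meta['start_url']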