I'm trying to deploy a spider to Scrapinghub and can't figure out how to handle a data-input problem. I need to read IDs from a CSV and append them to my start URLs via a list comprehension for the spider to crawl:
import pkgutil

import scrapy

class exampleSpider(scrapy.Spider):
    name = "exampleSpider"

    # local method to extract data
    # PID = pd.read_csv('resources/PID_list.csv')

    # scrapinghub method
    csvdata = pkgutil.get_data("exampleSpider", "resources/PID_list.csv")
    start_urls = ['http://www.example.com/PID=' + str(x) for x in csvdata]
The requirements file and the pkgutil.get_data call work, but I'm stuck on converting the raw data into the list. What's the process for turning that data into something the list comprehension can iterate over?
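For context, here is a minimal sketch of one way to go from the raw bytes to a list of IDs, assuming the CSV holds one ID per line (the fake payload below stands in for the real file, which I can't reproduce here):

```python
import csv
from io import StringIO

# Simulated payload: pkgutil.get_data() returns the file contents as bytes.
csvdata = b"0001\n0002\n0003\n"

# Decode to text, then let csv.reader split each row into fields.
reader = csv.reader(StringIO(csvdata.decode("utf-8")))
pids = [row[0] for row in reader]

start_urls = ['http://www.example.com/PID=' + pid for pid in pids]
```

Iterating over the bytes directly (as in the snippet above) walks the data byte by byte, which is why the list comprehension never sees whole IDs.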
EDIT: Thanks! This got me 90% of the way there!
import csv
import pkgutil
from io import StringIO

import scrapy

class exampleSpider(scrapy.Spider):
    name = "exampleSpider"

    # local method to extract data
    # PID = pd.read_csv('resources/PID_list.csv')

    # scrapinghub method: get_data() returns bytes on Python 3,
    # so decode before wrapping in StringIO
    csvdata = pkgutil.get_data("exampleSpider", "resources/PID_list.csv")
    csvio = StringIO(csvdata.decode("utf-8"))
    raw = csv.reader(csvio)
    start_urls = ['http://www.example.com/PID=' + str(x[0]) for x in raw]
The str(x) needed to become str(x[0]) as a quick fix: csv.reader yields each row as a list, so str(x) stringifies the whole list, and the square brackets get URL-encoded, which broke the links:

str(x)
resulted in "http://www.example.com/PID=%5B'0001'%5D"
but str(x[0])
pulls the value out of the list: "http://www.example.com/PID=0001"
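The difference is easy to see in isolation, using a hypothetical row in the same ID format:

```python
row = ['0001']  # csv.reader yields each row as a list of strings

# Stringifying the whole row drags the list syntax into the URL,
# and the brackets get percent-encoded as %5B / %5D:
bad_url = 'http://www.example.com/PID=' + str(row)     # ...PID=['0001']

# Indexing first gives just the field value:
good_url = 'http://www.example.com/PID=' + str(row[0])  # ...PID=0001
```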