
I'm trying to deploy a spider to Scrapinghub and can't figure out how to handle a data input problem. I need to read IDs from a CSV and append them to my start URLs in a list comprehension for the spider to crawl:

import pkgutil

import scrapy

class exampleSpider(scrapy.Spider):
    name = "exampleSpider"

    #local scrapy method to extract data
    #PID = pd.read_csv('resources/PID_list.csv')

    #scrapinghub method
    csvdata = pkgutil.get_data("exampleSpider", "resources/PID_list.csv")


    start_urls = ['http://www.example.com/PID=' + str(x) for x in csvdata]

The requirements-file and pkgutil.get_data parts work, but I'm stuck on converting the data IO into a list. What's the process for turning the data call into the list comprehension?

EDIT: Thanks! This got me 90% of the way there!

import csv
import pkgutil
from io import StringIO

import scrapy

class exampleSpider(scrapy.Spider):
    name = "exampleSpider"

    #local scrapy method to extract data
    #PID = pd.read_csv('resources/PID_list.csv')

    #scrapinghub method
    csvdata = pkgutil.get_data("exampleSpider", "resources/PID_list.csv")
    csvio = StringIO(csvdata.decode('utf-8'))  # get_data() returns bytes on Python 3
    raw = csv.reader(csvio)

    # TODO : update code to get exact value from raw 
    start_urls = ['http://www.example.com/PID=' + str(x[0]) for x in raw]

The str(x) needed to be str(x[0]) as a quick fix, since the loop was reading the whole row (brackets and all) into the URL, which broke the links: str(x) produced "http://www.example.com/PID=%5B'0001'%5D", but str(x[0]) pulls the ID out of the list brackets: "http://www.example.com/PID=0001".
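The bracket issue is easy to reproduce in isolation: csv.reader yields each row as a list of strings, so calling str() on the whole row includes the brackets and quotes, which then get percent-encoded in the URL. A minimal sketch, independent of the spider:

```python
import csv
from io import StringIO

# csv.reader yields each row as a list of strings
row = next(csv.reader(StringIO("0001,0002,0003")))

print(str(row))     # "['0001', '0002', '0003']" -- brackets get percent-encoded in a URL
print(str(row[0]))  # "0001" -- the bare ID, safe to append to the URL
```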

mth10
  • How does PID_list.csv look like? Can you include a few lines? – user2314737 Apr 26 '19 at 06:30
  • The list is a csv that looks like: 0001,0002,0003 – mth10 Apr 27 '19 at 01:21
  • That looks like a comma-separated list (items are on the same line, separated by commas), not a CSV (rows are separated by new lines, columns are separated by a separating character like a comma). Can you confirm that the file is actually a comma-separated list of IDs? – Gallaecio Apr 30 '19 at 10:34
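The distinction raised in the comment matters for the list comprehension: csv.reader treats each line as one row, so a single-line file yields one row containing all the IDs, while a one-ID-per-line file yields one row per ID. A quick sketch of the difference:

```python
import csv
from io import StringIO

# all IDs on one line: csv.reader sees a single row with three columns
one_line = list(csv.reader(StringIO("0001,0002,0003")))
# -> [['0001', '0002', '0003']]

# one ID per line: csv.reader sees three rows of one column each
per_line = list(csv.reader(StringIO("0001\n0002\n0003")))
# -> [['0001'], ['0002'], ['0003']]
```

With the single-line layout, `x[0]` in the comprehension would only ever pick up the first ID, which may explain why the fix got the asker "90% of the way there".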

1 Answer

class exampleSpider(scrapy.Spider):
    name = "exampleSpider"

    #local scrapy method to extract data
    #PID = pd.read_csv('resources/PID_list.csv')

    #scrapinghub method
    csvdata = pkgutil.get_data("exampleSpider", "resources/PID_list.csv")
    csvio = StringIO(csvdata.decode('utf-8'))  # get_data() returns bytes on Python 3
    raw = csv.reader(csvio)

    # TODO : update code to get exact value from raw 
    start_urls = ['http://www.example.com/PID=' + str(x) for x in raw]

You can use StringIO to turn the string into a file-like object, which csv.reader can consume. I hope this helps :)
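The whole pipeline can be exercised without Scrapy or Scrapinghub. Note that on Python 3, pkgutil.get_data() returns bytes, so the data needs a .decode() before StringIO will accept it; the sample bytes below (one ID per row) stand in for the real resources/PID_list.csv:

```python
import csv
from io import StringIO

# pkgutil.get_data() returns bytes on Python 3; these sample bytes
# stand in for the real resources/PID_list.csv package resource
csvdata = b"0001\n0002\n0003\n"

csvio = StringIO(csvdata.decode("utf-8"))  # decode before wrapping
raw = csv.reader(csvio)

# the first column of each non-empty row becomes the ID in the URL
start_urls = ['http://www.example.com/PID=' + row[0] for row in raw if row]
# -> ['http://www.example.com/PID=0001', 'http://www.example.com/PID=0002',
#     'http://www.example.com/PID=0003']
```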

PyMaster