
I have a list of NPIs for which I want to scrape the provider names from npidb.org. The NPI values are stored in a csv file.

I am able to do it manually by pasting the URLs into the code. However, I cannot figure out how to do it for a whole list of NPIs, fetching the provider name for each one.

Here is my current code:

import scrapy


class MySpider(scrapy.Spider):
    name = "npidb"

    def start_requests(self):
        urls = [

            'https://npidb.org/npi-lookup/?npi=1366425381',
            'https://npidb.org/npi-lookup/?npi=1902873227',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-1]
        filename = 'npidb-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
  • you want to provide all the `npi` values from a command line? text file? – eLRuLL Jan 18 '17 at 17:36
  • The NPIs are stored in a csv file that was derived from another code. – infinite-rotations Jan 18 '17 at 17:56
  • What's the structure of the csv file? If you had each URL as an entry per line, you could write something like: open(file_name).read().split() and get a list of all the lines. – Horia Coman Jan 18 '17 at 18:01
  • They are only the NPIs in there. The challenge is to paste them in the URL and get a Name associated with each NPI. This is a really easy thing probably, but I am a complete newbie and unable to crack it. – infinite-rotations Jan 18 '17 at 18:27

2 Answers


Assuming you have a list of NPIs from the csv file, you can simply use `format` to build the website address for each one, as follows (I also added the part that reads the list from the csv file; if you already have it, you can omit that part):

    def start_requests(self):
        # get npis from the csv file
        npis = []
        with open('test.csv', 'r') as f:
            for line in f:
                npis.append(line.strip())
        # generate the list of addresses from the npis
        start_urls = []
        for npi in npis:
            start_urls.append('https://npidb.org/npi-lookup/?npi={}'.format(npi))
        for url in start_urls:
            yield scrapy.Request(url=url, callback=self.parse)
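
If the csv file ever gains extra columns, the stdlib `csv` module is safer than splitting lines by hand. A minimal sketch of the same URL-building step, assuming the NPI sits in the first column (the sample string stands in for the `test.csv` contents, which aren't shown in the question):

```python
import csv
import io

# Stand-in for the contents of test.csv: one NPI per row, first column.
sample = "1366425381\n1902873227\n"

npis = []
for row in csv.reader(io.StringIO(sample)):
    if row:  # skip any blank lines
        npis.append(row[0].strip())

start_urls = ['https://npidb.org/npi-lookup/?npi={}'.format(npi) for npi in npis]
print(start_urls[0])  # https://npidb.org/npi-lookup/?npi=1366425381
```

In a spider you would replace `io.StringIO(sample)` with `open('test.csv')`.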
– tomcy

Well, it depends on the structure of your csv file, but if it contains one NPI per line, you could do something like:

def start_requests(self):
    with open('npis.csv') as f:
        for line in f:
            yield scrapy.Request(
                url='https://npidb.org/npi-lookup/?npi={}'.format(line.strip()), 
                callback=self.parse
            )
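
One detail worth noting: the question's `parse` method names files with `response.url.split("/")[-1]`, which for these URLs yields the whole query string (`?npi=1366425381`) rather than a bare NPI. A sketch of recovering just the NPI from the URL with the stdlib `urllib.parse`, so the saved file can be named after it (extracting the provider name itself would require selectors matched to npidb.org's markup, which isn't shown here):

```python
from urllib.parse import urlparse, parse_qs

def npi_from_url(url):
    """Return the value of the npi query parameter from a lookup URL."""
    query = parse_qs(urlparse(url).query)
    return query.get('npi', [''])[0]

url = 'https://npidb.org/npi-lookup/?npi=1366425381'
filename = 'npidb-{}.html'.format(npi_from_url(url))
print(filename)  # npidb-1366425381.html
```

Inside `parse`, the same call on `response.url` gives a clean per-NPI filename.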
– eLRuLL