
This script needs to run all the way through RI_page_urls.csv, then run through all the resulting URLs from RI_License_urls.csv and grab the business info.

It's pulling all the URLs from RI_page_urls.csv, but then only running and printing the first of 100 URLs from RI_License_urls.csv. I need help figuring out how to make it wait for the first part to complete before running the second part.

I appreciate any and all help.

Here's a URL for RI_page_urls.csv to start with:

http://www.crb.state.ri.us/verify_CRB.php

and the code:

from bs4 import BeautifulSoup as soup
import requests as r
import pandas as pd
import re
import csv

#pulls lic# url
with open('RI_page_urls.csv') as f_input:
    csv_input = csv.reader(f_input)

    for url in csv_input:
        data = r.get(url[0])
        page_data = soup(data.text, 'html.parser')
        links = [r'www.crb.state.ri.us/' + link['href']
            for link in page_data.table.tr.find_all('a') if re.search('licensedetail.php', str(link))]

        df = pd.DataFrame(links)
        df.to_csv('RI_License_urls.csv', header=False, index=False, mode = 'a')
#Code Above works!

#need to pull table info from license url    
#this pulls the first record, but doesn't loop through the requests

with open('RI_License_urls.csv') as f_input_2:
    csv_input_2 = csv.reader(f_input_2)

    for url in csv_input_2:
        data = r.get(url[0])
        page_data = soup(data.text, 'html.parser')
        company_info = (' '.join(info.get_text(", ", strip=True).split()) for info in page_data.find_all('h9'))

        df = pd.DataFrame(info, columns=['company_info'])
        df.to_csv('RI_company_info.csv', index=False)
  • `df.to_csv('RI_company_info.csv', index=False)` repeatedly overwrites the contents of the file on each iteration – roganjosh Oct 01 '18 at 10:31
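
As the comment points out, to_csv with the default mode rewrites RI_company_info.csv from scratch on every pass through the loop, so only the last record survives. A minimal sketch of one fix, collecting every row first and writing the file once at the end (it assumes, as the answer below does, that the stored URLs need an http:// scheme prepended):

from bs4 import BeautifulSoup as soup
import requests as r
import pandas as pd
import csv

rows = []
with open('RI_License_urls.csv') as f_input_2:
    for url in csv.reader(f_input_2):
        data = r.get('http://' + url[0])
        page_data = soup(data.text, 'html.parser')
        # collect the cleaned-up text of every <h9> block on the page
        rows.extend(' '.join(info.get_text(', ', strip=True).split())
                    for info in page_data.find_all('h9'))

# a single write at the end instead of an overwrite inside the loop
pd.DataFrame(rows, columns=['company_info']).to_csv('RI_company_info.csv', index=False)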

1 Answer

Well, the question is a bit unclear, and there are also a couple of things wrong with the code.

data = r.get(url[0])

should be

data = r.get("http://" + url[0])

because requests needs URLs that start with http:// or https://, not www.

In the code below, info is not defined, so I just assumed it should be company_info:

 company_info = (' '.join(info.get_text(", ", strip=True).split()) for info in page_data.find_all('h9'))

        df = pd.DataFrame(info, columns=['company_info'])

Hence the full code is:

from bs4 import BeautifulSoup as soup
import requests as r
import pandas as pd
import re
import csv

#pulls lic# url
with open('RI_page_urls.csv') as f_input:
    csv_input = csv.reader(f_input)

    for url in csv_input:
        data = r.get(url[0])
        page_data = soup(data.text, 'html.parser')
        links = [r'www.crb.state.ri.us/' + link['href']
            for link in page_data.table.tr.find_all('a') if re.search('licensedetail.php', str(link))]

        df = pd.DataFrame(links)
        df.to_csv('RI_License_urls.csv', header=False, index=False, mode = 'a')
#Code Above works!

#need to pull table info from license url    
#this pulls the first record, but doesn't loop through the requests

with open('RI_License_urls.csv') as f_input_2:
    csv_input_2 = csv.reader(f_input_2)
    with open('RI_company_info.csv', 'a', buffering=1) as companyinfofiledescriptor:  # buffering=1 (line-buffered); buffering=0 is invalid for text-mode files
        for url in csv_input_2:
            data = r.get("http://"+url[0])
            page_data = soup(data.text, 'html.parser')
            company_info = (' '.join(info.get_text(", ", strip=True).split()) for info in page_data.find_all('h9'))

            df = pd.DataFrame(company_info, columns=['company_info'])
            df.to_csv(companyinfofiledescriptor, index=False)
            print(df)
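
Note that df.to_csv(companyinfofiledescriptor, index=False) repeats the company_info header row on every iteration. A minimal sketch of one way around that, writing the header only on the first pass:

from bs4 import BeautifulSoup as soup
import requests as r
import pandas as pd
import csv

with open('RI_License_urls.csv') as f_input_2, \
        open('RI_company_info.csv', 'a', buffering=1) as companyinfofiledescriptor:
    header_written = False
    for url in csv.reader(f_input_2):
        data = r.get("http://" + url[0])
        page_data = soup(data.text, 'html.parser')
        company_info = (' '.join(info.get_text(", ", strip=True).split())
                        for info in page_data.find_all('h9'))

        df = pd.DataFrame(company_info, columns=['company_info'])
        # emit the header row only once, then append bare data rows
        df.to_csv(companyinfofiledescriptor, index=False, header=not header_written)
        header_written = True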
  • Albin, it's throwing me this error when I run it: ValueError: can't have unbuffered text I/O – RobK Oct 01 '18 at 10:59
  • How did you run it? Try adding buffering=10 at this line: `with open('RI_company_info.csv','a',buffering=0)` – Albin Paul Oct 01 '18 at 11:05
  • OK, so I removed the buffering and it prints out to PowerShell just fine. But it's not writing to RI_company_info.csv. – RobK Oct 01 '18 at 11:10
  • Then it will write after you stop the program ;-) – Albin Paul Oct 01 '18 at 11:11
  • Well, isn't that interesting: it did write after it ran. So I've got 320 URLs in RI_page_urls.csv, which will generate 32,000 URLs to pull the company info from. Will it actually print the 32,000 lines of company info to PowerShell, without crashing, before it writes? – RobK Oct 01 '18 at 11:31
  • I don't know; that's why I added buffering. Set the buffering to a small value and you're good. – Albin Paul Oct 01 '18 at 11:35
  • OK, just so I understand: does the buffering allow the code to write to the CSV after a certain amount of time, then continue where it left off? – RobK Oct 01 '18 at 11:38
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/181071/discussion-between-albin-paul-and-robk). – Albin Paul Oct 01 '18 at 11:41
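
To answer that last question: buffering is not time-based. A buffered text file hands its contents to the operating system when the buffer fills up, when flush() is called, or when the file is closed, which is why the data only appeared after the program finished. A minimal sketch of forcing each row out immediately:

with open('RI_company_info.csv', 'a', buffering=1) as f:  # buffering=1 means line-buffered
    f.write('example row\n')
    f.flush()  # hand the row to the OS now instead of waiting for close()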