Web Scraping Incentive Table

Question

I am trying to web scrape the Incentive Step Tracker table from the URL below. I am only interested in Small Residential Storage.

I got somewhere close but not exactly the full table. Please help finish my code and transform the result into CSV format so I can save to a local folder.

Here is my code:

# import libraries
from bs4 import BeautifulSoup
import urllib.request
import csv

urlpage='https://www.selfgenca.com/home/program_metrics/'

page = urllib.request.urlopen(urlpage)
# parse the html using beautiful soup and store in variable 'soup'
soup = BeautifulSoup(page, 'html.parser')
print(soup)

table = soup.find('table',{'class': 'table'}).find_all('tbody',{'data-t': 'Small Residential Storage'})[0]
results = table.find_all('tr')
print(results)

Here is the table I want to scrape:

Ideal Output Table

https://www.selfgenca.com/terms_of_use/ **You may not use any robot, spider or other automatic device, process or means to access, retrieve, scrape, reverse engineer, compile, create derivative works, publically display or otherwise distribute any portion of the Site or the Platform.** — Mike Scotty, Jan 21 '20 at 21:19
Is the problem that you don't see the values or is there another problem? — GabbeHags, Jan 21 '20 at 21:56
@JackFleeting I want the table of Small Residential Storage ONLY in a clean tabular format that I can export to a csv file. — emily_relax, Jan 21 '20 at 22:02
@GabbeHags Hi, I just added my current output to the last paragraph now. The content is correct, but I would like to have a header added and also transform everything to a tabular format so that I can export to csv. The header should be in the ... — emily_relax, Jan 21 '20 at 22:07
So is this the output you want? ```Small Residential Storage Step Status Waitlist Waitlist Waitlist Waitlist ``` — GabbeHags, Jan 21 '20 at 22:29
@GabbeHags Sorry for the confusion, I want a result when you select the budget category equal to Small Residential Storage. Let me add the image of the output table that I want to the post. — emily_relax, Jan 21 '20 at 22:34
@GabbeHags Please take a look at the picture link I just added. I am really just hope to get the whole table when tbody is Small Residential Storage. — emily_relax, Jan 21 '20 at 22:40
Which part are you struggling with, exactly? Also, please do not share code/data as images. See: https://meta.stackoverflow.com/q/303812/11301900. — AMC, Jan 22 '20 at 00:37

score 0 · Accepted Answer · answered Jan 22 '20 at 01:35

0

I think it can be done with pandas, with these changes to your code above:

import pandas as pd

#get the headers
tab = soup.find('table',{'class': 'table'}).find_all('tr',{'class': 'head-row'})
headers=[]
for h in tab[0].find_all('td'):
   headers.append(h.text)

and create a dataframe

final = []
for res in results:
    tmp = []
    for r in res:
        if not 'NavigableString' in str(type(r)):
            tmp.append(r.text.strip())
    final.append(tmp)

df = pd.DataFrame(final,columns=headers)
df

Output looks like the table you want.

answered Jan 22 '20 at 01:35

Jack Fleeting

24,385
6
23
45

Hi Jack, is it possible to add another row to the dataset -- incentive rates (Small Residential Storage only)? For example, it is all at $0.25/Wh now for Small Residential Storage. – emily_relax Jan 23 '20 at 23:15
@FiFiFiona - Not sure what you mean; can you post a sample row like that? – Jack Fleeting Jan 24 '20 at 17:55
Hi Jack, thank you for responding. The code in the next comment actually worked perfectly for my situation. It added two rows to the dataframe, which comes from the second table of incentive rates (ie, 0.25,0.25,0.25,0.25) – emily_relax Jan 24 '20 at 20:47

score 0 · Answer 2 · answered Jan 24 '20 at 05:09

0

row1 = soup.find_all('table',{'class': 'table'})[1].find_all('tr',{'class': 'head-row'})[1]
line1=[]
for h in row1.find_all('td'):
   line1.append(h.text)
row2 = soup.find_all('table',{'class': 'table'})[1].find_all('tr',{'class': ''})[3]
line2=[]
for h in row2.find_all('td'):
   line2.append(h.text)
df1=pd.DataFrame([line1,line2], columns=headers)
df2=df.append(df1,ignore_index=True)
df2

answered Jan 24 '20 at 05:09

Bear

11

Try explaining how your code solves the problem posted by OP. – Pratik Bhavsar Jan 24 '20 at 05:18
It worked!!! I successfully added the new rows of incentive rates of $0.25. Thanks Bear, love you so much! – emily_relax Jan 24 '20 at 05:40

Web Scraping Incentive Table

2 Answers2