Extracting table rows with BeautifulSoup and syncing them with a pandas DataFrame

Question

I am trying to extract the values of SMILES String and Repeat_Unit from the table in the following webpage: https://khazana.gatech.edu/module_search/material_detail.php?id=1&m=9

Although this might not be the most efficient way, I can successfully extract those values with the following code:

from bs4 import BeautifulSoup
import requests

link='https://khazana.gatech.edu/module_search/material_detail.php?id=1&m=9'
link=requests.get(link)
soup=BeautifulSoup(link.text)

data=[]
tables=soup.find_all('table')

# the desired table is selected by list index because it has no distinguishing attributes
table_body=tables[9].find('tbody')
rows=table_body.findAll('tr')
for row in rows:
    cols=row.findAll('td')
    cols=[ele.text.strip() for ele in cols]
    data.append([ele for ele in cols if ele])

print (data[13][1])
print (data[14][1])

In my application, I need to extract the values of SMILES String and Repeat_Unit from thousands of similar web pages, whose URLs differ only in the number that appears after id= (1 in this example).

I have a pandas DataFrame in which one column holds the ids. To get the SMILES String and Repeat Unit for a given id, I modified the above code to:

data=[]
SMILES=[]
Repeat_Unit=[]
for index, prow in df.iterrows():
    a=prow['#id']
    link='https://khazana.gatech.edu/module_search/material_detail.php?id='+str(a)+'&m=9'
    link=requests.get(link)
    soup=BeautifulSoup(link.text)
    tables=soup.find_all('table')
    for table in tables:
        table_body=tables[9].find('tbody') 
        rows=table_body.findAll('tr')
        for row in rows:
            cols=row.findAll('td')
            cols=[ele.text.strip() for ele in cols]
            data.append([ele for ele in cols if ele])
            SMILES.append(data[13][1])
            Repeat_Unit.append(data[14][1])

Now, when I run this to fill SMILES and Repeat_Unit, I get the following error:

IndexError                                Traceback (most recent call last)
<ipython-input-55-74f7ef016c59> in <module>()
     36             cols=[ele.text.strip() for ele in cols]
     37             data.append([ele for ele in cols if ele])
---> 38             SMILES.append(data[13][1])
     39             Repeat_Unit.append(data[14][1])

IndexError: list index out of range

Even if I loop through the data before appending to SMILES, I still get the same error.

Thank you in advance for your help!
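
One plausible reading of the traceback: `SMILES.append(data[13][1])` sits inside the `for row in rows` loop, so it runs right after the first row is appended, when `data` holds a single entry and index 13 does not exist yet. The sketch below reproduces that with made-up rows (the `rows` list and its values are placeholders, not data from the site) and shows the same lookup succeeding once it is moved after the loop.

```python
# Placeholder rows standing in for the parsed <tr> cells of one page
# (hypothetical values; the real table has its own labels and order).
rows = [["label %d" % i, "value %d" % i] for i in range(20)]

# Indexing inside the loop: data has a single entry on the first pass.
data = []
error = None
try:
    for cols in rows:
        data.append(cols)
        _ = data[13][1]  # fails while len(data) < 14
except IndexError as exc:
    error = exc
print("inside the loop:", repr(error))

# Indexing after the loop: every row has been collected first.
data = []
for cols in rows:
    data.append(cols)
value_13 = data[13][1]
value_14 = data[14][1]
print("after the loop:", value_13, value_14)
```

In the loop from the question, `data` would also need to be reset for each page; otherwise indices 13 and 14 keep pointing at rows collected from the first page.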

If I use your first piece of code I get an error. That's because the `table` doesn't have a `tbody` – Vivek Kalyanarangan, Aug 12 '18 at 06:38
There is a problem with the image in `Repeat Unit`; check https://khazana.gatech.edu/module_search/material_detail.php?id=5&m=9. What is expected for these values? Do you need to omit the `SMILES String - Repeat Unit` pair? Or replace the image with an empty string? Or something different? – jezrael, Aug 12 '18 at 10:48
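
The first comment can be illustrated with a small sketch. Whether `find('tbody')` returns anything depends on the markup and the parser: `html.parser` keeps the tree as written, whereas `html5lib` adds a `tbody` the way browsers do. The HTML below is a made-up fragment, not the actual page, and the cell values are placeholders.

```python
from bs4 import BeautifulSoup

# Made-up fragment: a table written without <tbody>, as in the comment.
html = """
<table>
  <tr><td>SMILES String</td><td>CCCC</td></tr>
  <tr><td>Repeat Unit</td><td>CH2-CH2</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

# No <tbody> in the source, so there is nothing to find here.
table_body = table.find("tbody")
print(table_body)

# Falling back to the table itself avoids calling .find_all on None.
container = table_body if table_body is not None else table
rows = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in container.find_all("tr")
]
print(rows)
```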

jezrael · Accepted Answer · 2018-08-12T12:04:34.647

Use:

import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup

s = ['SMILES String', 'Repeat Unit']
N = 10
data = []

for a in np.arange(1, N + 1):
    link = 'https://khazana.gatech.edu/module_search/material_detail.php?id=' + str(a) + '&m=9'
    page = requests.get(link)
    soup = BeautifulSoup(page.text, 'lxml')
    d = {}
    for x in s:
        # find the label text, then take the first child of the next <td>
        # https://stackoverflow.com/a/5999786/2901002
        out = soup.find(text=x).parent.findNext('td').contents[0]
        d[x] = out
    data.append(d)

df = pd.DataFrame(data)
print(df)
                                         Repeat Unit  \
0                    C5O3(CH2-OH)-O-C5O3(CH2-OH)-O     
1                                      Polystyrene     
2                          CH2-CH(CH3)-CH2-CH(CH3)     
3                                  CHF-CF2-CHF-CF2     
4  <img border="0" height="60" src="block_images/...   
5                                CNS-C6H3-CSN-C6H3     
6                                    CH(CF3)-O-CH2     
7                                      (CH2)5-O-CO     
8                                CH2-CH2-C(CF3)2-O     
9  <img border="0" height="60" src="block_images/...   

                                       SMILES String  
0                            C(C(O)C1(O))C(CO)OC1O    
1  CC(C1=CC=CC=C1)CC(C2=CC=CC=C2)CC(C3=CC=CC=C3)C...  
2                                  CC(C)CC(C)CC(C)    
3                                      C(F)C(F)(F)    
4                           C(S1)=CC=C1C(S2)=CC=C2    
5           C(OC1=C2)=NC1=CC=C2C(OC3=C4)=NC3=CC=C4    
6                   C(C(F)(F)(F))OCC(C(F)(F)(F))OC    
7                           CCCCCOC(=O)CCCCCOC(=O)    
8  CCC(C(F)(F)(F))(C(F)(F)(F))OCCC(C(F)(F)(F))(C(...  
9                                             CCCC
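
Rows 4 and 9 above hold an `<img>` tag instead of text, and the loop runs over a fixed range of ids instead of the ids in an existing DataFrame. The sketch below is one way to address both, under stated assumptions: image-only cells are mapped to `None` (the question does not say what is wanted there), the id column is named `#id` as in the question, and `extract_fields`, `scrape_ids` and `fake_pages` are hypothetical names. The pages are made-up HTML so the example runs offline.

```python
import pandas as pd
from bs4 import BeautifulSoup

LABELS = ["SMILES String", "Repeat Unit"]


def extract_fields(html, labels=LABELS):
    """Return {label: text} for one page; image-only cells become None."""
    soup = BeautifulSoup(html, "html.parser")
    record = {}
    for label in labels:
        cell = soup.find(string=label).parent.find_next("td")
        # Assumption: a cell holding an <img> has no usable text value.
        has_image = cell.find("img") is not None
        record[label] = None if has_image else cell.get_text(strip=True)
    return record


def scrape_ids(ids, fetch):
    """Build one row per id; `fetch` maps an id to that page's HTML."""
    rows = [{"#id": i, **extract_fields(fetch(i))} for i in ids]
    return pd.DataFrame(rows)


# Made-up pages standing in for requests.get(url).text
fake_pages = {
    1: "<table><tr><td>SMILES String</td><td>CCCC</td></tr>"
       "<tr><td>Repeat Unit</td><td>CH2-CH2</td></tr></table>",
    5: "<table><tr><td>SMILES String</td><td>C(F)C(F)(F)</td></tr>"
       "<tr><td>Repeat Unit</td><td><img src='unit.png'></td></tr></table>",
}

df = pd.DataFrame({"#id": [1, 5], "name": ["a", "b"]})
scraped = scrape_ids(df["#id"].tolist(), fake_pages.__getitem__)

# A left merge keeps every row of the original frame, aligned by id.
merged = df.merge(scraped, on="#id", how="left")
print(merged)
```

In real use, `fetch` would wrap `requests.get(...).text` for the URL pattern shown in the question. Merging on the id column keeps the scraped values aligned with the original rows even if some ids are skipped.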

edited Aug 12 '18 at 12:04

answered Aug 12 '18 at 07:42


It does not work on Python 3 though. When I ran it on the full dataset (thousands of entries), it raised the following error at the line `out = soup.find(text=x).parent.findNext('td').contents[0]`: `AttributeError: 'NoneType' object has no attribute 'parent'` – A.E Aug 12 '18 at 19:02
@A.E - it means a value from the `s` list was not found on the page; just tested. – jezrael Aug 12 '18 at 19:07
The problem is with `https://khazana.gatech.edu/module_search/material_detail.php?id=253&m=9` – jezrael Aug 12 '18 at 19:18
@jezrael - I had already noticed that there is no id=253&m=9 or id=308&m=9 and took care of it by looping through the ids in my original dataset, but it still returns this error. Do you mind if I send you my CSV file and my Jupyter notebook? – A.E Aug 12 '18 at 19:28
@jezrael-Thanks! I just did. – A.E Aug 12 '18 at 19:45
Just sent 2 possible solutions for this problem by email. – jezrael Aug 12 '18 at 19:51
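
The `AttributeError` discussed in these comments arises when `soup.find(...)` returns `None`, that is, when a label is absent from the page (a missing id is one way this can happen; exact string matching would also miss a label that carries extra whitespace). A minimal sketch of a guarded lookup on made-up HTML follows; `find_value` is a hypothetical helper name, and returning `None` for a missing label is one choice among several.

```python
from bs4 import BeautifulSoup


def find_value(soup, label):
    """Return the text of the cell after `label`, or None if absent."""
    node = soup.find(string=label)
    if node is None:  # label not on this page
        return None
    cell = node.parent.find_next("td")
    if cell is None:  # label present but no following cell
        return None
    return cell.get_text(strip=True)


page_ok = BeautifulSoup(
    "<table><tr><td>SMILES String</td><td>CCCC</td></tr></table>",
    "html.parser",
)
page_missing = BeautifulSoup("<p>No such material</p>", "html.parser")

found = find_value(page_ok, "SMILES String")
missing = find_value(page_missing, "SMILES String")
print(found, missing)
```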
