How to extract HTML text as a list, one entry per input ID, using Python web scraping? My code only gives the first entry's output

Question

I am new to Python and programming. I am trying to extract PubChem IDs from a database called IMPPAT (https://cb.imsc.res.in/imppat/home). I have a list of chemical IDs from the database for a herb; following each chemical ID's hyperlink gives details on its PubChem ID and SMILES data.

I have written a Python script that takes each chemical ID as input, looks for the PubChem ID in the HTML page, and prints the output to a text file using web scraping.

I am finding it difficult to get all the data as output. I am pretty sure there is an error in the for loop, as it prints only the first output many times instead of a different output for each input.

Please help with this.

Also, I don't know how to save a file in which each input and its corresponding output are printed side by side. Please help.

import requests
import xmltodict
from pprint import pprint
import time
from bs4 import BeautifulSoup
import json
import pandas as pd
import os
from pathlib import Path
from tqdm.notebook import tqdm

cids = 'output.txt'

df = pd.read_csv(cids, sep='\t')
df

data = []

for line in df.iterrows():
    
out = requests.get(f'https://cb.imsc.res.in/imppat/Phytochemical-detailedpage-auth/CID%{line}')
    
    soup = BeautifulSoup(out.text, "html.parser")
    
    if soup.status_code == 200:
        script_data = soup.find('div', {'class': 'views-field views-field-Pubchem-id'}).find('span', {'class': 'field-content'}).find('h3')
    #print(script_data.text)
    
    for text in script_data:
        
        texts = script_data.get_text()
        
        print(text)
    
    data.append(text)
   
    
print(data)
    

The input file consists of:

cids
0   3A155934
1   3A117235
2   3A12312921
3   3A12303662
4   3A225688
5   3A440966
6   3A443160
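
For reference, the `%3A` in these URLs is the percent-encoding of a colon, so `CID%3A155934` decodes to `CID:155934`. The sketch below builds the request URLs from a few of the listed IDs using only the standard library. The helper name `build_url` and the inline sample text are illustrative assumptions, not part of the original script, and the sample assumes the file holds a `cids` header followed by one ID per line.

```python
import csv
import io
from urllib.parse import quote, unquote

BASE = "https://cb.imsc.res.in/imppat/Phytochemical-detailedpage-auth/"

# Inline stand-in for the tab-separated input file (assumed layout)
SAMPLE = "cids\n3A155934\n3A117235\n3A12312921\n"


def build_url(cid: str) -> str:
    """Return the detail-page URL for an ID stored as '3A<number>' (hypothetical helper)."""
    return f"{BASE}CID%{cid}"


reader = csv.DictReader(io.StringIO(SAMPLE), delimiter="\t")
urls = [build_url(row["cids"]) for row in reader]

for url in urls:
    print(url)

# '%3A' is an encoded ':', so the last path segment decodes to 'CID:<number>'
print(unquote(urls[0].rsplit("/", 1)[-1]))
print(quote("CID:155934"))
```

Each ID ends up in exactly one URL, which is the property the loop in the script needs to preserve.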

I haven't checked the entire code, but in `texts = script_data.get_text()` the result is stored in `texts`, while the loop variable `text` (without the `s`) is what gets printed and appended; `texts` is never used. – shoaib30, Jul 27 '21 at 07:03
Consider using a list as input instead of a CSV in future posts; this makes it easier for people to reproduce and test your code. – willwrighteng, Jul 27 '21 at 18:18

score 0 · Accepted Answer · answered Jul 27 '21 at 12:17

There are a few things you need to correct in your code.

The out variable is incorrectly indented; it must be inside the for loop.
The status code should be checked on the response object, i.e. out, not soup.
You don't need the second loop, as each response contains only a single PubChem ID, which you are already collecting in the script_data variable.
Lastly, you can use pandas to associate each chemical ID with its PubChem ID and then write the result to a CSV file.
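
The third correction above (one lookup per page, no inner loop) can be sketched offline as follows. The HTML snippet is a made-up stand-in that mimics the class names used in the question, not the real IMPPAT page, and `extract_pubchem_id` is a hypothetical helper name. It returns `None` rather than raising when the expected element is missing.

```python
from typing import Optional

from bs4 import BeautifulSoup

# Made-up HTML that imitates the structure the question's selectors expect
SAMPLE_HTML = """
<div class="views-field views-field-Pubchem-id">
  <span class="field-content"><h3>PubChem Identifier: 155934</h3></span>
</div>
"""


def extract_pubchem_id(html: str) -> Optional[str]:
    """Return the PubChem ID found in the page, or None if the element is absent."""
    soup = BeautifulSoup(html, "html.parser")
    # One lookup per page is enough: each page holds a single PubChem ID
    tag = soup.select_one("div.views-field-Pubchem-id span.field-content h3")
    if tag is None:
        return None
    return tag.get_text().replace("PubChem Identifier:", "").strip()


print(extract_pubchem_id(SAMPLE_HTML))
print(extract_pubchem_id("<p>no such element</p>"))
```

In the real loop, such a helper would be called with `out.text` only after checking `out.status_code == 200` on the response object.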

Refer to the code below for the complete result.

Code

import requests
import pandas as pd
from bs4 import BeautifulSoup

cids = 'output.txt'

df = pd.read_csv(cids, sep='\t')

# As you have not mentioned the column index of cids, I am assuming it is the first column
chemical_ids = df.iloc[:, 0].astype(str).tolist()

pubchem_id = []

# Loop over the ID values themselves; df.iterrows() yields (index, row) tuples, not bare IDs
for cid in chemical_ids:

    out = requests.get(f'https://cb.imsc.res.in/imppat/Phytochemical-detailedpage-auth/CID%{cid}')

    # Default to None so both lists stay the same length when a lookup fails
    script_data = None

    if out.status_code == 200:

        soup = BeautifulSoup(out.text, "html.parser")

        # select_one returns None instead of raising when the element is missing
        tag = soup.select_one('div.views-field-Pubchem-id span.field-content h3')

        if tag is not None:
            script_data = tag.get_text().replace('PubChem Identifier:', '').strip()

    pubchem_id.append(script_data)

df1 = pd.DataFrame({"chemical_id": chemical_ids, "pubchem_id": pubchem_id})

print(df1)

# uncomment below line to write the dataframe into csv files & replace 'filename' by the complete filepath
# df1.to_csv('filename.csv')
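
If pandas is not needed for anything else, the same side-by-side output can also be produced with the standard library `csv` module. This is an alternative sketch, not the approach in the code above. The pairs are illustrative placeholders in the shape the loop would produce, with `None` standing for a failed lookup, and `io.StringIO` stands in for a real file.

```python
import csv
import io

# Illustrative pairs: (chemical ID, PubChem ID or None when a lookup failed)
pairs = [("3A155934", "155934"), ("3A117235", None)]

# io.StringIO stands in for a real file; on disk, use open('filename.tsv', 'w', newline='')
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerow(["chemical_id", "pubchem_id"])
for chemical_id, pubchem_id in pairs:
    # Write an empty cell for missing results so the columns stay aligned
    writer.writerow([chemical_id, pubchem_id if pubchem_id is not None else ""])

output_text = buffer.getvalue()
print(output_text)
```

Writing one row per input keeps each chemical ID next to its result, even when a lookup returned nothing.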

answered Jul 27 '21 at 12:17

Shivam


Hi, thanks for your input. When I run the above code, I get an AttributeError: ```AttributeError Traceback (most recent call last) in 22 soup = BeautifulSoup(out.text, "html.parser") 23 ---> 24 text = soup.find('div', {'class': 'views-field views-field-Pubchem-id'}).find('span', {'class': 'field-content'}).find('h3').getText() 25 26 print(text) AttributeError: 'NoneType' object has no attribute 'find'``` – Boo Jul 28 '21 at 05:00
But when I tried a single input, I got the output below. If I do the same with a list of inputs in a for loop, I get the AttributeError. ```cids = '3A155934' out = requests.get(f'https://cb.imsc.res.in/imppat/Phytochemical-detailedpage-auth/CID%{cids}') soup = BeautifulSoup(out.text, "html.parser") text = soup.find('div', {'class': 'views-field views-field-Pubchem-id'}).find('span', {'class': 'field-content'}).find('h3') print(text)``` The output was: PubChem Identifier: 155934 – Boo Jul 28 '21 at 05:02
It looks like you are not running my code. Your traceback shows `text = soup.find('div', {'class': 'views-field views-field-Pubchem-id'}).find('span', {'class': 'field-content'}).find('h3').getText()`, but I never used a `text` variable in my code. Also, check the way you are iterating over the loop. – Shivam Jul 28 '21 at 07:03
Hello, I did use your code, which has getText(), if that is what you were asking about; it still gives the same AttributeError for the file list input. I then re-ran it with a single ID and got the output. I tried the same with my own code and also got the output for a single ID. I don't know what I am missing with the ID list input. I even added print(text) to see what it gives for the list input, and it doesn't show any of the search IDs that appear when I give a single ID. I don't know what is going wrong. If possible, could you please explain what exactly is missing in the for loop? – Boo Jul 29 '21 at 09:07
You misunderstood me: I am talking about the `text` variable, not the getText() method. Can you add print(line) just after the for line to see whether you are getting the chemical ID? – Shivam Jul 29 '21 at 13:04
Hi, thank you very much for your help. This was my first question and you helped generously. I got the output I needed, and I thank you wholeheartedly for helping me all the way. Thanks a lot! I look forward to more programming explorations! – Boo Aug 02 '21 at 04:45
Great to know it helped you. Would you mind accepting the answer? – Shivam Aug 02 '21 at 05:36
Yes! Do I need to do something to indicate that your answer helped? – Boo Aug 02 '21 at 06:20
Would you mind helping with this other question too? https://stackoverflow.com/questions/68617001/indexerror-list-index-out-of-range-for-stitch-api-when-i-do-api-using-python – Boo Aug 02 '21 at 06:21
There's a check mark (like a v) on the left side of the answer which you can click on. When it turns green, it means you have accepted that answer. Hope that helps :) – Shivam Aug 02 '21 at 06:51
Sure, I will look at it. – Shivam Aug 02 '21 at 07:00
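
The `print(line)` check suggested in the comments can be reproduced offline. The sketch below assumes the input file has a single `cids` column, as shown in the question, and uses an inline string in place of `output.txt`. It shows that `df.iterrows()` yields `(index, row)` tuples, so formatting `line` straight into the URL does not produce a bare ID, whereas iterating over the column values does.

```python
import io

import pandas as pd

BASE = "https://cb.imsc.res.in/imppat/Phytochemical-detailedpage-auth/"

# Inline stand-in for output.txt (assumed layout: header 'cids', one ID per line)
df = pd.read_csv(io.StringIO("cids\n3A155934\n3A117235\n"), sep="\t")

# iterrows() yields (index, Series) tuples, not the cell values
first = next(df.iterrows())
print(type(first).__name__)

# Formatting the tuple into the URL embeds the index and the Series repr
bad_url = f"{BASE}CID%{first}"
print(bad_url.endswith("CID%3A155934"))

# Iterating over the column gives the bare ID strings
good_urls = [f"{BASE}CID%{cid}" for cid in df["cids"]]
print(good_urls[0])
```

A malformed URL of this kind would plausibly return a page without the expected `div`, which is consistent with the `'NoneType' object has no attribute 'find'` error reported above.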
