
I am scraping a table from https://csr.gov.in/companyprofile.php?year=FY+2015-16&CIN=L00000CH1990PLC010573 but I am not getting the result I am looking for. I want 11 columns from this link: "Company Name", "Class", "State", "Company Type", "RoC", "Sub Category", "Listing Status". Those are 7 columns, and after that there is an expand button "CSR Details of FY 2017-18"; when you click it, you get 4 more columns: "Average Net Profit", "CSR Prescribed Expenditure", "CSR Spent", "Local Area Spent". I want all these columns in a CSV file. I wrote the code below, but it is not working properly. Please help me get this data.

from selenium import webdriver
import pandas as pd
from bs4 import BeautifulSoup
from urllib.request import urlopen
import requests
import csv

driver = webdriver.Chrome()

url_file = "csrdata.txt"

with open(url_file, "r") as url:
    url_pages = url.read()
# we need to split the urls into a list to make them iterable
pages = url_pages.split("\n") # split by lines using \n
data = []

# now we run a for loop to visit the urls one by one
for single_page in pages:
    driver.get(single_page)
    r = requests.get(single_page)
    soup = BeautifulSoup(r.content, 'html5lib')
    driver.find_element_by_link_text("CSR Details of FY 2017-18").click()
    table = driver.find_elements_by_xpath("//*[contains(@id,'colfy4')]")
    about = table.__getitem__(0).text
    x = about.split('\n')

    print(x)

    data.append(x)
    df = pd.DataFrame(data)
    print(df)

    # write to csv
    df.to_csv('csr.csv')


1 Answer


You don't need to use Selenium, since all the information is already in the HTML code. You can also use the pandas built-in function pd.read_html() to transform the HTML table directly into a DataFrame.

import requests
import pandas as pd
from bs4 import BeautifulSoup

data = []
for single_page in pages:                        # `pages` is the url list from the question
    r = requests.get(single_page)
    soup = BeautifulSoup(r.content, 'html5lib')

    table = soup.find_all('table')               # finds all tables on the page
    table_top = pd.read_html(str(table))[0]      # the top table

    try:                                         # try to get the other table if it exists
        table_extra = pd.read_html(str(table))[7]
    except IndexError:                           # some pages have no "CSR Details" table
        table_extra = pd.DataFrame()

    result = pd.concat([table_top, table_extra])
    data.append(result)

pd.concat(data).to_csv('test.csv')

output:

                            0                          1
0                       Class                     Public
1                       State                 Chandigarh
2                Company Type           Other than Govt.
3                         RoC             RoC-Chandigarh
4                Sub Category  Company limited by shares
5              Listing Status                     Listed
0          Average Net Profit                          0
1  CSR Prescribed Expenditure                          0
2                   CSR Spent                          0
3            Local Area Spent                          0
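
If the eleven fields are wanted as named columns in a single row per company (as the question asks) rather than stacked label/value pairs, the per-page frames can be pivoted before saving. A minimal sketch, assuming `data` holds the two-column DataFrames built in the loop above; the file name 'csr_wide.csv' is illustrative:

rows = []
for df in data:
    df = df.copy()
    df.columns = ['field', 'value']              # label/value pairs from read_html
    rows.append(df.set_index('field')['value'])  # one wide row per company
pd.DataFrame(rows).to_csv('csr_wide.csv', index=False)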
  • getting error on 5th link, because there is no table in "CSR Details of FY 2017-18": table_extra = pd.read_html(str(table))[7] raises IndexError: list index out of range – Sumit Jha Jul 17 '20 at 18:07
  • how can I set an exception for when there is no table found in "CSR Details of FY 2017-18", like in this link "https://csr.gov.in/companyprofile.php?year=FY+2017-18&CIN=L01110GJ1991PLC015846"? please help – Sumit Jha Jul 17 '20 at 18:08
  • You can use "try" and "except". I edited my post above. – Jonas Jul 17 '20 at 18:09
  • I am very thankful to you for the help. I have one more issue: the code is running very well and the output is printing, but I am not getting the output csv file. I tried a lot but the csv file never appears in the destination folder. – Sumit Jha Jul 18 '20 at 14:01
  • Do you want to save the DataFrame for each page in a unique csv file? – Jonas Jul 18 '20 at 14:04
  • I need all page results in a single csv file. – Sumit Jha Jul 18 '20 at 14:11
  • In this case you can save the DataFrame of each page into a list and then concat this list to csv. I edited my post above. In the line pd.concat(data).to_csv('test.csv') you can change the path where it shall be saved (change 'test.csv' to the path you want). – Jonas Jul 18 '20 at 14:23
  • Thanks a lot for your help. Can you please suggest a better platform to learn web scraping with Python? Once again, thank you for the help. – Sumit Jha Jul 18 '20 at 21:20
  • Happy to help. Hard to suggest something, because I don't know how you learn best, but I really like the YouTube channel "TechWithTim". He has a video on web scraping with Python and a lot of other Python tutorials. Maybe this helps you, too. – Jonas Jul 18 '20 at 21:27
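
For completeness, a minimal sketch of the per-page alternative raised in the comments (one CSV per company page instead of a single combined file); the company_{i}.csv naming is hypothetical:

for i, df in enumerate(data):        # `data` as built in the answer's loop
    df.to_csv(f'company_{i}.csv')    # hypothetical per-page file name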