
I have been trying to extract a table, but my code retrieves only the table's headings. This is my first approach:

url = r"https://www.sec.gov/edgar/search/#/q=Women&dateRange=custom&entityName=Infosys&startdt=2010-03-01&enddt=2020-03-01"

r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
table = soup.find_all("table")[1]

#Extracting heading of the columns of the table.

rows = table.find_all('tr')
columns=[]
headings = rows[0].find_all('th')
for col in headings:
    columns.append(col.text.strip())
print(columns)

#Extracting all data of the table row wise.

all_data=[]
for row in rows[1:]:
    data = row.find_all('td')
    lst=[]
    for d in data:
        lst.append(d.text.strip())
    all_data.append(lst)

 #Creating the dataframe out of the extracted data.

ds = pd.DataFrame(all_data, columns=columns)
ds

Second way:

ds1 = pd.read_html(url)[0]
ds1

When I search for the table, I get all the column headings in the thead tag, but the tbody is empty.

table = soup.find_all("table", class_='table')
table

Output:

 [<table class="table table-hover entity-hints" id="asdf"></table>,
 <table class="table">
 <thead>
 <tr>
 <th class="filetype" id="filetype">Form &amp; File</th>
 <th class="filed">Filed</th>
 <th class="enddate">Reporting for</th>
 <th class="entity-name">Filing entity/person</th>
 <th class="cik">CIK</th>
 <th class="located">Located</th>
 <th class="incorporated">Incorporated</th>
 <th class="file-num">File number</th>
 <th class="film-num">Film number</th>
 </tr>
 </thead>
 <tbody>
 </tbody>
 </table>]

Why is the tbody tag empty?

Screenshot of the table:

[screenshot of the table as rendered in the browser]

Prakriti Shaurya
  • Does this answer your question? [Web scraping program cannot find element which I can see in the browser](https://stackoverflow.com/questions/60904786/web-scraping-program-cannot-find-element-which-i-can-see-in-the-browser) – AMC Nov 09 '20 at 23:21
  • Does this answer your question? [How to scrape dynamic webpages by Python](https://stackoverflow.com/questions/33795799/how-to-scrape-dynamic-webpages-by-python) – Sabito stands with Ukraine Nov 10 '20 at 00:46

2 Answers


The table is loaded by sending a POST request to https://efts.sec.gov/LATEST/search-index. You can scrape the data as follows:

import json
import requests
from bs4 import BeautifulSoup


URL = "https://efts.sec.gov/LATEST/search-index"
data = {
    "q": "Women",
    "dateRange": "custom",
    "entityName": "Infosys",
    "startdt": "2010-03-01",
    "enddt": "2020-03-01",
}

# The search parameters go in the POST body as JSON; the response itself is JSON.
soup = BeautifulSoup(requests.post(URL, data=json.dumps(data)).content, "html.parser")
json_data = json.loads(str(soup))

fmt_string = "{:<25} {:<20} {:<20} {:<20}"
print(
    fmt_string.format("Form & File", "Filed", "Reporting for", "Filing/entity person")
)
print("-" * 100)

# Each hit describes one filing; the fields we need live under "_source".
for hit in json_data["hits"]["hits"]:
    form = hit["_source"]["root_form"] + hit["_source"]["file_type"]
    filed = hit["_source"]["file_date"]
    reporting_for = hit["_source"]["period_ending"]
    entity = hit["_source"]["display_names"][0].split("(CIK")[0]

    print(fmt_string.format(form, filed, reporting_for, entity))

Output:

Form & File               Filed                Reporting for        Filing/entity person
----------------------------------------------------------------------------------------------------
6-KEX-99.1 CHARTER        2016-01-14           2015-12-31           Infosys Ltd  (INFY)  
6-KEX-99.3 VOTING TRUST   2016-07-20           2016-06-30           Infosys Ltd  (INFY)  
6-KEX-99.1 CHARTER        2014-01-15           2013-12-31           Infosys Ltd  (INFY)  
6-KEX-99.1                2014-01-10           2013-12-31           Infosys Ltd  (INFY)  
6-KEX-99.1 CHARTER        2019-10-11           2019-09-30           Infosys Ltd  (INFY)  
6-KEX-99.2 BYLAWS         2019-10-16           2019-09-30           Infosys Ltd  (INFY)  
20-F20-F                  2016-05-18           2016-03-31           Infosys Ltd  (INFY)  
6-KEX-99.2                2016-01-19           2015-12-31           Infosys Ltd  (INFY)  
20-F20-F                  2019-06-19           2019-03-31           Infosys Ltd  (INFY)  
6-KEX-99.1 CHARTER        2013-12-20           2013-12-20           Infosys Ltd  (INFY)  
20-F20-F                  2017-06-12           2017-03-31           Infosys Ltd  (INFY)  
20-F20-F                  2014-05-09           2014-03-31           Infosys Ltd  (INFY)  
6-KEX-99.2 BYLAWS         2014-01-15           2013-12-31           Infosys Ltd  (INFY)  
6-KEX-99.1 CHARTER        2019-10-16           2019-09-30           Infosys Ltd  (INFY)  
20-F20-F                  2018-07-19           2018-03-31           Infosys Ltd  (INFY)  
6-K6-K                    2013-12-20           2013-12-20           Infosys Ltd  (INFY)  
6-KEX-99.1                2016-01-19           2015-12-31           Infosys Ltd  (INFY)  
6-K6-K                    2014-03-28           2014-03-28           Infosys Ltd  (INFY)  
20-F20-F                  2015-05-20           2015-03-31           Infosys Ltd  (INFY)  
6-KEX-99.3 VOTING TRUST   2010-07-16           2010-06-30           INFOSYS TECHNOLOGIES LTD  (INFY)  
MendelG

Just to add on: the above answer is correct, except that you can skip BeautifulSoup entirely. Set the headers yourself and use json= instead of data= in your requests.post, then load the response JSON into a dictionary.
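For example, a minimal sketch of that simplification (the User-Agent string below is a placeholder to replace with your own contact details, and the search parameters are the same ones used in the answer above):

import requests

URL = "https://efts.sec.gov/LATEST/search-index"
payload = {
    "q": "Women",
    "dateRange": "custom",
    "entityName": "Infosys",
    "startdt": "2010-03-01",
    "enddt": "2020-03-01",
}
# Placeholder User-Agent -- identify yourself when hitting SEC endpoints.
headers = {"User-Agent": "your-name your-email@example.com"}

# json= serializes the dict and sets the Content-Type header for you;
# .json() parses the response body into a dict, so BeautifulSoup is not needed.
response = requests.post(URL, json=payload, headers=headers)
json_data = response.json()

for hit in json_data["hits"]["hits"]:
    print(hit["_source"]["file_date"], hit["_source"]["display_names"][0])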

And yes, building the link is quite easy given the information in each "hits" dictionary returned per entry:

  • ["_id"] contains the accession number along with the specific document filename
  • ["_source"]["ciks"] list of CIK numbers of the filers of the document. Should be able to build a path with either
  • ["_source"]["adsh"] is the accession number again, by itself

Putting these three together is quite simple: CIK + accession number + filename, according to the format described under "Paths and directory structure" and "Directory browsing" at https://www.sec.gov/os/accessing-edgar-data.
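As a rough sketch of that assembly (this assumes the "_id" value is formatted as "accession-number:filename", which is how it appeared at the time; verify against the directory-structure docs linked above):

def build_filing_url(hit):
    # "_id" looks like "0001234567-20-000123:document.htm": accession number, then filename.
    accession, filename = hit["_id"].split(":")
    # Any filer CIK should work; the archive path uses it without leading zeros.
    cik = hit["_source"]["ciks"][0].lstrip("0")
    # The directory name is the accession number with the dashes removed.
    accession_nodash = accession.replace("-", "")
    return f"https://www.sec.gov/Archives/edgar/data/{cik}/{accession_nodash}/{filename}"

# e.g. build_filing_url(json_data["hits"]["hits"][0])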

Keep in mind that only 100 results are returned at a time, so you will have to send multiple requests if there are more total results than that and you want them all. You will have to figure out your own loop logic, but the key mechanism is another parameter to include in your requests: "from=RESULT_NUM_TO_START_FROM".
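A possible shape for that loop, reusing URL, payload and headers from the sketch above (the "from" offset and the empty-page stop condition are assumptions to adapt to your own needs):

all_hits = []
start = 0
while True:
    # Ask for the next page of results by offsetting "from".
    page = requests.post(URL, json={**payload, "from": start}, headers=headers).json()
    hits = page["hits"]["hits"]
    if not hits:
        break
    all_hits.extend(hits)
    start += len(hits)  # the endpoint caps each response (100 results per request here)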

Play around with the network tab of Chrome's inspector as you change pages on the results page to see what's going on. There's more to scraping the full-text search endpoint than the others, but you can make some cool queries.