
I am trying to scrape SEC EDGAR to get all 10-K filing links for a company selected by input. The program loops through each quarter (QTR1-QTR4) of each year from 1993 until now. I got the code from https://codingandfun.com/scraping-sec-edgar-python/. When executing it I run into:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 13013584: invalid continuation byte

Without the for loop over the years, with a fixed year/quarter, it works, so what is the problem here?

import bs4 as bs
import requests
import pandas as pd
import re
from datetime import datetime

def get_base():
    company = input('Which company?: ')
    filing = '10-K'
    year = [*range(1993,datetime.now().year + 1)]
    quarter = ['QTR1','QTR2','QTR3','QTR4']
    #get all filings for each quarter(QTR1-4) in each year(beginning 1993 until actual year)
    for x in year:
        for y in quarter:
            download = requests.get(f'https://www.sec.gov/Archives/edgar/full-index/{x}/{y}/master.idx').content
            download = download.decode("utf-8").split('\n')
            for item in download:
                #company name and report type
                if (company in item) and (filing in item): 
                   
                    company = item
                    company = company.strip()
                    splitted_company = company.split('|')
                    url = splitted_company[-1]
                    
                    #build second part of the url
                    url2 = url.split('-') 
                    url2 = url2[0] + url2[1] + url2[2]
                    url2 = url2.split('.txt')[0] 

                    # build third part of the url
                    to_get_html_site = 'https://www.sec.gov/Archives/' + url
                    data = requests.get(to_get_html_site).content
                    data = data.decode("utf-8") 
                    data = data.split('FILENAME>')
                    data = data[1].split('\n')[0]

                    #combine
                    url_to_use = 'https://www.sec.gov/Archives/'+ url2 + '/'+data
                    print(url_to_use)
                    

get_base()
  • 10-Ks are annual filings. If you are looking for those, why are you cycling through quarters? – Jack Fleeting Apr 05 '21 at 12:48
  • Mostly because I don't know the quarter in which the files were published. It's either Q4 or Q1. Later on I also want to use the quarterly data. –  Apr 05 '21 at 13:17
  • You don't need to know the quarter - there's only one 10-K a year. – Jack Fleeting Apr 05 '21 at 14:18
  • Yes, I understand, but I also want to get 10-Qs later on, and I cannot search by year alone; you need year and quarter to query the master file. Also, some companies file their 2020 10-K in Q4 2020 and some in Q1 2021. –  Apr 05 '21 at 15:38
  • I don't know if you are aware of it, but some EDGAR API wrappers already exist for Python (https://github.com/edgarminers/python-edgar, https://github.com/joeyism/py-edgar ...). Or maybe it's for practice? – ce.teuf Apr 05 '21 at 15:53
  • Yes, thank you, I've installed both of them, but at least the edgar library is not working for me and is throwing an ImportError, so I could not test it. I really need all of the available data for 10-K and 10-Q filings. –  Apr 05 '21 at 16:30
  • OK, the problem was that both packages are imported as edgar. I deleted python-edgar. –  Apr 05 '21 at 16:36
  • OK, this package is not a solution. I want to go more specific, by year, and you can also access single tables for each file by changing URL components. See this video on YouTube: https://www.youtube.com/watch?v=4zE9HjPIqC4&list=RDCMUCBsTB02yO0QGwtlfiv5m25Q&index=1 –  Apr 05 '21 at 16:42

1 Answer


You need to specify request headers:

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:88.0) Gecko/20100101 Firefox/88.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8"
}

download = requests.get(f'https://www.sec.gov/Archives/edgar/full-index/{x}/{y}/master.idx', headers=headers).text

Also change .content to .text at the end of that line, so that requests handles the decoding for you. To split the downloaded index file into rows and filter by form type, consider something like:

data = [row.split('|') for row in download.split("\n") if '|' in row]
data_10K = [row for row in data if row[2] == "10-K"]
data_10Q = [row for row in data if row[2] == "10-Q"]

I hope this unblocks you in your work, but I don't think it is an efficient way to do what you are trying to do (it can be slow, and you need to know the exact name of the company...).
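Putting the pieces together, a minimal sketch of the corrected loop could look like the following. The iter_filings helper, the example company name, and the User-Agent string are illustrative choices rather than part of the original code, and the SEC asks that your User-Agent identify you, so adapt it:

import requests
from datetime import datetime

# Illustrative headers; the SEC asks for a User-Agent that identifies you,
# so replace this with your own name/contact details.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:88.0) Gecko/20100101 Firefox/88.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
}

def iter_filings(company, form_type, start_year=1993):
    """Yield (cik, name, form, date, path) rows from the quarterly master indexes."""
    for year in range(start_year, datetime.now().year + 1):
        for qtr in ("QTR1", "QTR2", "QTR3", "QTR4"):
            url = f"https://www.sec.gov/Archives/edgar/full-index/{year}/{qtr}/master.idx"
            resp = requests.get(url, headers=headers)
            if resp.status_code != 200:
                continue  # e.g. a quarter that does not exist yet
            # .text lets requests pick the declared/apparent encoding instead
            # of forcing UTF-8 onto bytes that may not be valid UTF-8.
            rows = [line.split("|") for line in resp.text.split("\n") if "|" in line]
            for row in rows:
                if len(row) == 5 and company in row[1] and row[2] == form_type:
                    yield row

# Example: print the index path of every 10-K whose company name contains "APPLE INC"
for cik, name, form, date, path in iter_filings("APPLE INC", "10-K"):
    print(date, form, "https://www.sec.gov/Archives/" + path)

This still downloads roughly 120 quarterly index files per run, so expect it to take a while; caching the indexes locally (as packages like python-edgar mentioned above do) would speed up repeated runs.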

ce.teuf
  • Thank you very much for your advice. The biggest problem is the speed of it, that's right! I didn't really find a good/nice way to scrape the SEC. I'm working on other solutions too, but I've only been learning Python for 3 months, so it's not too easy for me. I want to build a Dash dashboard later on where you can search for a company in a suggestion-based search bar, so the company name or CIK wouldn't be a problem. If you have any ideas, feel free to contact me. This is something I see as a big project to learn by doing. –  Apr 05 '21 at 19:32
  • If your entry point is the company name, you should use the suggestion tool that the SEC offers: https://www.sec.gov/edgar/searchedgar/companysearch.html. You need to learn how to use your browser's developer tools (F12 -> Network -> etc.). It is indeed an interesting project for learning how to handle economic, quantitative and textual data, as well as web API notions. Have fun ;) – ce.teuf Apr 05 '21 at 22:36
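As a rough illustration of that suggestion, a name-to-CIK lookup might look like the sketch below. The endpoint and field names are assumptions based on EDGAR's public company/ticker mapping, so verify them against the page linked above:

import requests

# Assumed endpoint: EDGAR's public company/ticker mapping.
# Replace the User-Agent with your own identifying string.
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
tickers = requests.get("https://www.sec.gov/files/company_tickers.json",
                       headers=headers).json()

def suggest(fragment):
    """Return (ticker, title, cik) tuples whose company title contains the fragment."""
    fragment = fragment.upper()
    return [(v["ticker"], v["title"], v["cik_str"])
            for v in tickers.values()
            if fragment in v["title"].upper()]

print(suggest("apple"))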