
I am trying to scrape SEC EDGAR to get all 10-K filing links for a company selected by input. The program loops through each quarter (QTR1-QTR4) of each year from 1993 until now. I got the code from https://codingandfun.com/scraping-sec-edgar-python/. When executing it I run into:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 13013584: invalid continuation byte

Without the for loop over the years, with a fixed year/quarter, it works, so what is the problem here?

import bs4 as bs
import requests
import pandas as pd
import re
from datetime import datetime

def get_base():
    company = input('Which company?: ')
    filing = '10-K'
    year = [*range(1993,datetime.now().year + 1)]
    quarter = ['QTR1','QTR2','QTR3','QTR4']
    #get all filings for each quarter(QTR1-4) in each year(beginning 1993 until actual year)
    for x in year:
        for y in quarter:
            download = requests.get(f'https://www.sec.gov/Archives/edgar/full-index/{x}/{y}/master.idx').content
            download = download.decode("utf-8").split('\n')
            for item in download:
                #company name and report type
                if (company in item) and (filing in item): 
                   
                    company = item
                    company = company.strip()
                    splitted_company = company.split('|')
                    url = splitted_company[-1]
                    
                    #build second part of the url
                    url2 = url.split('-') 
                    url2 = url2[0] + url2[1] + url2[2]
                    url2 = url2.split('.txt')[0] 

                    # build third part of the url
                    to_get_html_site = 'https://www.sec.gov/Archives/' + url
                    data = requests.get(to_get_html_site).content
                    data = data.decode("utf-8") 
                    data = data.split('FILENAME>')
                    data = data[1].split('\n')[0]

                    #combine
                    url_to_use = 'https://www.sec.gov/Archives/'+ url2 + '/'+data
                    print(url_to_use)
                    

get_base()
  • 10-Ks are annual filings. If you are looking for those, why are you cycling through quarters? – Jack Fleeting Apr 05 '21 at 12:48
  • Mostly because I don't know the quarter in which the files were published. It's either Q4 or Q1. Later on I also want to use the quarterly data. –  Apr 05 '21 at 13:17
  • You don't need to know the quarter - there's only one 10-K a year. – Jack Fleeting Apr 05 '21 at 14:18
  • Yes, I understand, but I also want to get 10-Qs later on, and I cannot search by year alone; you need year and quarter to query the master file. Also, some companies file their 2020 10-K in Q4 2020 and some in Q1 2021. –  Apr 05 '21 at 15:38
  • I don't know if you are aware of it, but some EDGAR API wrappers already exist for Python (https://github.com/edgarminers/python-edgar, https://github.com/joeyism/py-edgar ...). Or maybe it's for practice? – ce.teuf Apr 05 '21 at 15:53
  • Yes, thank you, I've installed both of them, but at least the edgar library is not working for me and is throwing an ImportError, so I could not test it. I really need all of the available data for 10-K and 10-Q filings. –  Apr 05 '21 at 16:30
  • OK, the problem was that both packages are imported as edgar. I deleted python-edgar. –  Apr 05 '21 at 16:36
  • OK, this package is not a solution. I want to go more specific, by year, and you can also access single tables for each file by changing URL components. See this video on YouTube: https://www.youtube.com/watch?v=4zE9HjPIqC4&list=RDCMUCBsTB02yO0QGwtlfiv5m25Q&index=1 –  Apr 05 '21 at 16:42

1 Answer


You need to specify request headers:

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:88.0) Gecko/20100101 Firefox/88.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8"
}

download = requests.get(f'https://www.sec.gov/Archives/edgar/full-index/{x}/{y}/master.idx', headers=headers).text

Also change .content to .text at the end of that line, so that requests handles the decoding for you. To split the downloaded index file into rows and filter by form type, consider something like:

data = [row.split('|') for row in download.split("\n") if '|' in row]
data_10K = [row for row in data if row[2] == "10-K"]
data_10Q = [row for row in data if row[2] == "10-Q"]

I hope this unblocks you in your work, but I don't think it is an efficient way to do what you are trying to do (it can be slow, and you need to know the exact name of the company...).
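Putting the pieces together, a minimal sketch of the corrected loop could look like the following. The iter_filings helper, the example company name, and the User-Agent string are illustrative choices rather than part of the original code, and the SEC asks that your User-Agent identify you, so adapt it:

import requests
from datetime import datetime

# Illustrative headers; the SEC asks for a User-Agent that identifies you,
# so replace this with your own name/contact details.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:88.0) Gecko/20100101 Firefox/88.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
}

def iter_filings(company, form_type, start_year=1993):
    """Yield (cik, name, form, date, path) rows from the quarterly master indexes."""
    for year in range(start_year, datetime.now().year + 1):
        for qtr in ("QTR1", "QTR2", "QTR3", "QTR4"):
            url = f"https://www.sec.gov/Archives/edgar/full-index/{year}/{qtr}/master.idx"
            resp = requests.get(url, headers=headers)
            if resp.status_code != 200:
                continue  # e.g. a quarter that does not exist yet
            # .text lets requests pick the declared/apparent encoding instead
            # of forcing UTF-8 onto bytes that may not be valid UTF-8.
            rows = [line.split("|") for line in resp.text.split("\n") if "|" in line]
            for row in rows:
                if len(row) == 5 and company in row[1] and row[2] == form_type:
                    yield row

# Example: print the index path of every 10-K whose company name contains "APPLE INC"
for cik, name, form, date, path in iter_filings("APPLE INC", "10-K"):
    print(date, form, "https://www.sec.gov/Archives/" + path)

This still downloads roughly 120 quarterly index files per run, so expect it to take a while; caching the indexes locally (as packages like python-edgar mentioned above do) would speed up repeated runs.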

ce.teuf
  • Thank you very much for your advice. The biggest problem is the speed of it, that's right! I didn't really find a good/nice way to scrape the SEC. I'm working on other solutions too, but I've only been learning Python for 3 months, so it's not too easy for me. I want to build a Dash dashboard later on where you can search for a company in a suggestion-based search bar, so the company name or CIK wouldn't be a problem. If you have any ideas, feel free to contact me. This is something I see as a big project to learn by doing. –  Apr 05 '21 at 19:32
  • If your entry point is the company name, you should use the suggestion tool that the SEC offers: https://www.sec.gov/edgar/searchedgar/companysearch.html. You need to learn how to use your browser's developer tools (F12 -> Network -> etc.). It is indeed an interesting project for learning how to handle economic, quantitative and textual data, as well as web API notions. Have fun ;) – ce.teuf Apr 05 '21 at 22:36
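As a rough illustration of that suggestion, a name-to-CIK lookup might look like the sketch below. The endpoint and field names are assumptions based on EDGAR's public company/ticker mapping, so verify them against the page linked above:

import requests

# Assumed endpoint: EDGAR's public company/ticker mapping.
# Replace the User-Agent with your own identifying string.
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
tickers = requests.get("https://www.sec.gov/files/company_tickers.json",
                       headers=headers).json()

def suggest(fragment):
    """Return (ticker, title, cik) tuples whose company title contains the fragment."""
    fragment = fragment.upper()
    return [(v["ticker"], v["title"], v["cik_str"])
            for v in tickers.values()
            if fragment in v["title"].upper()]

print(suggest("apple"))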