- I think it will be easier and faster to use
pandas
with vectorized functions
- This is 5 lines of code to get all the counts and it's fast.
- Once the xml file counts are available and the paths to all the
.xml
files, consider looking at How to convert an XML file to nice pandas dataframe? to automate processing of those files.
import pandas as pd
# list to index.json for Archives
paths = ['https://www.sec.gov/Archives/edgar/data/1736260/000119312515118890/index.json',
'https://www.sec.gov/Archives/edgar/data/1736260/000173626020000004/index.json',
'https://www.sec.gov/Archives/edgar/data/51143/000104746917001061/index.json']
# download and each json and join it into a single dataframe
# reset the index, so each row has a unique index number
df = pd.concat([pd.read_json(path, orient='index') for path in paths]).reset_index()
# item is a list of dictionaries that can be exploded to separate columns
dfe = df.explode('item').reset_index(drop=True)
# each dictionary now has a separate row
# normalize the dicts, so each key is a column name and each value is in the row
# rename 'name' to 'item_name', this is the column containing file names like .xml
# join this back to the main dataframe and drop the item row
dfj = dfe.join(pd.json_normalize(dfe.item).rename(columns={'name': 'item_name'})).drop(columns=['item'])
# find the rows with .xml in item_name
# groupby name, which is the archive path with CIK and Accession Number
# count the number of xml files
dfg = dfj.item_name[dfj.item_name.str.contains('.xml', case=False)].groupby(dfj.name).count().reset_index().rename(columns={'item_name': 'xml_count'})
# display(dfg)
name xml_count
0 /Archives/edgar/data/1736260/000173626020000004 2
1 /Archives/edgar/data/51143/000104746917001061 6
- print a dataframe with all the xml file names with the corresponding index in the dataframe
print(dfj[['name', 'item_name']][dfj.item_name.str.contains('.xml')].reset_index())
[out]:
index name item_name
0 43 /Archives/edgar/data/1736260/000173626020000004 cpia2ndqtr202013fhr.xml
1 44 /Archives/edgar/data/1736260/000173626020000004 primary_doc.xml
2 66 /Archives/edgar/data/51143/000104746917001061 FilingSummary.xml
3 74 /Archives/edgar/data/51143/000104746917001061 ibm-20161231.xml
4 76 /Archives/edgar/data/51143/000104746917001061 ibm-20161231_cal.xml
5 77 /Archives/edgar/data/51143/000104746917001061 ibm-20161231_def.xml
6 78 /Archives/edgar/data/51143/000104746917001061 ibm-20161231_lab.xml
7 79 /Archives/edgar/data/51143/000104746917001061 ibm-20161231_pre.xml
- create a dataframe with just the xml files and add a column with a full path to those files
xml_files = dfj[dfj.item_name.str.contains('.xml', case=False)].copy()
# add a column that creates a full path to the xml files
xml_files['file_path'] = xml_files[['name', 'item_name']].apply(lambda x: f'https://www.sec.gov{x[0]}/{x[1]}', axis=1)
# disply(xml_files)
index name parent-dir last-modified item_name type size file_path
43 directory /Archives/edgar/data/1736260/000173626020000004 /Archives/edgar/data/1736260 2020-07-24 09:38:30 cpia2ndqtr202013fhr.xml text.gif 72804 https://www.sec.gov/Archives/edgar/data/1736260/000173626020000004/cpia2ndqtr202013fhr.xml
44 directory /Archives/edgar/data/1736260/000173626020000004 /Archives/edgar/data/1736260 2020-07-24 09:38:30 primary_doc.xml text.gif 1931 https://www.sec.gov/Archives/edgar/data/1736260/000173626020000004/primary_doc.xml
66 directory /Archives/edgar/data/51143/000104746917001061 /Archives/edgar/data/51143 2017-02-28 16:23:36 FilingSummary.xml text.gif 91940 https://www.sec.gov/Archives/edgar/data/51143/000104746917001061/FilingSummary.xml
74 directory /Archives/edgar/data/51143/000104746917001061 /Archives/edgar/data/51143 2017-02-28 16:23:36 ibm-20161231.xml text.gif 11684003 https://www.sec.gov/Archives/edgar/data/51143/000104746917001061/ibm-20161231.xml
76 directory /Archives/edgar/data/51143/000104746917001061 /Archives/edgar/data/51143 2017-02-28 16:23:36 ibm-20161231_cal.xml text.gif 185502 https://www.sec.gov/Archives/edgar/data/51143/000104746917001061/ibm-20161231_cal.xml
77 directory /Archives/edgar/data/51143/000104746917001061 /Archives/edgar/data/51143 2017-02-28 16:23:36 ibm-20161231_def.xml text.gif 801568 https://www.sec.gov/Archives/edgar/data/51143/000104746917001061/ibm-20161231_def.xml
78 directory /Archives/edgar/data/51143/000104746917001061 /Archives/edgar/data/51143 2017-02-28 16:23:36 ibm-20161231_lab.xml text.gif 1356108 https://www.sec.gov/Archives/edgar/data/51143/000104746917001061/ibm-20161231_lab.xml
79 directory /Archives/edgar/data/51143/000104746917001061 /Archives/edgar/data/51143 2017-02-28 16:23:36 ibm-20161231_pre.xml text.gif 1314064 https://www.sec.gov/Archives/edgar/data/51143/000104746917001061/ibm-20161231_pre.xml
# create a list of just the file paths
path_to_xml_files = xml_files.file_path.tolist()
print(path_to_xml_files)
[out]:
['https://www.sec.gov/Archives/edgar/data/1736260/000173626020000004/cpia2ndqtr202013fhr.xml',
'https://www.sec.gov/Archives/edgar/data/1736260/000173626020000004/primary_doc.xml',
'https://www.sec.gov/Archives/edgar/data/51143/000104746917001061/FilingSummary.xml',
'https://www.sec.gov/Archives/edgar/data/51143/000104746917001061/ibm-20161231.xml',
'https://www.sec.gov/Archives/edgar/data/51143/000104746917001061/ibm-20161231_cal.xml',
'https://www.sec.gov/Archives/edgar/data/51143/000104746917001061/ibm-20161231_def.xml',
'https://www.sec.gov/Archives/edgar/data/51143/000104746917001061/ibm-20161231_lab.xml',
'https://www.sec.gov/Archives/edgar/data/51143/000104746917001061/ibm-20161231_pre.xml']