Scraping Item 1A (Risk Factors) from 10-K files

Question

I am trying to extract the Item 1A (Risk Factors) section from each 10-K file. I have already downloaded the files and saved them as .txt files:

```
'/content/drive/My Drive/Colab Notebooks/10/BKR/1.txt'
'/content/drive/My Drive/Colab Notebooks/10/BKR/2.txt'
```

That is, folder 10 contains several subfolders (like BKR), and each subfolder contains several 10-K filings saved as .txt files.
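To make the layout concrete, here is a minimal sketch of how the whole tree could be enumerated, assuming every subfolder of the top-level folder is a ticker and every .txt file inside it is one filing. The helper name `collect_filings` is hypothetical, not something from my notebook.

```python
from pathlib import Path


def collect_filings(root):
    """Map each subfolder name (e.g. 'BKR') to its sorted list of .txt paths.

    Hypothetical helper: assumes one subfolder per ticker directly under root.
    """
    filings = {}
    for sub in sorted(Path(root).iterdir()):
        if sub.is_dir():  # ignore stray files sitting next to the subfolders
            filings[sub.name] = sorted(sub.glob("*.txt"))
    return filings
```

Called as `collect_filings('/content/drive/My Drive/Colab Notebooks/10')`, this would return a dict keyed by subfolder name, so every ticker could be processed in one loop instead of hard-coding a single folder.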

I tried the code below to get the 1A Risk Factors section, but it failed. I would be happy if you could share your opinions.

```
import re
import os, os.path

PATH = '/content/drive/My Drive/Colab Notebooks/10/BKR'

conclusions = []
for file in os.listdir(path):
    with open(os.path.join(PATH, file)) as f:
        data = f.read()

    conclusion = re.search('1a: (.*?)([A-Z]{2,})', data).group(1)
    conclusions.append(conclusion)
```

The error message I got:

```
---------------------------------------------------------------------------
NotADirectoryError                        Traceback (most recent call last)
<ipython-input-12-051ca10fbeb3> in <module>()
      5 
      6 conclusions = []
----> 7 for file in os.listdir(path):
      8     with open(os.path.join(PATH, file)) as f:
      9         data = f.read()

NotADirectoryError: [Errno 20] Not a directory: '/content/drive/My Drive/Colab Notebooks/10/APA/1.txt'
```
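The traceback suggests that `os.listdir` received the path of a single file (`.../APA/1.txt`) rather than a folder, presumably through a lowercase `path` variable left over from an earlier notebook cell (that part is an assumption). A minimal sketch of a loop that checks this up front, with the hypothetical helper name `read_filings`:

```python
import os


def read_filings(folder):
    """Return {file name: file contents} for every .txt file in folder.

    Hypothetical helper: fails early with a readable message when a file
    path is passed in by mistake instead of a folder.
    """
    if not os.path.isdir(folder):
        raise ValueError(f"expected a folder, got: {folder!r}")
    texts = {}
    for name in sorted(os.listdir(folder)):
        full = os.path.join(folder, name)
        if os.path.isfile(full) and name.endswith(".txt"):
            # errors="replace" is a guess at how to tolerate odd bytes in filings
            with open(full, encoding="utf-8", errors="replace") as f:
                texts[name] = f.read()
    return texts
```

Using one variable name consistently (here the `folder` argument) for both `os.listdir` and `os.path.join` avoids the mismatch between `path` and `PATH`.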

It seems like you use lowercase `path` rather than uppercase `PATH` in `for file in os.listdir(path):`. Should it not be `for file in os.listdir(PATH):`? — Je Je, Jun 04 '20 at 17:38
Thanks Nono. I changed it. Now the error I get is: `--------------------------------------------------------------------------- AttributeError Traceback (most recent call last) in () 9 data = f.read() 10 ---> 11 conclusion = re.search('1A: (.*?)([A-Z]{2,})', data).group(1) 12 conclusions.append(conclusion) AttributeError: 'NoneType' object has no attribute 'group'` — patach, Jun 04 '20 at 17:42
Maybe update your question and provide a URL where we can get a file with the 1A Risk Factors section to search in. — Je Je, Jun 04 '20 at 17:44
Here are the risk factors: https://www.sec.gov/Archives/edgar/data/789019/000156459019027952/msft-10k_20190630.htm#ITEM_1A_RISK_FACTORS — patach, Jun 04 '20 at 17:50
So the document you saved as txt is in fact HTML, right? Would you consider using BeautifulSoup, or does it need to be done with re? — Je Je, Jun 04 '20 at 17:56
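Following up on the comments above, here is a rough sketch of one possible approach, not a tested solution for real filings. It strips the HTML with the standard library's `html.parser` (used here as a stand-in for BeautifulSoup, so that nothing needs installing), then searches for the text between an "Item 1A ... Risk Factors" heading and the next "Item 1B" or "Item 2" heading. The names `_TextExtractor`, `ITEM_1A` and `extract_risk_factors` are hypothetical, and the heading pattern is an assumption about how 10-K filings are usually laid out. Since the table of contents usually repeats the heading, the longest match is kept. When nothing matches, the function returns `None` instead of calling `.group(1)` on a failed search, which is what raised the `AttributeError`.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Collapse all whitespace runs (including non-breaking spaces) to one space.
        return re.sub(r"\s+", " ", " ".join(self._chunks)).strip()


# Assumed heading layout: "Item 1A. Risk Factors" ... up to "Item 1B." or "Item 2."
ITEM_1A = re.compile(
    r"item\s*1a\.?\s*risk\s+factors(.*?)(?=item\s*1b\.?\s|item\s*2\.?\s)",
    re.IGNORECASE | re.DOTALL,
)


def extract_risk_factors(raw):
    """Return the longest Item 1A candidate in raw HTML/text, or None."""
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()
    text = parser.text()
    candidates = [m.group(1).strip() for m in ITEM_1A.finditer(text)]
    # The table-of-contents entry is short; the real section should be longest.
    return max(candidates, key=len) if candidates else None
```

One caveat: joining text chunks with spaces can split a word whose letters are wrapped in separate inline tags, so headings styled that way would be missed by this pattern.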
