-1

I want to download this file to my local drive: https://www.sec.gov/Archives/edgar/data/1556179/0001104659-20-000861.txt

Here is my code:

import requests
import urllib
from bs4 import BeautifulSoup
import re
  
path=r"https://www.sec.gov/Archives/edgar/data/1556179/0001104659-20-000861.txt" 
r=requests.get(path, headers={"User-Agent": "b2g"})
content=r.content.decode('utf8')
soup=BeautifulSoup(content, "html5lib")
soup=str(soup)
lines=soup.split("\\n")

dest_url=r"C://Users/YL/Downloads/a.txt"
fx=open(dest_url,'w')
for line in lines:
    fx.write(line + '\n')

Here is the error message: [posted as a screenshot; image not available]

How should I download the file then? Thanks a lot!

Julie
  • Please [don’t post images of code, error messages, or other textual data.](https://meta.stackoverflow.com/questions/303812/discourage-screenshots-of-code-and-or-errors) – tripleee Jan 17 '22 at 12:05
  • The EDGAR data famously contains Unicode errors; that's the root cause of your problem. – tripleee Jan 17 '22 at 12:06
  • I _know_ there is a duplicate but I can't find it. Basically, the EDGAR people seem to have invented their own bastard version of UTF-8 (or was it Windows-1252?) which isn't compatible with any real encoding; you have to find the offending bytes and replace them with the correct ones. It's a mechanical change once you see what's wrong. Search for Python questions about encoding errors with an answer by MartijnPieters (I think it was?) and comments by myself, or perhaps vice versa. – tripleee Jan 17 '22 at 12:22
  • The URL in your example seems to contain completely correct character codes, though. The immediate problem seems to be that `soup = str(soup)` is not a good idea. Did you mean `soup = soup.text` perhaps? – tripleee Jan 17 '22 at 12:29

2 Answers

1

The download is fine. The problem is that str(soup) is not well-defined, and throws html5lib into an endless loop. You probably meant

soup = soup.text

which (crudely) extracts the actual readable text from the BeautifulSoup object.
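A minimal sketch of the corrected flow, under a few assumptions: `save_text` and `fetch_and_save` are hypothetical helper names, the parser stays `html5lib` as in the question, and the output is written as UTF-8. Note also that `"\\n"` in the question is a literal backslash followed by `n`, so that `split` never matched a real newline; `splitlines()` is used here instead.

```python
def save_text(text: str, dest: str) -> int:
    """Write text to dest as UTF-8, one line at a time; return the line count."""
    # splitlines() splits on real line boundaries, unlike split("\\n"),
    # which looks for a literal backslash followed by "n".
    lines = text.splitlines()
    # The with-block closes the file even if a write fails.
    with open(dest, "w", encoding="utf-8", newline="\n") as fx:
        for line in lines:
            fx.write(line + "\n")
    return len(lines)


def fetch_and_save(url: str, dest: str) -> int:
    """Download url, extract the readable text, and save it to dest."""
    # Third-party imports are kept local so save_text works without them.
    import requests
    from bs4 import BeautifulSoup

    r = requests.get(url, headers={"User-Agent": "b2g"}, timeout=30)
    r.raise_for_status()
    soup = BeautifulSoup(r.text, "html5lib")
    # soup.text returns the readable text instead of re-serialising the tree.
    return save_text(soup.text, dest)
```

With this in place, something like `fetch_and_save(path, dest_url)` using the two variables from the question should write the text to `a.txt`.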

tripleee
0

The file itself downloads fine; the problem seems to be in BeautifulSoup's parsing. Try changing the parser, like this:

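import requests
from bs4 import BeautifulSoup

path = r"https://www.sec.gov/Archives/edgar/data/1556179/0001104659-20-000861.txt"
r = requests.get(path, headers={"User-Agent": "b2g"})
# Use Python's built-in parser instead of html5lib
soup = BeautifulSoup(r.text, "html.parser")
soup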

and you'll see the file is there.
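Since the download itself works, another option is to skip parsing altogether and write the raw response bytes straight to disk, so nothing is decoded or re-serialised along the way. The sketch below is only an illustration: it swaps in the standard library's `urllib.request` for `requests`, `download` is a hypothetical helper name, and the `"b2g"` User-Agent is copied from the question.

```python
import shutil
import urllib.request


def download(url: str, dest: str, user_agent: str = "b2g") -> None:
    """Stream the body of url into the file dest without decoding it."""
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    # Binary mode keeps the bytes exactly as the server sent them.
    with urllib.request.urlopen(req) as resp, open(dest, "wb") as out:
        shutil.copyfileobj(resp, out)
```

Calling `download(path, r"C://Users/YL/Downloads/a.txt")` with the URL from the question would then save the filing as-is; any text extraction can happen later on the local copy.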

Jack Fleeting
  • I replaced my code with the three lines you wrote, but now the error message is "TypeError: 'NoneType' object is not callable". Could you please help me fix the problem? Thank you! – Julie Jan 17 '22 at 03:04
  • @Julie Not sure why you're getting that error. I edited the answer using the full code (which worked for me). Try it now and see if it works. – Jack Fleeting Jan 17 '22 at 11:40