
I know that this is a repeated question; however, from all the answers on the web I could not find a solution, as they all throw errors. I am simply trying to scrape headlines from the web and save them to a txt file. The scraping code works well; however, it saves only the last headline, bypassing all the others. I have tried looping, putting the writing code before the scraping, appending to a list, etc., and different methods of scraping, but all have the same issue. Please help.

Here is my code:

def nytscrap():
    from bs4 import BeautifulSoup
    import requests

url = "http://www.nytimes.com"

page = BeautifulSoup(requests.get(url).text, "lxml")

for headlines in page.find_all("h2"):
    print(headlines.text.strip())

filename = "NYTHeads.txt" 
with open(filename, 'w') as file_object:
        file_object.write(str(headlines.text.strip()))


2 Answers


Every time your for loop runs, it overwrites the headlines variable, so by the time you get to writing to the file, the headlines variable only stores the last headline. An easy solution is to move the for loop inside your with statement, like so:

with open(filename, 'w') as file_object:
    for headlines in page.find_all("h2"):
        print(headlines.text.strip())
        file_object.write(headlines.text.strip()+"\n") # write a newline after each headline
clamchowder314
  • Hi, thank you very much for the quick response. I tried that earlier, however it threw an error that the txt file is not encoded as UTF-8? – Spreadsheet Pete May 03 '20 at 15:04
  • Error! C:user\ txt file path... NYTHeads.txt is not UTF-8 encoded. Saving disabled. See console for more details. I tried manually saving the file in the correct format and then running it, but still the same. – Spreadsheet Pete May 03 '20 at 15:17
  • It appears that this is a known issue in Jupyter notebooks, see https://stackoverflow.com/q/35928426 and https://github.com/jupyterhub/jupyterhub/issues/1572. However, it seems that the program does write to the file, it's just throwing an error when trying to read it. – clamchowder314 May 03 '20 at 16:03
  • OMG, you are right, when I open the file outside Jupyter it's all there! I lost 4 days trying to figure this out. So the corrected code is working, Jupyter just can't read the file... thank you very much for your help – Spreadsheet Pete May 03 '20 at 16:13
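For anyone hitting the same viewer complaint: a minimal sketch (plain file I/O only, no scraping; the sample headlines are made up) of the same write loop with an explicit encoding, which also lets the file read back cleanly:

```python
# Writing with encoding="utf-8" avoids the "not UTF-8 encoded"
# complaint from Jupyter's file viewer. Sample headlines are made up.
headlines_list = ["Sample headline one", "Sample headline two", "Café reopens"]

filename = "NYTHeads.txt"
with open(filename, "w", encoding="utf-8") as file_object:
    for headline in headlines_list:
        file_object.write(headline + "\n")

# Read it back with the same encoding to confirm every line was saved.
with open(filename, "r", encoding="utf-8") as file_object:
    saved = file_object.read().splitlines()

print(saved)
```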

Here is the full working code, corrected as per the advice above.

from bs4 import BeautifulSoup
import requests

url = "http://www.nytimes.com"

page = BeautifulSoup(requests.get(url).text, "lxml")
filename = "NYTHeads.txt" 
with open(filename, 'w') as file_object:
    for headlines in page.find_all("h2"):
        print(headlines.text.strip())
        file_object.write(headlines.text.strip()+"\n")

This code will throw an error in Jupyter when opening the file; however, when the file is opened outside Jupyter, the headlines are all saved...
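An alternative pattern is to collect the headlines into a list first and write them in one call. A minimal sketch, using a small inline HTML snippet (made up) so it runs without network access, and the built-in "html.parser" so no lxml install is needed:

```python
from bs4 import BeautifulSoup

# Tiny stand-in for the real page so the sketch runs offline.
html = "<html><body><h2> One </h2><h2>Two</h2></body></html>"
page = BeautifulSoup(html, "html.parser")

# Collect all headlines first instead of relying on the loop variable.
headlines = [h2.text.strip() for h2 in page.find_all("h2")]

with open("NYTHeads.txt", "w", encoding="utf-8") as file_object:
    file_object.write("\n".join(headlines) + "\n")

print(headlines)
```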