
When I run the code below, the for loop saves the first text correctly into its own file, but the second iteration saves the first AND the second into the next file, the third iteration saves the first, second and third into the following file, and so on. I'd like each iteration to be saved into a separate file without including the previous iterations. I don't have a clue what I'm missing here. Can anyone help, please?

import requests
from bs4 import BeautifulSoup
import pandas as pd

base_url = 'http://www.chakoteya.net/StarTrek/'
end_url = ['1.htm', '6.htm', '8.htm', '2.htm', '7.htm',
           '5.htm', '4.htm', '10.htm', '12.htm', '11.htm', '3.htm', '16.htm']

episodes = []
count = 0

for end_url in end_url:
    url = base_url + end_url
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'html.parser')
    episodes.append(soup.text)
    file_text = open(f"./{count}.txt", "w")
    file_text.writelines()
    file_text.close()
    count = count + 1
    print(f"saved file for url:{url}")
Cygnus X-1
  • It could be good to use a different name for the loop variable, e.g. for a_url in end_url... And is there something missing in file_text.writelines()? – Ptit Xav Apr 11 '21 at 14:22
  • Yes, I was missing some stuff. You are right! Thank you very much!! – Cygnus X-1 Apr 11 '21 at 20:49

3 Answers


It doesn't appear that your code would save anything to the files at all, as you are calling writelines with no arguments.

if __name__ == '__main__':
    import requests
    from bs4 import BeautifulSoup

    base_url = 'http://www.chakoteya.net/StarTrek/'
    paths = ['1.htm', '6.htm', '8.htm', '2.htm', '7.htm',
             '5.htm', '4.htm', '10.htm', '12.htm', '11.htm', '3.htm', '16.htm']

    for path in paths:
        url = f'{base_url}{path}'
        filename = path.split('.')[0]
        r = requests.get(url)
        soup = BeautifulSoup(r.content, 'html.parser')

        with open(f"./{filename}.txt", "w") as f:
            f.write(soup.text)

        print(f"saved file for url:{url}")

This is reworked a little. It wasn't clear why the data was being appended to episodes, so that was left out.

Maybe you were writing that list to the file, which would account for the duplicates: you were appending the content of each page to the list and writing the whole, growing list on every iteration.
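A minimal sketch of that behaviour, using made-up page texts in place of soup.text: because the list keeps growing, every writelines call repeats everything from the earlier iterations.

pages = ["page one text", "page two text", "page three text"]  # made-up stand-ins for soup.text

episodes = []
for count, text in enumerate(pages):
    episodes.append(text)                  # the list keeps growing
    with open(f"./{count}.txt", "w") as f:
        f.writelines(episodes)             # writes the whole accumulated list every time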

pcauthorn
  • Thank you very much indeed, @pcauthorn. I've learned that the writelines method receives a list of strings, whereas the write method receives a string. Curiously, it was saving. Anyway, thank you very much again!! – Cygnus X-1 Apr 11 '21 at 14:27
  • @AndréAlencar glad you figured it out. I fixed it to write in this answer as well – pcauthorn Apr 11 '21 at 14:31
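For reference, a small illustration of the write()/writelines() distinction mentioned in the comments above (the file names here are made up):

# write() takes a single string; writelines() takes an iterable of strings
# and, perhaps surprisingly, does not add newlines between them.
with open("demo_write.txt", "w") as f:
    f.write("one string\n")

with open("demo_writelines.txt", "w") as f:
    f.writelines(["first line\n", "second line\n"])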

You need to empty your episodes list on each iteration. Try the following:

import requests
from bs4 import BeautifulSoup
import pandas as pd

base_url = 'http://www.chakoteya.net/StarTrek/'
end_url = ['1.htm', '6.htm', '8.htm', '2.htm', '7.htm',
           '5.htm', '4.htm', '10.htm', '12.htm', '11.htm', '3.htm', '16.htm']

count = 0

for end_url in end_url:
    episodes = []
    url = base_url + end_url
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'html.parser')
    episodes.append(soup.text)
    file_text = open(f"./{count}.txt", "w")
    file_text.writelines(episodes)
    file_text.close()
    count = count + 1
    print(f"saved file for url:{url}")
anisoleanime
  • Indeed, I never thought about that in the first place!! Totally right!! Thank you very much!!! I appreciate your help a lot!! – Cygnus X-1 Apr 11 '21 at 20:47

Please consider the following points!

  1. There's no reason at all to use bs4 here, since response.text already holds the page content.
  2. You should use the same Session, as explained in my previous answer.
  3. You can build the URL inside the loop with an f-string/format, which makes your code cleaner and easier to read.
  4. The with context manager is less of a headache, as you don't need to remember to close your file afterwards.
import requests

# page numbers not present in the original list, so they are skipped
block = [9, 13, 14, 15]


def main(url):
    with requests.Session() as req:  # one Session reused for every request
        for page in range(1, 17):
            if page not in block:
                print(f'Extracting Page# {page}')
                r = req.get(url.format(page))
                with open(f'{page}.htm', 'w') as f:
                    f.write(r.text)  # save the raw HTML


main('http://www.chakoteya.net/StarTrek/{}.htm')
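If you would rather keep the exact page names from the question instead of a numeric range, the same Session approach could look roughly like this (a sketch, assuming the original file names and raw-HTML output are what you want):

import requests

base_url = 'http://www.chakoteya.net/StarTrek/'
# the explicit pages from the original question
pages = ['1.htm', '6.htm', '8.htm', '2.htm', '7.htm',
         '5.htm', '4.htm', '10.htm', '12.htm', '11.htm', '3.htm', '16.htm']

with requests.Session() as session:  # reuse one connection for all requests
    for page in pages:
        r = session.get(f'{base_url}{page}')
        with open(page, 'w') as f:   # e.g. 1.htm, 6.htm, ...
            f.write(r.text)          # raw HTML, no BeautifulSoup needed
        print(f'saved file for url: {base_url}{page}')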