
When I run the code below, the for loop saves the first text correctly into its own file, but the second iteration saves the first AND the second into the next file, the third iteration saves the first, second and third into the following file, and so on. I'd like each iteration to be saved into a separate file without including the previous iterations. I don't have a clue what I'm missing here. Can anyone help, please?

import requests
from bs4 import BeautifulSoup
import pandas as pd

base_url = 'http://www.chakoteya.net/StarTrek/'
end_url = ['1.htm', '6.htm', '8.htm', '2.htm', '7.htm',
           '5.htm', '4.htm', '10.htm', '12.htm', '11.htm', '3.htm', '16.htm']

episodes = []
count = 0

for end_url in end_url:
    url = base_url + end_url
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'html.parser')
    episodes.append(soup.text)
    file_text = open(f"./{count}.txt", "w")
    file_text.writelines()
    file_text.close()
    count = count + 1
    print(f"saved file for url:{url}")
Cygnus X-1
  • It could be good to use a different name for the loop variable, e.g. for a_url in end_url... And is there something missing in file_text.writelines()? – Ptit Xav Apr 11 '21 at 14:22
  • Yes, I was missing some stuff. You are right! Thank you very much!! – Cygnus X-1 Apr 11 '21 at 20:49

3 Answers


It doesn't appear that your code would save anything to the files at all, as you are calling writelines with no arguments.

if __name__ == '__main__':
    import requests
    from bs4 import BeautifulSoup

    base_url = 'http://www.chakoteya.net/StarTrek/'
    paths = ['1.htm', '6.htm', '8.htm', '2.htm', '7.htm',
             '5.htm', '4.htm', '10.htm', '12.htm', '11.htm', '3.htm', '16.htm']

    for path in paths:
        url = f'{base_url}{path}'
        filename = path.split('.')[0]
        r = requests.get(url)
        soup = BeautifulSoup(r.content, 'html.parser')

        with open(f"./{filename}.txt", "w") as f:
            f.write(soup.text)

        print(f"saved file for url:{url}")

This is reworked a little. It wasn't clear why the data was being appended to episodes, so that was left out.

Maybe you were writing that list to the file, which would account for the duplicates: you were appending the content of each page to the list and writing the whole, growing list on every iteration.
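A minimal sketch of that behaviour, using made-up page texts in place of soup.text: because the list keeps growing, every writelines call repeats everything from the earlier iterations.

pages = ["page one text", "page two text", "page three text"]  # made-up stand-ins for soup.text

episodes = []
for count, text in enumerate(pages):
    episodes.append(text)                  # the list keeps growing
    with open(f"./{count}.txt", "w") as f:
        f.writelines(episodes)             # writes the whole accumulated list every time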

pcauthorn
  • Thank you very much indeed, @pcauthorn. I've learned that the writelines method receives a list of strings, whereas the write method receives a string. Curiously, it was saving. Anyway, thank you very much again!! – Cygnus X-1 Apr 11 '21 at 14:27
  • @AndréAlencar glad you figured it out. I fixed it to write in this answer as well – pcauthorn Apr 11 '21 at 14:31
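For reference, a small illustration of the write()/writelines() distinction mentioned in the comments above (the file names here are made up):

# write() takes a single string; writelines() takes an iterable of strings
# and, perhaps surprisingly, does not add newlines between them.
with open("demo_write.txt", "w") as f:
    f.write("one string\n")

with open("demo_writelines.txt", "w") as f:
    f.writelines(["first line\n", "second line\n"])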

You need to empty your episodes list on each iteration. Try the following:

import requests
from bs4 import BeautifulSoup
import pandas as pd

base_url = 'http://www.chakoteya.net/StarTrek/'
end_url = ['1.htm', '6.htm', '8.htm', '2.htm', '7.htm',
           '5.htm', '4.htm', '10.htm', '12.htm', '11.htm', '3.htm', '16.htm']

count = 0

for end_url in end_url:
    episodes = []
    url = base_url + end_url
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'html.parser')
    episodes.append(soup.text)
    file_text = open(f"./{count}.txt", "w")
    file_text.writelines(episodes)
    file_text.close()
    count = count + 1
    print(f"saved file for url:{url}")
anisoleanime
  • Indeed, I never thought about that in the first place!! Totally right!! Thank you very much!!! I appreciate your help a lot!! – Cygnus X-1 Apr 11 '21 at 20:47

Please consider the following points!

  1. There's no reason at all to use bs4 here, since response.text already holds the page content.
  2. You should use the same Session, as explained in my previous answer.
  3. You can build the URL inside the loop with an f-string/format, which makes your code cleaner and easier to read.
  4. The with context manager is less of a headache, as you don't need to remember to close your file afterwards.
import requests

# page numbers not present in the original list, so they are skipped
block = [9, 13, 14, 15]


def main(url):
    with requests.Session() as req:  # one Session reused for every request
        for page in range(1, 17):
            if page not in block:
                print(f'Extracting Page# {page}')
                r = req.get(url.format(page))
                with open(f'{page}.htm', 'w') as f:
                    f.write(r.text)  # save the raw HTML


main('http://www.chakoteya.net/StarTrek/{}.htm')
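If you would rather keep the exact page names from the question instead of a numeric range, the same Session approach could look roughly like this (a sketch, assuming the original file names and raw-HTML output are what you want):

import requests

base_url = 'http://www.chakoteya.net/StarTrek/'
# the explicit pages from the original question
pages = ['1.htm', '6.htm', '8.htm', '2.htm', '7.htm',
         '5.htm', '4.htm', '10.htm', '12.htm', '11.htm', '3.htm', '16.htm']

with requests.Session() as session:  # reuse one connection for all requests
    for page in pages:
        r = session.get(f'{base_url}{page}')
        with open(page, 'w') as f:   # e.g. 1.htm, 6.htm, ...
            f.write(r.text)          # raw HTML, no BeautifulSoup needed
        print(f'saved file for url: {base_url}{page}')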