4

I have a script that reads files called source1.html, source2.html, source3.html, and so on, but when it can't find the next file (because it doesn't exist) it gives me an error. There can be any number of sourceX.html files, so I need a way to say: if the next sourceX.html file cannot be found, stop the loop.

Traceback (most recent call last):
  File "main.py", line 14, in <module>
    file = open(filename, "r")
IOError: [Errno 2] No such file or directory: 'source4.html'

How can I stop the script from looking for the next source file?

from bs4 import BeautifulSoup
import re
import os.path

n = 1
filename = "source" + str(n) + ".html"
savefile = open('OUTPUT.csv', 'w')

while os.path.isfile(filename):

    strjpgs = "Extracted Layers: \n \n"
    filename = "source" + str(n) + ".html"
    n = n + 1
    file = open(filename, "r")
    soup = BeautifulSoup(file, "html.parser")
    thedata = soup.find("div", class_="cplayer")
    strdata = str(thedata)
    DoRegEx = re.compile('/([^/]+)\.jpg')
    jpgs = DoRegEx.findall(strdata)
    strjpgs = strjpgs + "\n".join(jpgs) + "\n \n"
    savefile.write(filename + '\n')
    savefile.write(strjpgs)

    print(filename)
    print(strjpgs)

savefile.close()
print "done"
Chagger

4 Answers

2

Use a try/except and break:

while os.path.isfile(filename):
    try:  # try to do this
        # <your code>
    except IOError:  # if this error occurs (FileNotFoundError on Python 3 is a subclass)
        break  # exit the loop

Your code currently fails because the while condition checks whether the previous file exists, not the next one. Hence you could also do:

while True:
    strjpgs = "Extracted Layers: \n \n"
    filename = "source" + str(n) + ".html"
    if not os.path.isfile(filename):
        break
    # <rest of your code>
FHTMitchell
1

You can try opening the file and break out of the while loop once you catch an IOError exception.

from bs4 import BeautifulSoup
import re
import os.path

n = 1
filename = "source" + str(n) + ".html"
savefile = open('OUTPUT.csv', 'w')

while os.path.isfile(filename):

    try:
        strjpgs = "Extracted Layers: \n \n"
        filename = "source" + str(n) + ".html"
        n = n + 1
        file = open(filename, "r")
    except IOError:
        print("file not found! breaking out of loop.")
        break

    soup = BeautifulSoup(file, "html.parser")
    thedata = soup.find("div", class_="cplayer")
    strdata = str(thedata)
    DoRegEx = re.compile('/([^/]+)\.jpg')
    jpgs = DoRegEx.findall(strdata)
    strjpgs = strjpgs + "\n".join(jpgs) + "\n \n"
    savefile.write(filename + '\n')
    savefile.write(strjpgs)

    print(filename)
    print(strjpgs)

savefile.close()
print "done"
Ali Yılmaz
  • This is like a brute-force solution: it works, but the problem is in the logic itself – Netwave May 11 '18 at 10:13
  • Always try to narrow your `try` blocks. This will completely hide *why* some `IOError` occurred - such as if it was in the `savefile.write` calls. Edit: And don't forget to move the code outside the `try` rather than into the `except`. – Yann Vernier May 11 '18 at 10:50
  • @Yann Vernier you're right. I was about to mention that in an edit. I tried to minimize the try block. – Ali Yılmaz May 11 '18 at 11:21
  • 1
    This also didn't remove the sequence error in updating `filename`, so the `isfile` test remains a bit moot. But catching the exception is still the cleaner method, for instance if someone deletes the file in the moment between `isfile` and `open`. – Yann Vernier May 11 '18 at 11:24
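
The advice in these comments about keeping the `try` block narrow could be sketched roughly as follows. This is only an illustration: `read_source` is a hypothetical helper, not part of the original script, and it treats any `IOError` from `open` as "no such file".

```python
def read_source(filename):
    """Return the text of filename, or None if it cannot be opened.

    read_source is a hypothetical helper name used for illustration only.
    """
    try:
        handle = open(filename, "r")  # the only call guarded by the try
    except IOError:
        # Assumption for this sketch: any failure to open means "stop here".
        return None
    else:
        # Runs only when open() succeeded; errors raised while reading
        # are not swallowed by the except clause above.
        with handle:
            return handle.read()
```

Because only `open` is inside the `try`, an error from later steps such as `savefile.write` would still surface with its own traceback.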
0

This appears to be a sequence error. Let's look at a small fragment of your code, specifically lines dealing with filename:

filename = "source" + str(n) + ".html"

while os.path.isfile(filename):

    filename = "source" + str(n) + ".html"
    n = n + 1
    file = open(filename, "r")

You're generating the next filename before you open the file (or really, checking the old filename then opening a new one). It's a little hard to see because you're really updating n while filename holds the previous number, but if we look at them in sequence it pops out:

n = 1
filename = "source1.html"   # before loop
while os.path.isfile(filename):
 filename = "source1.html"   # first time inside loop
 n = 2
 open(filename)
while os.path.isfile(filename):  # second time in loop - still source1
 filename = "source2.html"
 n = 3
 open(filename)    # We haven't checked if this file exists!

We can fix this in a few ways. One is to move the whole update (n first, then filename) to the end of the loop. Another is to let the loop mechanism update n, which is a sight easier (the real fix here is that we only use one filename value in each iteration of the loop):

import itertools
import os.path

for n in itertools.count(1):
    filename = "source{}.html".format(n)
    if not os.path.isfile(filename):
        break
    file = open(filename, "r")
    #...
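
For comparison, the first fix mentioned above (updating n and then filename at the end of the loop) might look something like this. It is a minimal sketch: the function name `existing_sources` and its `directory` parameter are assumptions added so the example can run on its own.

```python
import os.path


def existing_sources(directory="."):
    """Collect source1.html, source2.html, ... until one is missing.

    existing_sources and directory are hypothetical names for this sketch.
    """
    found = []
    n = 1
    filename = os.path.join(directory, "source{}.html".format(n))
    while os.path.isfile(filename):
        # filename is exactly the name the while condition just checked.
        found.append(filename)
        # Last step of the iteration: advance n, then rebuild filename.
        n += 1
        filename = os.path.join(directory, "source{}.html".format(n))
    return found
```

The point is that the name tested by `isfile` and the name used inside the loop body are always the same value.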

At the risk of looking rather obscure, we can also express the steps functionally (I'm using six here to avoid a difference between Python 2 and 3; Python 2's map wouldn't finish):

import os.path
from itertools import count, takewhile

from six.moves import map

numbers = count(1)
filenames = map('source{}.html'.format, numbers)
existingfiles = takewhile(os.path.isfile, filenames)

for filename in existingfiles:
    file = open(filename, "r")
    #...

Other options include iterating over the numbers alone and using break when isfile returns False, or simply catching the exception when open fails (eliminating the need for isfile entirely).
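
The last option, dropping isfile and relying on the exception from open, could be sketched like this. The generator name `read_sources` and its `directory` parameter are hypothetical, and checking `errno.ENOENT` is one possible way to tell "file not found" apart from other I/O errors.

```python
import errno
import itertools
import os.path


def read_sources(directory="."):
    """Yield (filename, text) for source1.html, source2.html, ... in order.

    Stops at the first missing file. read_sources is a hypothetical name.
    """
    for n in itertools.count(1):
        filename = os.path.join(directory, "source{}.html".format(n))
        try:
            handle = open(filename, "r")
        except IOError as exc:
            if exc.errno == errno.ENOENT:
                return  # no such file: the sequence has ended
            raise  # any other I/O problem is a real error
        with handle:
            yield filename, handle.read()
```

A caller would then loop with `for filename, text in read_sources():` and do the parsing inside that loop, with no window between a check and the open.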

Yann Vernier
0

I suggest using both os.path.exists() (which returns True/False) and os.path.isfile().

Use a with statement to open the file. It is the Pythonic way to open files.

The with statement closes the file automatically, even if an error occurs, which is why it is generally preferred.

These are the contents of my current working directory.

H:\RishikeshAgrawani\Projects\Stk\ReadHtmlFiles>dir
 Volume in drive H is New Volume
 Volume Serial Number is C867-828E

 Directory of H:\RishikeshAgrawani\Projects\Stk\ReadHtmlFiles

11/05/2018  16:12    <DIR>          .
11/05/2018  16:12    <DIR>          ..
11/05/2018  15:54               106 source1.html
11/05/2018  15:54               106 source2.html
11/05/2018  15:54               106 source3.html
11/05/2018  16:12                 0 stopReadingIfNot.md
11/05/2018  16:11               521 stopReadingIfNot.py
               5 File(s)            839 bytes
               2 Dir(s)  196,260,925,440 bytes free

The Python code below shows how to read the files source1.html, source2.html, source3.html and stop when there are no more files of the form sourceX.html (where X is 1, 2, 3, 4, etc.).

Sample code:

import os

n = 1
html_file_name = 'source%d.html'

# It is necessary to check whether sourceX.html is a file or a directory,
# and whether it exists at all.
# If it exists, perform the operation (read/write etc.) on the file.
while os.path.isfile(html_file_name % n) and os.path.exists(html_file_name % n):
    print "Reading ", html_file_name % n

    # The best way (Pythonic way) to open a file:
    # you don't need to bother about closing the file,
    # the with statement takes care of it.
    with open(html_file_name % n, "r") as file:
        # Make sure it works
        print html_file_name % n, " exists\n"

    n += 1

Output:

H:\RishikeshAgrawani\Projects\Stk\ReadHtmlFiles>python stopReadingIfNot.py
Reading  source1.html
source1.html  exists

Reading  source2.html
source2.html  exists

Reading  source3.html
source3.html  exists

So, based on the above logic, you can modify your code and it will work.

Thanks.

hygull
  • 1
    Why would `exists` be better here? It will just generate a different error when encountering directories, sockets, device nodes and such. – Yann Vernier May 11 '18 at 11:00
  • 2
    Honestly, neither `isfile` nor `exists` are necessary - the question is if the `open` works. Checking at different times is a potential race condition. Catching the exception is the better way to go. Anything that `isfile` also `exists`, so using both is even more redundant. – Yann Vernier May 11 '18 at 11:10
  • **@YannVernier**, thanks for suggesting to improve my answer. If `sourceX.html` will be a directory then definitely code will fail. So I improved my code. Now it will work fine. Thanks again. – hygull May 11 '18 at 11:11