
I'm trying to write a script that makes a request to each URL listed in a text file, e.g.:

import urllib2

with open('urls.txt') as urls:
    for url in urls:
        try:
            r = urllib2.urlopen(url)
        except urllib2.URLError as e:
            r = e
        if r.code in (200, 401):
            print '[{}]: '.format(url), "Up!"

However, when a URL returns 404 Not Found, I want the line containing that URL to be erased from the file. There is one unique URL per line, so the goal is to remove every URL (and its line) that returns 404. How can I accomplish this?

Brian Tompsett - 汤莱恩
user1985563

2 Answers


You could simply save all the URLs that worked, and then rewrite them to the file:

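import urllib2

good_urls = []
with open('urls.txt') as urls:
    for url in urls:
        try:
            r = urllib2.urlopen(url)
        except urllib2.URLError as e:
            r = e
        # Only HTTPError carries .code; a plain URLError (e.g. a DNS
        # failure) does not, so fall back to None instead of crashing.
        if getattr(r, 'code', None) in (200, 401):
            print '[{}]: '.format(url), "Up!"
            good_urls.append(url)
# Each kept line still ends with its newline, so join them as-is.
with open('urls.txt', 'w') as urls:
    urls.write("".join(good_urls))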
David Robinson
  • +1 but there are other codes and the code handling does not really match what the OP described (probably won't handle redirects properly etc). I guess it should be more like `if r.code != 404:` – wim Jan 24 '13 at 01:25

The easiest way is to read all the lines, try to open each URL in turn, and then, once you are done, rewrite the file if any URLs failed.

To rewrite the file safely, write a new file first. Once the new file has been written and closed successfully, use os.rename() to give it the old file's name, overwriting the old file. That way you never overwrite the good file until you know the new file has been written correctly.

I think the simplest way to do this is to collect the good URLs in a list and keep a count of failed URLs. If the count is not zero, you need to rewrite the text file. Alternatively, you can collect the bad URLs in a second list, which is what I did in this example code. (I haven't tested this code but I think it should work.)

import os
import urllib2

input_file = "urls.txt"
debug = True

good_urls = []
bad_urls = []

bad, good = range(2)

def track(url, good_flag, code):
    if good_flag == good:
        good_str = "good"
    elif good_flag == bad:
        good_str = "bad"
    else:
        good_str = "ERROR! (" + repr(good_flag) + ")"
    if debug:
        print("DEBUG: %s: '%s' code %s" % (good_str, url, repr(code)))
    if good_flag == good:
        good_urls.append(url)
    else:
        bad_urls.append(url)

with open(input_file) as f:
    for line in f:
        url = line.strip()
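        try:
            r = urllib2.urlopen(url)
            code = r.code
        except urllib2.HTTPError as e:
            # urlopen raises HTTPError for 4xx/5xx responses,
            # so both 401 and 404 arrive here, not in the try body
            code = e.code
        except urllib2.URLError as e:
            # no HTTP response at all (DNS failure, refused connection, ...)
            track(url, bad, "exception!")
            continue
        if code in (200, 401):
            print '[{0}]: '.format(url), "Up!"
        if code == 404:
            # URL is bad if it is missing (code 404)
            track(url, bad, code)
        else:
            # any code other than 404, assume URL is good
            track(url, good, code)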

# if any URLs were bad, rewrite the input file to remove them.
if bad_urls:
    # simple way to get a filename for temp file: append ".tmp" to filename
    temp_file = input_file + ".tmp"
    with open(temp_file, "w") as f:
        for url in good_urls:
            f.write(url + '\n')
    # if we reach this point, temp file is good.  Remove old input file
    os.remove(input_file)  # only needed for Windows
    os.rename(temp_file, input_file)  # replace original input file with temp file

EDIT: In comments, @abarnert suggests that there might be a problem with using os.rename() on Windows (at least I think that is what he/she means). If os.rename() doesn't work, you should be able to use shutil.move() instead.
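For readers on Python 3.3 or later, os.replace() may be worth a look here: unlike os.rename(), it overwrites an existing destination on Windows as well as on POSIX (it can still fail on Windows if another process has the file open). The following is only a minimal sketch, assuming Python 3 rather than the Python 2 used above; `rewrite_lines` is a hypothetical helper name, not part of the original answer.

```python
import os
import tempfile


def rewrite_lines(path, lines):
    """Write lines to a temp file next to path, then swap it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the same directory so the final rename
    # stays on a single filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            for line in lines:
                # Normalise so every entry ends with exactly one newline
                tmp.write(line.rstrip("\n") + "\n")
        # Overwrites an existing destination on both POSIX and Windows
        os.replace(tmp_path, path)
    except BaseException:
        # Something failed: drop the temp file, leave the original untouched
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Calling `rewrite_lines("urls.txt", good_urls)` would stand in for the remove-then-rename step. One trade-off to be aware of: tempfile.mkstemp() creates the file readable and writable only by the current user, so the replaced file ends up with those permissions.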

EDIT: Rewrite code to handle errors.

EDIT: Rewrite to add verbose messages as URLs are tracked. This should help with debugging. Also, I actually tested this version and it works for me.
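One detail that is easy to miss when classifying URLs: urlopen() does not return a response object for 4xx/5xx statuses, it raises HTTPError, and HTTPError is a subclass of URLError. A handler that catches URLError therefore also sees 401 and 404 responses, and the status code has to be read from the exception. Below is a minimal sketch of that distinction, assuming Python 3 (where urllib2 became urllib.request and urllib.error). The names `http_status` and `should_remove`, the `opener` parameter, and the 10-second timeout are illustrative assumptions, not part of the code above.

```python
import urllib.error
import urllib.request


def http_status(url, opener=urllib.request.urlopen, timeout=10):
    """Return the HTTP status for url, or None if no HTTP response arrived."""
    try:
        response = opener(url, timeout=timeout)
        return response.getcode()
    except urllib.error.HTTPError as e:
        # 4xx/5xx responses are raised; the status lives on the exception.
        # This clause must come before URLError, its parent class.
        return e.code
    except urllib.error.URLError:
        # DNS failure, refused connection, etc.: no status code exists
        return None


def should_remove(status):
    """Hypothetical policy: drop a URL only on a definite 404."""
    return status == 404
```

With this split, a 401 response yields `should_remove(401) == False`, and a network failure (status `None`) also leaves the URL in the file. Whether unreachable hosts should be kept or dropped is a policy choice; the answer's code treats them as bad.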

steveha
  • The problem with this solution is that it doesn't work on Windows. If the OP only cares about Unix, it's by far the simplest and best (especially if you just use `tempfile.NamedTemporaryFile` instead of worrying about what to call it, where to put it, etc.), but it's worth mentioning in case the OP _does_ care about Windows. – abarnert Jan 24 '13 at 01:21
  • @abarnert, why do you say this won't work on Windows? Do you think that Python didn't implement `os.rename()` on Windows, or what? (I have used `os.rename()` on Windows... hmm, I'm using Cygwin, that might matter.) I was on the fence about using `tempfile.NamedTemporaryFile()` but I went with the simple solution of tacking `.tmp` on the end of the existing filename; that way the file is always on the same file system (indeed in the same directory) as the original yet the code is very simple. – steveha Jan 24 '13 at 01:28
  • Oohh, unfortunately I tested it with a fake URL and the script stops and doesn't continue checking the other URLs :X `Traceback (most recent call last): File "test.py", line 16, in if r.code in (200, 401): AttributeError: 'URLError' object has no attribute 'code'` – user1985563 Jan 24 '13 at 01:45
  • I was just trusting your original code there. Your original code must fail the same way. But okay, I'll rewrite it for you. – steveha Jan 24 '13 at 01:47
  • Oh, thanks a lot, and sorry about that, I'm a beginner in Python :(. – user1985563 Jan 24 '13 at 01:49
  • @steveha: Quoting [the docs](http://docs.python.org/2/library/os.html#os.rename): "On Windows, if dst already exists, OSError will be raised even if it is a file; there may be no way to implement an atomic rename when dst names an existing file." And using [`shutil.move`](http://docs.python.org/2/library/shutil.html#shutil.move) does not help at all: "If the destination is on the current filesystem, then os.rename() is used." – abarnert Jan 24 '13 at 02:20
  • @steveha: And before you ask "So what do you do instead?", the answer is that you just can't do proper atomic writes if you care about Windows. There are different things you can do instead, all with different downsides, and you just have to pick the one that's most acceptable. – abarnert Jan 24 '13 at 02:23
  • When I get `401 unauthorized` the script deletes the URL! How do I fix it? I rewrote the code but it still deletes it; I only want to delete a URL if I get `404` – user1985563 Jan 24 '13 at 02:24
  • I will rewrite it one last time. – steveha Jan 24 '13 at 02:27
  • @abarnert, in this case there shouldn't be a race condition and we can just explicitly delete the original file so the rename will then work. Oh, and wow, Windows is broken if there is no guaranteed atomic rename! – steveha Jan 24 '13 at 02:33
  • I tested your new code and again the `401 unauthorized` URL was deleted :~. – user1985563 Jan 24 '13 at 02:35
  • I'm not sure what to tell you. I don't see how that can happen with the new code. – steveha Jan 24 '13 at 02:40
  • :(! But okay, you helped me a lot. Do you have some documentation I can read to learn more? – user1985563 Jan 24 '13 at 02:43
  • After this I have to go. But I rewrote the program to print messages as it classifies the URLs as good or bad. Pick a URL that you think should be classified as good, and check the output of this program to see how it was classified by the program. Also, I suggest you get a copy of Wing IDE, because that has a debugger for Python that works like Visual Studio. There is a free version, or you can get a free trial of one of the other versions. http://www.wingware.com/wingide/trial Good luck. – steveha Jan 24 '13 at 04:14
  • @steveha: Yes, this means Windows is broken. Did you also know that opening a file exclusively locks that file _and its directory entry_, so delete will fail if someone else has it open? This means that your delete-then-rename doesn't really solve any race condition; it just means that anywhere people would get a file-not-accessible-because-it's-locked error they instead get a file-not-found error. Seriously, you're not going to think of a simple solution to this problem that nobody has come up with in 17 years of trying; just accept that Windows is broken, and hope you never have to care. – abarnert Jan 24 '13 at 20:13