Reading in parts of file, stopping and starting with certain words

Question

I'm using Python 2.7, and I have been assigned (self-directed assignment, I wrote these instructions) to write a small static HTML generator. I would like help finding beginner-oriented resources on reading a file a portion at a time. If someone provides code answers, that's great, but I want to understand why and how Python works. I can buy books, but not expensive ones; I can afford to put thirty, maybe forty dollars into this specific research at the moment.

The way this program is supposed to work is that there is a template.html file, a message.txt file, an image file, an archive.html file, and an output.html file. This is more information than you need, but the basic idea I had was "go back and forth reading from template and message, putting their contents in output and then writing in archive that output exists". But I haven't got there yet, and I'm not asking you to solve this entire problem, as I detail below:

The program reads in HTML from template.html, stopping at the opening <title> tag, then reads in what the title of the page is going to be from message.txt. That's where I am now. It works! I was so happy... hours ago, when I realized that was not the final boss.

#doctype to title
copyLine = False
for line in template.readlines():
    if not '<title>' in line:
        copyLine = True
        if copyLine:
            outputhtml.write(line)
            copyLine = False
    else:
        templateSeek = template.tell()
        break

#read name of message
titleOut = message.readline()
print titleOut, " is the title of the new page"
#--------
##5. Put the title from the message file in the head>title tag of the output file
#--------
titleOut = str(titleOut)
titleTag = "<title>"+titleOut+"|Circuit Salsa</title>"
outputhtml.write(titleTag)

My problem is this: I don't understand regular expressions, and when I try various forms of for...in codes, I get all of the template, none of the template, some combination of the parts of the template I didn't want... anyway, how do I go back and forth reading these files and pick up where I left off? Any assistance finding easier-to-understand resources is greatly appreciated, I've spent about five hours researching this and I'm getting a headache, because I keep getting resources aimed at more advanced audiences and I don't understand them.

These are the last two methods I tried (with no success):

block = ""
found = False
print "0"
for line in template:
    if found:
        print "1"
        block += line
        if line.strip() == "<h1>": break
else:
    if line.strip() == "</title>":
        print "2"
        found = True
        block = "</title>"

print block + "3"

Only points 0 and 3 got printed. I put the numbered print statements there because I couldn't figure out why my output file was unchanged.

template.seek(templateSeek)
copyLine = False
for line in template.readlines():
    if not '<a>' in line:
        copyLine = True
        if copyLine:
            outputhtml.write(line)
            copyLine = False
    else:
        templateSeek = template.tell()
        break

With the other one, I'm pretty sure I'm just doing it all wrong.

nothing successful, I think I was trying to have it pick up where it left off by using tell at the end of the successful read so that I could seek to it later. — NMacKenzie, Apr 20 '15 at 02:15

score 3 · Answer 1 · answered Apr 19 '15 at 23:58

I would use BeautifulSoup for this. An alternative is to use regular expressions, which are good to know anyway. I know they look quite intimidating, but they're actually not that difficult to learn (it took me an hour or so). For example, to get all of the link tags you can do something like this:

from re import findall, DOTALL

html = '''
<!DOCTYPE html>
<html>

<head>
    <title>My awesome web page!</title>
</head>

<body>
    <h2>Sites I like</h2>
    <ul>
        <li><a href="https://www.google.com/">Google</a></li>
        <li><a href="https://www.facebook.com">Facebook</a></li>
        <li><a href="http://www.amazon.com">Amazon</a></li>
    </ul>

    <h2>My favorite foods</h2>
    <ol>
        <li>Pizza</li>
        <li>French Fries</li>
    </ol>
</body>

</html>
'''

def find_tag(src, tag):
    return findall(r'<{0}.*?>.*?</{0}>'.format(tag), src, DOTALL)

print find_tag(html, 'a')
# ['<a href="https://www.google.com/">Google</a>', '<a href="https://www.facebook.com">Facebook</a>', '<a href="http://www.amazon.com">Amazon</a>']
print find_tag(html, 'li')
# ['<li><a href="https://www.google.com/">Google</a></li>', '<li><a href="https://www.facebook.com">Facebook</a></li>', '<li><a href="http://www.amazon.com">Amazon</a></li>', '<li>Pizza</li>', '<li>French Fries</li>']
print find_tag(html, 'body')
# ['<body>\n    <h2>Sites I like</h2>\n    <ul>\n        <li><a href="https://www.google.com/">Google</a></li>\n        <li><a href="https://www.facebook.com">Facebook</a></li>\n        <li><a href="http://www.amazon.com">Amazon</a></li>\n    </ul>\n\n    <h2>My favorite foods</h2>\n    <ol>\n        <li>Pizza</li>\n        <li>French Fries</li>\n    </ol>\n</body>']
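The `?` after each `.*` in the pattern above is what keeps every match short. As a rough illustration (the two-item string here is a made-up sample, not part of the original answer), compare the greedy and non-greedy forms of the same pattern:

```python
from __future__ import print_function  # lets this run on Python 2.7 and 3
import re

# Made-up sample with two list items on separate lines.
src = "<li>Pizza</li>\n<li>French Fries</li>"

# Greedy: ".*" grabs as much as it can, so both items merge into one match.
greedy = re.findall(r"<li.*>.*</li>", src, re.DOTALL)

# Non-greedy: ".*?" grabs as little as it can, so each item is matched alone.
lazy = re.findall(r"<li.*?>.*?</li>", src, re.DOTALL)

print(len(greedy), "greedy match(es)")
print(len(lazy), "non-greedy match(es)")
```

DOTALL lets `.` match newlines as well, which is why the `body` search above can span several lines.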

I hope that you find at least some of this useful. If you have any follow up questions, please comment on my answer. Good luck!

The page you linked about regular expressions is helpful, thank you. If there is anything else you would recommend to a newcomer to it, lay it on me. Additionally, I did see BeautifulSoup, and I will likely use it in a later version of this program - but while this is a self-directed assignment I do have to turn it in to someone else later and I don't want to try to use something that he might not have installed. — NMacKenzie, Apr 20 '15 at 01:32

score 3 · Answer 2 · answered Apr 19 '15 at 23:59

In your first attempt you have an indentation problem. The else clause is at the same indent level as the for statement, so together they form the compound for:else: control structure. New Python programmers are often confused by this. The else: clause only executes if the for loop runs to the end without encountering a break statement. In your case "found" starts out False, so the body of "if found:" (including the break) never runs, and the else: clause executes just once, after the loop has finished, when "line" holds only the last line of the file. Because the else: clause is outside the loop, the "</title>" test is never applied to the other lines and "found" never gets set to True. I think if you indent the else: clause you will like the result. Also I think you could drop the calls to strip() and instead use statements like "if '<h1>' in line:" etc.
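A minimal sketch of the first attempt with the else: clause indented so that it pairs with "if found:" instead of with the for loop. The function name `collect_block` and the sample lines are hypothetical, and the sketch stores the whole matching line rather than the literal string "</title>":

```python
from __future__ import print_function  # lets this run on Python 2.7 and 3

def collect_block(lines):
    """Collect text from the line containing </title> through the line containing <h1>."""
    block = ""
    found = False
    for line in lines:
        if found:
            block += line
            if "<h1>" in line:
                break
        else:
            # This else now belongs to "if found", so it is checked on every line.
            if "</title>" in line:
                found = True
                block = line
    return block

# Hypothetical template contents, for illustration only.
sample = [
    "<html>\n",
    "<head>\n",
    "<title>\n",
    "</title>\n",
    "</head>\n",
    "<body>\n",
    "<h1>\n",
    "</h1>\n",
]

print(collect_block(sample))
```

Using `in` instead of comparing `line.strip()` also copes with lines that hold more than the bare tag, such as a title written on a single line.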

I suspect you're right about the second function. It makes no sense to me at all.
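One likely reason the tell()/seek() approach does not resume where expected: readlines() reads the entire file before the loop starts, so a later tell() reports the end of the file even if the loop breaks early. A possible alternative, sketched below under assumptions, is to loop over the file object itself, which remembers its position between loops. The helper name `copy_until` is hypothetical, and `io.StringIO` objects stand in for the real template and output files:

```python
from __future__ import print_function  # lets this run on Python 2.7 and 3
import io

def copy_until(lines, marker, out):
    """Write lines to out until one contains marker; return that line, or None."""
    for line in lines:
        if marker in line:
            return line
        out.write(line)
    return None

# Stand-ins for template.html and output.html (made-up contents).
template = io.StringIO(
    u"<html>\n<head>\n<title></title>\n</head>\n<body>\n<a></a>\n</body>\n"
)
output = io.StringIO()

# First pass: copy everything before the title line.
copy_until(template, u"<title>", output)
output.write(u"<title>My page|Circuit Salsa</title>\n")

# Second pass: the same object carries on from the line after the title.
stopped_at = copy_until(template, u"<a>", output)

print(output.getvalue())
```

The same function should work with a file returned by open(), since a file object is its own iterator; no tell() or seek() is needed as long as every read goes through the for loop.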

score 0 · Accepted Answer · answered Apr 29 '15 at 20:08

Late last night, I came across a solution that worked for what I was trying to do. While learning regular expressions will be a useful skill that I will definitely cultivate over the summer, regex was a little much for this particular application. I ended up using linecache to read in specific lines, since the information I wanted out of these files was delimited by the newline.
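The exact code is not shown here, so the following is only a sketch of how linecache can pull individual lines. The layout of message.txt (title on line 1, body text on line 2) and the temporary file are assumptions made for the example:

```python
from __future__ import print_function  # lets this run on Python 2.7 and 3
import linecache
import os
import tempfile

# Create a throwaway message file; the two-line layout is an assumption.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "message.txt")
with open(path, "w") as f:
    f.write("My first post\nHello, world!\n")

# linecache.getline counts lines from 1 and keeps the trailing newline.
title = linecache.getline(path, 1).rstrip("\n")
body = linecache.getline(path, 2).rstrip("\n")

# Asking for a line that does not exist returns '' instead of raising.
missing = linecache.getline(path, 99)

title_tag = "<title>" + title + "|Circuit Salsa</title>"
print(title_tag)
```

linecache keeps file contents in memory, so if the file changes while the program is running, linecache.checkcache(path) can be called before reading it again.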
