Collecting similar strings from local html-documents using Python's regex

Question

I'm having trouble extracting a set of similar strings from several local HTML files using Python's re module. The files in question look something like this:

<!DOCTYPE html>
<html lang=fi>
<head>
<meta charset=UTF-8>
<title>Peruslaskutoimituksia</title>
<link rel=stylesheet type="text/css" href=
"https://math.tut.fi/mathcheck/mathcheck.css">
<script type="text/javascript" src=
"https://math.tut.fi/mathcheck/MathJax/MathJax.js?config=AM_HTMLorMML">
</script>
</head>

<body>
<h1>Peruslaskutoimituksia</h1>

<p>Kirjoita vastauksesi yhtäsuuruusmerkin jälkeen.

<form action="https://math.tut.fi/mathcheck/cgi-bin/mathcheck.out" method=post
target=_blank>
<table class=ifr>
<tr><td class=ifrl>

<p>1. Määritä luvun ` 9 ` käänteisluku.

<textarea name="hidden" style=display:none>
verbose_off
/* Tehtävä 1 */
arithmetic
f_nodes 4
1/( 9 )
</textarea>

<p><textarea rows=2 cols=30 autofocus name="exam">= </textarea>
<textarea name="hidden" style=display:none>end_of_answer</textarea>

<input type=submit formtarget=feedback1 value="submit, to the right">
</td><td class=ifrr>
<iframe name=feedback1 height=200></iframe></td></tr>
</table>
</form>

<form action="https://math.tut.fi/mathcheck/cgi-bin/mathcheck.out" method=post
target=_blank>
<table class=ifr>
<tr><td class=ifrl>

<p>2. Määritä luvun ` -4 ` vastaluku.

<textarea name="hidden" style=display:none>
verbose_off
/* Tehtävä 2 */
arithmetic
f_nodes 2
(-1)*( -4 )
</textarea>

<p><textarea rows=2 cols=30 name="exam">= </textarea>
<textarea name="hidden" style=display:none>end_of_answer</textarea>

<input type=submit formtarget=feedback2 value="submit, to the right">
</td><td class=ifrr>
<iframe name=feedback2 height=200></iframe></td></tr>
</table>
</form>

<form action="https://math.tut.fi/mathcheck/cgi-bin/mathcheck.out" method=post
target=_blank>
<table class=ifr>
<tr><td class=ifrl>

<p>3. Laske lausekkeen ` 7 /d^ 3 + 4 ` arvo, kun `d=3`. Anna vastaus
kokonaislukuna tai murtolukuna `a/b`, missä `a` ja `b` ovat kokonaislukuja.

<textarea name="hidden" style=display:none>
verbose_off
/* Tehtävä 3 */
arithmetic
f_nodes 3
7 /3^ 3 + 4
</textarea>

<p><textarea rows=2 cols=30 name="exam">= </textarea>
<textarea name="hidden" style=display:none>end_of_answer</textarea>

<input type=submit formtarget=feedback3 value="submit, to the right">
</td><td class=ifrr>
<iframe name=feedback3 height=200></iframe></td></tr>
</table>
</form>

<form action="https://math.tut.fi/mathcheck/cgi-bin/mathcheck.out" method=post
target=_blank>
<table class=ifr>
<tr><td class=ifrl>

<p>4. Laske lausekkeen `x^ 4 + x^ 3 ` arvo, kun `x = 3`.

<textarea name="hidden" style=display:none>
verbose_off
/* Tehtävä 4 */
arithmetic
f_nodes 1
3^ 4 + 3^ 3
</textarea>

<p><textarea rows=2 cols=30 name="exam">= </textarea>
<textarea name="hidden" style=display:none>end_of_answer</textarea>

<input type=submit formtarget=feedback4 value="submit, to the right">
</td><td class=ifrr>
<iframe name=feedback4 height=200></iframe></td></tr>
</table>
</form>

<form action="https://math.tut.fi/mathcheck/cgi-bin/mathcheck.out" method=post
target=_blank>
<table class=ifr>
<tr><td class=ifrl>

<p>5. Laske lausekkeen `( 8 )/( 5 ): ( 2 )/( 3 )` arvo murtolukuna `a/b`, missä
`a` ja `b` ovat kokonaislukuja.

<textarea name="hidden" style=display:none>
verbose_off
/* Tehtävä 5 */
arithmetic
f_nodes 3
(( 8 )/( 5 ))/(( 2 )/( 3 ))
</textarea>

<p><textarea rows=2 cols=30 name="exam">= </textarea>
<textarea name="hidden" style=display:none>end_of_answer</textarea>

<input type=submit formtarget=feedback5 value="submit, to the right">
</td><td class=ifrr>
<iframe name=feedback5 height=200></iframe></td></tr>
</table>
</form>

<hr>
<p class=unimp>This file was generated 2018-07-06 11:23:11 UTC.

</body>
</html>

The strings I'm trying to obtain each start at a <p> followed by a task number and end at the first </textarea> after it, that is, everything matching

<p>(some number). ...</textarea>,

for example

<p>1. Määritä luvun ` 9 ` käänteisluku.

<textarea name="hidden" style=display:none>
verbose_off
/* Tehtävä 1 */
arithmetic
f_nodes 4
1/( 9 )
</textarea>

However, calling re.findall as follows

import os
import re

"""
Reading the assignments into memory from the produced HTML-files
"""
htmlDirEntries = sorted([filename for filename in os.listdir("./HTMLfiles") if filename.endswith(".html")])
# print(htmlDirEntries)

HTMLdict = {}
for filename in htmlDirEntries:
    print(f"Copying contents of {filename}...")
    with open(f"./HTMLfiles/{filename}", 'r', encoding="utf-8") as f:
        filecontents = f.read()
        HTMLdict[filename] = filecontents
    print("Done.")

print()

print("Looking for and collecting assignments...")
print()

for filename in HTMLdict:
    print(filename)
    assignments = re.findall(r"<p>\d.+.*</textarea>", HTMLdict[filename], re.DOTALL)
    for i, assignment in enumerate(assignments, start=1):
        print(f"Assignment {i}")
        print(assignment)

print("Done.")

does not return the desired output, which should be a list of strings looking something like this:

'<p>1. Määritä luvun ` 7 ` käänteisluku.\n\n<textarea name="hidden" style=display:none>\nverbose_off\n/* Tehtävä 1 */\narithmetic\nf_nodes 4\n1/( 7 )\n</textarea>'

What is returned instead is a single match containing everything in the file from the first <p>1. onwards. I'm guessing my use of re.findall is not correct.

Why does findall not return strings that end at the first </textarea> after each starting <p>? What am I doing wrong here? My first guess is that the problem lies in the regular expression passed to findall, namely "<p>\d.+.*</textarea>"...

EDIT:

The following expression returns what I want, although it doesn't work in all cases:

assignments = re.findall(r"<p>\d+.+\n\n.+\n.+\n.+\n.+\n.+\n.+\n</textarea>", HTMLdict[filename])

With some files this returns an empty list, even though I'm pretty sure they have the same format.

EDIT2

Turns out the files didn't have the same format. There are extra newlines in some of the files between the starting <p>-tag and <textarea...
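
A minimal sketch that reproduces this, using made-up excerpts (the two sample strings and the variable names are assumptions for illustration, not copied from the real files). The fixed-newline pattern matches when exactly one blank line follows the <p> line and finds nothing once a second blank line is present:

```python
import re

# The pattern from the first EDIT: it hard-codes the number of newlines.
rigid = r"<p>\d+.+\n\n.+\n.+\n.+\n.+\n.+\n.+\n</textarea>"

# Hypothetical excerpt with one blank line between <p> and <textarea ...>.
one_blank_line = (
    "<p>1. Task.\n\n"
    "<textarea name=\"hidden\" style=display:none>\n"
    "verbose_off\n/* Tehtävä 1 */\narithmetic\nf_nodes 4\n1/( 9 )\n</textarea>"
)
# Same content, but with an extra blank line after the <p> line.
two_blank_lines = one_blank_line.replace("Task.\n\n", "Task.\n\n\n")

# "." does not match a newline without re.DOTALL, so the extra "\n"
# breaks the line-by-line structure that the pattern expects.
matches_one = re.findall(rigid, one_blank_line)
matches_two = re.findall(rigid, two_blank_lines)

print(len(matches_one))  # 1
print(len(matches_two))  # 0
```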

score 0 · Answer 1 · answered Jul 06 '18 at 13:00

Turns out this had everything to do with the greediness of the *- and +-operators: by default they match as much text as possible. Replacing the greedy .+.* with a single quantifier made lazy by a trailing ?-operator fixed the issue:

assignments = re.findall(r"<p>\d+\.[\s\S]+?</textarea>", HTMLdict[filename])

Here the expression [\s\S] matches any character, including newlines, so re.DOTALL is not needed. The ?-operator after the + makes it lazy: it consumes as few characters as possible, so each match ends at the first </textarea> after its starting <p> instead of at the last </textarea> in the file.
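
To illustrate the difference, here is a minimal sketch on a shortened, made-up sample (the `sample` string and the variable names are assumptions, not taken from the real files). The lazy pattern below also escapes the dot after the task number so that it matches a literal period:

```python
import re

# Shortened stand-in for one of the HTML files (hypothetical content).
sample = (
    "<p>1. First task.\n\n"
    "<textarea name=\"hidden\">\n1/( 9 )\n</textarea>\n\n"
    "<p><textarea name=\"exam\">= </textarea>\n"
    "<p>2. Second task.\n\n"
    "<textarea name=\"hidden\">\n(-1)*( -4 )\n</textarea>\n"
)

# Greedy: with re.DOTALL the quantifiers consume as much as possible,
# so the single match extends to the LAST </textarea> in the text.
greedy = re.findall(r"<p>\d.+.*</textarea>", sample, re.DOTALL)

# Lazy: "+?" consumes as little as possible, so each match ends at the
# FIRST </textarea> that follows its "<p>N." start.
lazy = re.findall(r"<p>\d+\.[\s\S]+?</textarea>", sample)

print(len(greedy))  # 1
print(len(lazy))    # 2
```

The `<p><textarea name="exam">` line is skipped by both patterns because `<p>` is not followed by a digit there.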
