OK, my coding is very rusty so I've been borrowing and adapting from tutorials.

I started playing around with BeautifulSoup, opening a file with:

import bs4

with open('event.html', encoding='utf8') as f:
    soup = bs4.BeautifulSoup(f, "lxml")

Later, I needed to find a string in the same file, and BeautifulSoup seemed more complicated than necessary for that, so I did:

lines = f.readlines()

And put it together with the previous instructions:

with open('event.html', encoding='utf8') as f:
    soup = bs4.BeautifulSoup(f, "lxml")
    lines = f.readlines()

What puzzles me is that if I swap the two lines, so the block looks like this:

with open('event.html', encoding='utf8') as f:
    lines = f.readlines()
    soup = bs4.BeautifulSoup(f, "lxml")

then the rest of my code breaks. Why is that?

greye
  • the first one works – greye May 15 '17 at 13:55
  • because .readlines() advances the file pointer to the end of the file, so when BS tries to read, the pointer is already at the end of the file – Dan-Dev May 15 '17 at 13:57
  • so, should I use a different/better method to extract the lines? – greye May 15 '17 at 14:00
  • you can reset the pointer to the start of the file as per user3381590's answer, or see http://stackoverflow.com/questions/10201008/using-readlines-twice-in-a-row-in-python – Dan-Dev May 15 '17 at 14:03
  • order is unimportant for me but I was banging my head wondering why the code wasn't working and then even more confused when I figured out re-ordering that portion "fixed" it... if anyone has a suggestion I'll take it – greye May 15 '17 at 14:04
  • strike that. the first one doesn't work, it just doesn't crash the script but len(lines) = 0. I followed the f.seek(0) suggestion and now it is ok. – greye May 15 '17 at 14:11

1 Answer

The readlines function leaves the internal file pointer at the end of the file. I haven't used BeautifulSoup myself, but I assume it expects the input file to be positioned at the start (offset 0). Seeking back to the beginning with f.seek(0) should fix that.

with open('event.html', encoding='utf8') as f:
    lines = f.readlines()
    f.seek(0)
    soup = bs4.BeautifulSoup(f, "lxml")
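
To see the pointer behaviour on its own, here is a small sketch that uses only the standard library. The temporary file and the helper name demo_pointer are made up for illustration; they stand in for event.html and your own code.

```python
import os
import tempfile


def demo_pointer(path):
    """Read the same file three times to show how the pointer moves."""
    with open(path, encoding="utf8") as f:
        first = f.readlines()   # consumes everything, pointer ends at EOF
        second = f.readlines()  # nothing left to read, so this is empty
        f.seek(0)               # rewind to the start of the file
        third = f.readlines()   # the full content is available again
    return first, second, third


# Hypothetical throwaway file standing in for event.html.
fd, path = tempfile.mkstemp(suffix=".html")
with os.fdopen(fd, "w", encoding="utf8") as tmp:
    tmp.write("<p>one</p>\n<p>two</p>\n")

first, second, third = demo_pointer(path)
os.remove(path)

print(len(first), len(second), len(third))  # prints: 2 0 2
```

The second readlines() does not raise; it quietly returns an empty list, which is easy to miss.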

My guess is that BeautifulSoup reads the file and then sets the file pointer back to where it was before the read, which would explain why it seems to work the other way around. I have not verified this.
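
If you would rather not depend on the pointer position at all, one option is to read the file once into a string and derive everything from that. This is only a sketch: read_once is a hypothetical helper, and the temporary file again stands in for event.html. BeautifulSoup accepts a string as well as a file object, so the text can be passed to it directly (shown as a comment so the snippet runs without bs4 installed).

```python
import os
import tempfile


def read_once(path):
    """Read the file a single time and return (full text, list of lines)."""
    with open(path, encoding="utf8") as f:
        text = f.read()
    # keepends=True keeps the trailing newline on each line, like readlines().
    lines = text.splitlines(keepends=True)
    return text, lines


# Hypothetical throwaway file standing in for event.html.
fd, path = tempfile.mkstemp(suffix=".html")
with os.fdopen(fd, "w", encoding="utf8") as tmp:
    tmp.write("<p>one</p>\n<p>two</p>\n")

text, lines = read_once(path)
os.remove(path)

# With bs4 installed, the same string can be parsed without touching the file:
# soup = bs4.BeautifulSoup(text, "lxml")
print(len(lines))  # prints: 2
```

Note that splitlines() recognises a few more line-boundary characters than readlines() does, so the two can differ on unusual input; for ordinary HTML with \n line endings they give the same list.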

  • From my tests I believe neither BeautifulSoup nor readlines() sets the pointer back. Whichever one runs second gets nothing: BS will crash the script, but readlines() will simply return empty and move on. Your f.seek(0) fixes this. Thanks! – greye May 15 '17 at 14:12
  • If BS does not set the pointer back, then `lines` should be an empty list when f.readlines is called. – user3381590 May 15 '17 at 14:13