I need to extract some brainfuck code from some HTML, and what I have been doing so far isn't working. The HTML looks like this:

<div class="style7" style="text-align: justify; overflow: auto;">
        <br />++++++++++[>++++++++++++>+++++++++++>++++++++++++>+++++++++++>++++++++++>++++++++++++>++++++++++>++++++++++>+++++++++++>+++++++++++>++++++++++>++++++++++++<<<<<<<<<<<<-]>-----.>++++.>---.>-.>+++.>+.>+++.>++.>+.>---.>-.>-----.<br /><br /><br />
</div>

I am using Python and BeautifulSoup. I can grab the div just fine from the whole document, but I can't seem to get the entirety of the brainfuck from between the <br /> tags.

How would I go about doing it? Thanks

EDIT:

After looking through what BeautifulSoup loads it seems to actually remove a large chunk of the code. The request content has it all there but the soup doesn't.

Would there be a better way to parse it besides BeautifulSoup? Maybe a regex on the original HTML?

thaweatherman
  • @MarcB, BeautifulSoup is a DOM parser. – Paul Draper Apr 13 '14 at 16:31
  • So you're using a regex to parse html? That is a bad idea in a lot of ways (regex cannot handle nested tags etc.) The reason that the DOM parser is unable to parse it is simply that that is not valid html. Failing to parse invalid html shouldn't come as a surprise at all. BeautifulSoup (or another DOM parser) is the best way to parse html, you just have to give it html (which that snippet is not) – Cedric Mamo Apr 15 '14 at 07:31

2 Answers

You mean like this?

from bs4 import BeautifulSoup

html = '''
<div class="style7" style="text-align: justify; overflow: auto;">
        <br />++++++++++[>++++++++++++>+++++++++++>++++++++++++>+++++++++++>++++++++++>++++++++++++>++++++++++>++++++++++>+++++++++++>+++++++++++>++++++++++>++++++++++++<<<<<<<<<<<<-]>-----.>++++.>---.>-.>+++.>+.>+++.>++.>+.>---.>-.>-----.<br /><br /><br />
</div>
'''

# Name the parser explicitly; with no argument bs4 picks whichever one is
# installed, and different parsers can treat the stray "<" characters differently.
soup = BeautifulSoup(html, 'html.parser')
div_tag = soup.find('div', attrs={'class': 'style7'})
# .text joins all text nodes inside the div; strip() drops surrounding whitespace.
print(div_tag.text.strip())
# Output as originally posted:
# u'++++++++++[>++++++++++++>+++++++++++>++++++++++++>+++++++++++>++++++++++>++++++++++++>++++++++++>++++++++++>+++++++++++>+++++++++++>++++++++++>++++++++++++<<<<<<<<<<<<-]>-----.>++++.>---.>-.>+++.>+.>+++.>++.>+.>---.>-.>-----.'
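
Whether the run of `<` characters survives seems to depend on the underlying parser, which may be why the comments below report it being stripped. For comparison, here is a small sketch that uses only the standard library's html.parser, which, as far as I can tell, passes a `<` that is not followed by a tag name through as plain text. `TextCollector` and the shortened `html` sample are illustrative names and data, not part of the original answer.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect every piece of text the parser reports, ignoring all tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for text between tags, including a stray "<" that the
        # parser decides is not the start of a tag.
        self.chunks.append(data)


# Shortened stand-in for the real page (assumption: same structure).
html = '''
<div class="style7" style="text-align: justify; overflow: auto;">
        <br />++[>+>+<<-]>.<br /><br /><br />
</div>
'''

collector = TextCollector()
collector.feed(html)
collector.close()

# Join the reported text pieces and drop the surrounding whitespace.
bf_code = ''.join(collector.chunks).strip()
print(bf_code)
```

This collects text from the whole snippet, so on a full page you would still need to narrow it down to the div you care about first.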
shaktimaan
  • @thaweatherman What is the expected output? `++++++++++[>++++++++++++>+++++++++++>++++++++++++>+++++++++++>++++++++++>++++++++++++>++++++++++>++++++++++>+++++++++++>+++++++++++>++++++++++>++++++++++++<<<<<<<<<<<<-]>-----.>++++.>---.>-.>+++.>+.>+++.>++.>+.>---.>-.>-----.` ? – allcaps Apr 13 '14 at 18:10
  • How can @shaktimaan's output have `<<<<<<<<<<<<-]>`? Here it's also stripped (interpreted as a tag). – allcaps Apr 13 '14 at 20:14

I noticed that when the HTML was loaded into a soup it removed a good chunk of the brainfuck code. This makes it impossible to get everything. If it did not do that then shaktimaan's solution would work.

Instead I took the raw response text from requests and used a regex to get the brainfuck code.

m = re.search(r'<br />[\[\]<>.,+-]+<br />', r.text)

This grabs it out of the page; then you just need to strip the leading and trailing <br /> from m.group(0) and it is good to go.
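
A minimal, self-contained sketch of this approach, assuming the page text is already available as a string. The `page` sample is a shortened stand-in for the real response, and `extract_brainfuck` is a hypothetical helper name. It uses a capture group, so the surrounding <br /> tags never end up in the result:

```python
import re

# Shortened stand-in for the response text (assumption: same structure as the real page).
page = '''<div class="style7" style="text-align: justify; overflow: auto;">
        <br />++++++++++[>++++++++++<-]>-----.<br /><br /><br />
</div>'''

# One or more of the eight brainfuck command characters between two <br /> tags.
# The parentheses capture only the commands, not the tags.
BF_PATTERN = re.compile(r'<br />([\[\]<>.,+-]+)<br />')


def extract_brainfuck(html_text):
    """Return the first run of brainfuck commands between <br /> tags, or None."""
    match = BF_PATTERN.search(html_text)
    return match.group(1) if match else None


code = extract_brainfuck(page)
print(code)
```

Returning None when nothing matches avoids an AttributeError on pages that contain no brainfuck at all.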

thaweatherman