
I'm learning Python and trying to make a web scraper without any 3rd-party libraries, so that the process isn't simplified for me and I know what I'm doing. I've looked through several online resources, but all of them have left me confused about certain things.

The HTML looks something like this:

<html>
<head>...</head>
<body>
    *lots of other <div> tags*
<div class = "want" style="font-family:verdana;font-size:12px;letter-spacing:normal"">
<form class ="subform">...</form>
<div class = "subdiv1" >...</div>
<div class = "subdiv2" >...</div>
    *lots of other <div> tags*
</body>
</html>

I want the scraper to extract the <div class = "want"...>*content*</div> and save that into an HTML file.

I have a very basic idea of how I need to go about this.

import urllib.request
#import re
#from html.parser import HTMLParser

response = urllib.request.urlopen("http://website.com")
html = response.read().decode("utf-8")  # read() returns bytes, so decode to a str

# Somehow extract the wanted data into `data`

with open('page.html', 'w', encoding='utf-8') as f:
    f.write(data)
Red
  • @ggorlen This was 10 years ago, when I was learning how to use Python, as the very first sentence explicitly states. My use case was to just retrieve some data in an HTML page, so I wouldn't be recreating all the bells and whistles BS provides, just the functionality of grabbing a specific element by means of some selector. – Red Feb 22 '23 at 04:25
  • Do you know if it's possible to scrape using only built-in libraries but to wait until the JS content is loaded? – Nermin May 01 '23 at 13:28

1 Answer


The standard library comes with a variety of Structured Markup Processing Tools, which you can use for parsing the HTML and then searching it to extract your div.

There are a whole lot of choices there. Which one do you use?

html.parser looks like the obvious choice, but I'd actually start with ElementTree instead. It has a very nice and very powerful API, there's plenty of documentation and sample code all over the web to get you started, and a lot of experts use it daily and can help you with your problems. If it turns out that etree can't parse your HTML, you will have to use something else… but try it first.
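For a sense of the difference, here's a rough, hand-rolled sketch of what the html.parser route could look like (a sketch only: it ignores void and self-closing tags and other edge cases, and assumes the page has exactly one <div class="want">):

from html.parser import HTMLParser

class WantDivExtractor(HTMLParser):
    """Collect the raw markup of the first <div class="want">...</div>."""

    def __init__(self):
        super().__init__()
        self.inside = False   # are we currently inside the target div?
        self.depth = 0        # <div> nesting depth inside the target
        self.pieces = []      # raw markup fragments collected so far

    def handle_starttag(self, tag, attrs):
        if self.inside:
            self.pieces.append(self.get_starttag_text())
            if tag == 'div':
                self.depth += 1
        elif tag == 'div' and dict(attrs).get('class') == 'want':
            self.inside = True
            self.depth = 1
            self.pieces.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.inside:
            self.pieces.append('</%s>' % tag)
            if tag == 'div':
                self.depth -= 1
                if self.depth == 0:
                    self.inside = False   # we've closed the target div

    def handle_data(self, data):
        if self.inside:
            self.pieces.append(data)

parser = WantDivExtractor()
parser.feed(html)                 # `html` is the page source you downloaded
data = ''.join(parser.pieces)

It works, but you end up maintaining all that parser state by hand; ElementTree does that bookkeeping for you.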

For example, with a few minor fixes to your snippet of HTML so that it's actually valid, and so there's actually some text worth getting out of your div:

<html>
<head>...</head>
<body>
    *lots of other <div /> tags*
<div class = "want" style="font-family:verdana;font-size:12px;letter-spacing:normal">spam spam spam
<form class ="subform">...</form>
<div class = "subdiv1" >...</div>
<div class = "subdiv2" >...</div>
    *lots of other <div /> tags*
</div>
</body>
</html>

You can use code like this (I'm assuming you know, or are willing to learn, XPath):

from xml.etree import ElementTree

tree = ElementTree.fromstring(page)   # `page` is the HTML source you downloaded
mydiv = tree.find('.//div[@class="want"]')

Now you've got a reference to the div with class "want". You can get its direct text (the text before its first child element) with this:

print(mydiv.text)

But if you want to extract the whole subtree, that's even easier:

data = ElementTree.tostring(mydiv)  # returns bytes by default; pass encoding='unicode' for a str

If you want to wrap that up in a valid <html> and <body> and/or remove the <div> itself, you'll have to do that part manually. The documentation explains how to build up elements using a simple tree API: you create a head and a body to put in the html, then stick the div in the body, then tostring the html, and that's about it.
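A minimal sketch of that wrapping step, assuming `mydiv` is the element found above:

from xml.etree import ElementTree

# Build an <html><head/><body>...</body></html> skeleton around the extracted div
root = ElementTree.Element('html')
ElementTree.SubElement(root, 'head')
body = ElementTree.SubElement(root, 'body')
body.append(mydiv)                    # move the extracted <div class="want"> into <body>

# encoding='unicode' makes tostring() return a str instead of bytes
with open('page.html', 'w', encoding='utf-8') as f:
    f.write(ElementTree.tostring(root, encoding='unicode'))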

abarnert
  • I forgot to mention before that I did once try XPath, but was stumped on how to use it, haha. Your code works flawlessly (as expected) with the above HTML example. However, when I tried this with the website I was trying to scrape, I got an `xml.etree.ElementTree.ParseError: not well-formed (invalid token)` error. Now I'm guessing it's because the website's HTML isn't valid. Guess this is pushing me to the soup side :P – Red Aug 10 '13 at 13:58
  • If you're not sure whether the site is valid, you may want to use an online HTML validator to check. (There are zillions; just search for one.) Also, even if it's valid, note that HTML (except for XHTML and the XML rendering of HTML5) isn't actually valid XML, so there's no assurance that an XML-parsing library like `ElementTree` will handle it. Practically speaking, most valid HTML 4.01 strict and HTML5, and a lot of 4.01 transitional, will work, but not all (and earlier versions are much less likely to work). – abarnert Aug 12 '13 at 18:24
  • So what do you recommend to be a more effective alternative method (: – Red Aug 13 '13 at 12:34
  • @Red: It depends on what the problem is. But installing and using `beautifulsoup4` is almost always the best answer for parsing potentially-invalid HTML or XML. You may also want to install `html5lib` and/or `lxml` (not to use them directly, but for BeautifulSoup to use). You're still writing the part you wanted to write, just using a smarter parser than the ones in the stdlib. – abarnert Aug 13 '13 at 18:55