8

All of the information I can find online is about writing web servers, but there seems to be very little about functions useful for web clients. Ideally, I would like the function to look something like this:

(website "http://www.google.com")

And return a string containing the entire web page, but I would be happy with anything that works.

Bill the Lizard
Alex V

1 Answer

10

Here's a simple program that looks like it does what you want:

#lang racket

(require net/url)

(port->bytes
 (get-pure-port (string->url "http://www.google.com")))
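
For the `(website "http://www.google.com")` shape asked for in the question, the same calls can be wrapped in a small function. This is only a sketch: `website` is the name from the question, not a library function, the redirect limit of 5 is an arbitrary choice, and `port->string` assumes the response body can be decoded as UTF-8 (use `port->bytes` if that is not a safe assumption).

```racket
#lang racket

(require net/url)

;; website : string -> string
;; Hypothetical helper named after the function in the question.
;; call/input-url opens the connection, hands the input port to
;; port->string, and closes the port when the handler returns.
(define (website url-string)
  (call/input-url (string->url url-string)
                  ;; follow up to 5 redirects (arbitrary limit)
                  (λ (u) (get-pure-port u #:redirections 5))
                  port->string))

(website "http://www.google.com")
```

Because `get-pure-port` drops the response headers, the string holds only the body of the page.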

If you're like me, you probably also want to parse it into an s-expression. Neil Van Dyke's neil/html-parsing does this:

#lang racket

(require (planet neil/html-parsing:2:0)
         net/url)

(html->xexp
 (get-pure-port (string->url "http://www.google.com")))

Note that since this program refers to a PLaneT package, running it for the first time will download and install the html-parsing package. Building the documentation could take quite a while. That's a one-time cost, though, and running the program again shouldn't take more than a few seconds.

EDIT: In 2023, this code still works fine, but PLaneT is no longer widely used. It would now be more idiomatic to install the html-parsing package, either with raco pkg install html-parsing or through DrRacket's File > Package Manager... menu, and then run:

#lang racket

(require html-parsing
         net/url)

(html->xexp
 (get-pure-port (string->url "http://www.google.com")))
John Clements
  • Should have clarified; if all you want is the raw text, you don't need the call to html->sxml, you can just use a (regexp-match #px#".*" ...) to suck the chars out of the pipe. – John Clements Dec 24 '12 at 04:26
  • 2
    `port->string` is probably what you'll see when pulling all the content from a port: http://docs.racket-lang.org/reference/port-lib.html?q=port-%3Estring#(def._%28%28lib._racket/port..rkt%29._port-~3estring%29%29 – dyoo Dec 24 '12 at 05:20
  • @JohnClements Perfect, thanks! I used port->string and it gave me the web page as plain text! – Alex V Dec 24 '12 at 07:33
  • Is there an obvious way built-in? It seems odd that there's no simple way without requiring a third-party library. – JasonFruit Dec 24 '12 at 16:20
  • @JasonFruit That was what I was hoping to find in this question. But as far as I can tell the answer is no. Every way to do this is either complex, non-portable, or requires a third party library. – Alex V Dec 25 '12 at 23:26
  • Bizarre. Sometimes I think Racket is like people my old preacher used to describe as "so heavenly-minded they're no earthly good." – JasonFruit Dec 26 '12 at 03:50
  • Huh? I'm confused. If all you want is the text, you can delete the reference to (planet neil/htmlprag), and it's all entirely built-in. Did I misunderstand you? – John Clements Dec 26 '12 at 04:30
  • @JohnClements Actually, I very much misunderstood you. However, do you think for the sake of posterity you can change the code in your answer to something like http://pastebin.com/ye7QivJQ so people who come to this answer will not get confused like me and Jason did? – Alex V Dec 27 '12 at 01:22
  • 1
    Sure thing; let me know what you think. Also, I used port->bytes; I'm guessing that the relevant RFCs are specified using bytes rather than utf-8. – John Clements Dec 27 '12 at 06:00