
I'm trying to scrape a few sites that require unicode support. For example, I'm trying to get the title of this book, but it returns jumbled characters:

(-> "http://www.brill.nl/publications/evliya-celebis-book-travels"
    java.net.URL.
    enlive/html-resource
    (enlive/select [:h1#page-title])
    first
    :content)

Scraping an Arabic site returns ?????? all over the place:

(enlive/html-resource (java.net.URL. "http://www.aljazeera.net/portal"))

I'm not sure how I'm supposed to activate unicode support.

pooya72

2 Answers


Enlive does have unicode support because it uses Java strings. I ran your first example on my computer and got this result:

(Evliyā Çelebi's Book of Travels)

Perhaps the font you are using doesn't have glyphs for the code points you are trying to display?

Andrew
  • I'm on a Mac using Deja Vu Sans Mono in the standard Terminal with Unicode (UTF-8). I even tried it with iTerm2. All it returns is ("Evliy? ?elebi's Book of Travels"). **BUT** if I spit it out to a file like "titles.txt", the title displays correctly. **Yet** if I copy and paste the title back into the REPL and run (.codePointAt "Evliyā Çelebi's Book of Travels" 5) I get 402, whereas the same code on the string returned by enlive gives 257. (char 402) and (char 257) both return ?. What are you using? – pooya72 May 17 '12 at 23:01
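The two code points in the comment above can be checked without any scraping. The sketch below is an illustration only: it builds the title from Unicode escapes so the result does not depend on how the REPL decodes pasted text, prints the JVM's default encoding, and prints the title through an explicitly UTF-8 writer. The name `title` is made up for this example. Code point 257 is ā (U+0101), so the string enlive returned looks correct. Code point 402 is ƒ (U+0192), which would be consistent with the pasted UTF-8 bytes being decoded as MacRoman, though that is an assumption about this particular setup.

```clojure
;; Hypothetical name `title`; escapes avoid any dependence on REPL input decoding.
(def title "Evliy\u0101 \u00C7elebi's Book of Travels")

;; Index 5 holds U+0101 (a with macron), i.e. code point 257.
(println (.codePointAt title 5))

;; The default encoding the JVM was started with; a non-UTF-8 value here
;; would explain the ? characters in console output.
(println (System/getProperty "file.encoding"))

;; Print through a writer with an explicit encoding instead of the default.
(binding [*out* (java.io.OutputStreamWriter. System/out "UTF-8")]
  (println title)
  (flush))
```

If the last line displays correctly while a plain `(println title)` does not, the data is fine and only the console encoding is off.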

Christophe Grand, the author of Enlive, was kind enough to reply on the Enlive mailing list. His suggestion was quite informative, so I have copied the email below:

Hello,

Enlive is not (and does not include) a full-featured HTTP agent. When you pass a java.net.URL to html-resource, it calls .getContent on it, gets an InputStream, and then assumes UTF-8. However, if you know the actual encoding, you can do:

(-> "http://www.brill.nl/publications/evliya-celebis-book-travels"
    java.net.URL.
    .getContent
    (java.io.InputStreamReader. "ENCODING GOES HERE")
    enlive/html-resource
    (enlive/select [:h1#page-title])
    first
    :content)
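To see why the encoding passed to InputStreamReader matters, here is a small self-contained sketch that needs neither Enlive nor a network connection. It decodes the same bytes twice, once with the encoding they were written in and once with UTF-8. The helper `decode-stream` and the sample bytes are assumptions made up for this illustration.

```clojure
(import '(java.io ByteArrayInputStream InputStreamReader))

(defn decode-stream
  "Reads every character from `in`, decoding its bytes with `encoding`."
  [^java.io.InputStream in ^String encoding]
  (slurp (InputStreamReader. in encoding)))

;; Stand-in for a page served as ISO-8859-1, where C-cedilla is the single byte 0xC7.
(def page-bytes (.getBytes "\u00C7elebi" "ISO-8859-1"))

;; Decoded with the encoding the bytes were actually written in.
(def right (decode-stream (ByteArrayInputStream. page-bytes) "ISO-8859-1"))

;; Decoded as UTF-8: 0xC7 followed by an ASCII byte is not valid UTF-8,
;; so the first character is not recovered.
(def wrong (decode-stream (ByteArrayInputStream. page-bytes) "UTF-8"))

(println (= right "\u00C7elebi"))
(println (= right wrong))
```

The same wrapping applies to the stream returned by .getContent on a URL.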

Or use an agent library which will detect the correct encoding and pass the resulting Reader to html-resource.

hth,

Christophe
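For the second suggestion, one possible approach without an extra library is to read the charset from the HTTP Content-Type header and fall back to UTF-8. This is only a rough sketch: `charset-from-content-type` and `url->reader` are hypothetical helpers, not part of Enlive, and the sketch ignores any encoding declared in a `<meta>` tag inside the page itself.

```clojure
(defn charset-from-content-type
  "Returns the charset named in a Content-Type header value, or `default`
  when the header is missing or names no charset."
  [content-type default]
  (or (when content-type
        (second (re-find #"(?i)charset=\"?([^\";\s]+)" content-type)))
      default))

(defn url->reader
  "Opens `url` and returns a Reader that decodes the response body with the
  charset declared in the Content-Type header (UTF-8 if none is declared)."
  [^String url]
  (let [^java.net.URLConnection conn (.openConnection (java.net.URL. url))
        ^String enc (charset-from-content-type (.getContentType conn) "UTF-8")]
    (java.io.InputStreamReader. (.getInputStream conn) enc)))

;; Usage, assuming Enlive is required under the alias `enlive`:
;; (enlive/html-resource (url->reader "http://www.aljazeera.net/portal"))
```

A dedicated HTTP client library will usually handle more cases, such as redirects and missing headers.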

pooya72