5

I'm trying to fetch an ISO-8859-1 encoded page by clicking on a link, so the code looks like this:

page_result = page.link_with( :text => 'link_text' ).click

So far I get the result with a wrong encoding, so I see characters like:

'T�tulo:' instead of 'Título:'

I've tried several approaches, including:

  • Stating the encoding in the first request using the agent like:

    @page_search = @agent.get(
      :url => 'http://www.server.com',
      :headers => { 'Accept-Charset' => 'ISO-8859-1' } )
    
  • Stating the encoding for the page itself

      page_result.encoding = 'ISO-8859-1'
    

But I must be doing something wrong: a simple puts always shows the wrong characters.

Do you know how to state the encoding?

Thanks in advance,

Added: Executable example:

require 'rubygems'
require 'mechanize'

WWW::Mechanize::Util::CODE_DIC[:SJIS] = "ISO-8859-1"

@agent = WWW::Mechanize.new

@page = @agent.get(
  :url => 'http://www.mcu.es/webISBN/tituloSimpleFilter.do?cache=init&layout=busquedaisbn&language=es',
  :headers => { 'Accept-Charset' => 'utf-8' } )

puts @page.body
Juan

4 Answers

10

You can just set the encoding on the agent's current page:

agent.page.encoding = 'utf-8'

Hope it helps!

Niels Kristian
4

The previous answer is correct, but in my code it looks slightly different:

agent = Mechanize.new

page = agent.get('http://example.com')

page.encoding = 'windows-1251'

page.search('p').each do |para|
  puts para.text
end
denis.peplin
1

Sorry, it was my mistake: I come from a Java background, where strings are converted to UTF-16 internally. I forgot that Ruby doesn't do that. Mechanize was retrieving the page flawlessly, but I needed to convert the data via iconv.

Mental note: Ruby stores strings without converting their encoding.
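
A minimal sketch of that kind of conversion, assuming the body arrives as raw ISO-8859-1 bytes. It uses `String#force_encoding` and `String#encode` (available since Ruby 1.9) in place of iconv, which was removed from the standard library in Ruby 2.0. The helper name `latin1_to_utf8` and the sample bytes are illustrative, not part of Mechanize.

```ruby
# Sample bytes as an ISO-8859-1 server would send them: 0xED is "í" in Latin-1.
raw_body = "T\xEDtulo:".b

# Hypothetical helper: label the bytes with their real encoding,
# then transcode them to UTF-8.
def latin1_to_utf8(bytes)
  bytes.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
end

title = latin1_to_utf8(raw_body)
puts title
```

`force_encoding` only changes the label on the string; `encode` is the step that actually rewrites the bytes.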

Juan
  • you might also wanna try ruby 1.9 if possible, they added a whole lot of [unicode stuff](http://blog.nuclearsquid.com/writings/ruby-1-9-encodings) – Marc Seeger Dec 15 '09 at 08:36
0

Yeah, Mechanize will try to detect the encoding itself (using NKF, from the Ruby standard library, to guess it) and sometimes fails.

Maybe this might help:
WWW::Mechanize::Util::CODE_DIC[:SJIS] = "ISO-8859-1"

I'm not too sure about the exact syntax, but I think the CODE_DIC Hash might be a good place to look :)
I had a similar problem a while back.
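
If detection does guess wrong, one workaround is to check the raw bytes yourself. This is a rough sketch in plain Ruby (1.9 or later), independent of Mechanize: the helper `to_utf8` is hypothetical, and ISO-8859-1 as the fallback is an assumption based on the page in the question.

```ruby
# Hypothetical helper: keep the bytes if they are already valid UTF-8,
# otherwise assume the fallback encoding and transcode to UTF-8.
def to_utf8(bytes, fallback = Encoding::ISO_8859_1)
  candidate = bytes.dup.force_encoding(Encoding::UTF_8)
  return candidate if candidate.valid_encoding?

  bytes.dup.force_encoding(fallback).encode(Encoding::UTF_8)
end

latin1_bytes = "T\xEDtulo:".b      # "í" as the single Latin-1 byte 0xED
utf8_bytes   = "T\xC3\xADtulo:".b  # "í" as the two-byte UTF-8 sequence

puts to_utf8(latin1_bytes)
puts to_utf8(utf8_bytes)
```

This is only a heuristic: text in some other single-byte encoding would also be converted without error, just to the wrong characters.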

Marc Seeger