I'm trying to use open-uri to scrape the data from:

https://www.zomato.com/grande-lisboa/fu-hao-massamá

But the website automatically redirects to:

https://www.zomato.com/pt/grande-lisboa/fu-hao-massamá

I don't want the Portuguese version. I want the English one. How do I tell Ruby to stop doing that?

Alan W. Smith
leonidus
  • 1
    Welcome to Stack Overflow. It looks like you need help asking questions. When you ask a question like this, we need to see a minimal example of your code, because that is worth a whole lot more than a description alone. Also, your title says you're using Net::HTTP, but the text says you're using OpenURI to "scrape" the page. Neither Net::HTTP nor OpenURI does scraping, but they do allow you to get a page. Clearly and consistently explaining what you are doing and showing an example will help you get good answers and up votes. Not doing that will get you down votes or your question closed. – the Tin Man Feb 05 '15 at 18:35

1 Answer

This is called content negotiation: the web server redirects based on your request headers. pt (Portuguese) seems to be the default, at least from my location:

$ curl -I https://www.zomato.com/grande-lisboa/fu-hao-massam%C3%A1
HTTP/1.1 301 Moved Permanently
Set-Cookie: zl=pt; ...
Location: https://www.zomato.com/pt/grande-lisboa/fu-hao-massam%C3%A1

You can request another language by sending an Accept-Language header. Here's the response for Accept-Language: es (Spanish):

$ curl -I https://www.zomato.com/grande-lisboa/fu-hao-massam%C3%A1 -H "Accept-Language: es"
HTTP/1.1 301 Moved Permanently
Set-Cookie: zl=es_cl; ...
Location: https://www.zomato.com/es/grande-lisboa/fu-hao-massam%C3%A1

And here's the response for Accept-Language: en (English):

$ curl -I https://www.zomato.com/grande-lisboa/fu-hao-massam%C3%A1 -H "Accept-Language: en"
HTTP/1.1 200 OK
Set-Cookie: zl=en; ...

This seems to be the resource you've been looking for.

In Ruby you'd use:

require 'nokogiri'
require 'open-uri'

url = 'https://www.zomato.com/grande-lisboa/fu-hao-massam%C3%A1'
headers = {'Accept-Language' => 'en'}

# URI.open is needed on Ruby 3.0+, where the bare Kernel#open no longer accepts URLs
doc = Nokogiri::HTML(URI.open(url, headers))
doc.at('html')[:lang]
#=> "en"
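
If you fetch the page with Net::HTTP instead of open-uri, the same header can be set on a plain GET request; a POST is not required for that. Below is a minimal sketch, assuming a reasonably recent Ruby (2.0 or later). The helper names `build_localized_get` and `fetch_localized` are made up for this example and are not part of Net::HTTP:

```ruby
require 'net/http'
require 'uri'

# Build a GET request that asks the server for a given language.
# (Hypothetical helper, not a Net::HTTP method; it does no network I/O.)
def build_localized_get(url, language = 'en')
  uri = URI(url)
  request = Net::HTTP::Get.new(uri)
  request['Accept-Language'] = language
  request
end

# Send the request and return the Net::HTTPResponse.
# (Hypothetical helper; this one does hit the network.)
def fetch_localized(url, language = 'en')
  uri = URI(url)
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.request(build_localized_get(uri, language))
  end
end
```

The response body could then be handed to Nokogiri in the same way as above, for example `Nokogiri::HTML(fetch_localized(url).body)`. Note that Net::HTTP does not follow redirects on its own, so a 301 would come back as-is rather than being resolved silently.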
Stefan
  • How can I use it with a Net::HTTP.get() request? I'm using Nokogiri to parse the HTML and net/http to fetch it. – leonidus Feb 04 '15 at 11:34
  • Never mind, I got it, I had to use post method to use headers. – leonidus Feb 04 '15 at 11:41
  • 2
    You said in your question that you're *"using open-uri"*. However, the documentation for `Net::HTTP` contains a section [Setting Headers](http://ruby-doc.org/stdlib-2.2.0/libdoc/net/http/rdoc/Net/HTTP.html#class-Net::HTTP-label-Setting+Headers) that you've probably overlooked. – Stefan Feb 04 '15 at 11:42