
The book "Instant Nokogiri" and the Packt Hub Nokogiri page include an example application that sets a User-Agent to spoof a browser while crawling the New York Times website for the top story.

I am working through this book. The code is a little dated, so I updated it.

My version of the code is:

require 'open-uri'
require 'nokogiri'
require 'sinatra'

browser = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_4)
AppleWebKit/536.30.1 (KHTML, like Gecko) Version/6.0.5 Safari/536.30.1'

doc = Nokogiri::HTML(open ('http://nytimes.com', browser))

nyt_headline = doc.at_css('h2 span').content

nyt_url = "http://nytimes.com" + doc.at_css('.css-16ugw5f a')[:href]


html = "<h1>Nokogiri News Service</h1>"
html += "<h2>Top Story: <a href=\"#{nyt_url}\">#{nyt_headline}</a></h2>"

get '/' do
    html
end

I am running this through a terminal session on Mac OS and getting this error:

invalid access mode Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_4) (ArgumentError)
AppleWebKit/536.30.1 (KHTML, like Gecko) Version/6.0.5 Safari/536.30.1 (URI::HTTP resource is read only.)

I don't believe I am attempting to 'write'. Not sure why a 'read only' error would block this from running. It was working before I added the User Agent info.

the Tin Man
MetaG
  • 2
    According to the [docs](https://ruby-doc.org/stdlib-2.6.1/libdoc/open-uri/rdoc/OpenURI.html), you should specify that the value is the User-Agent, e.g. `Nokogiri::HTML(open('http://nytimes.com', "User-Agent" => browser))`. However, I am not sure about that tutorial, since this has been the same since [1.8.7](https://ruby-doc.org/stdlib-1.8.7/libdoc/open-uri/rdoc/OpenURI.html), circa June 1st, 2008. – engineersmnky Feb 05 '20 at 19:49
  • 1
    The internet is a vast wasteland of trash-heaps of old information. It's really important to check the date the information was published, and if it's not current be very wary. Always start with the official documentation as it should be the most recent, and then work backwards. – the Tin Man Feb 05 '20 at 19:51
  • 1
    @engineersmnky, it's in 1.8.6 too. – the Tin Man Feb 05 '20 at 19:53
  • 2
    As a word of caution, scraping sites is usually a violation of their TOS. In the wild days of the Internet it was common, but it is frowned upon these days. Instead you're expected to rely on an API provided by the source. APIs are so easily created and are much faster and less error-prone. There are occasions when we need scrapers; I've written hundreds of them because of working with a company that did analytics and such as a service for big corporations, but in general search for the API and use it. It helps keep you from being banned and lowers the CPU and network load for both of you. – the Tin Man Feb 05 '20 at 20:07
  • 2
    Finally, tying your Sinatra code to another site's response time is not a good practice. If the NYT server is slow, your site will be slow. Instead, you should have a secondary script that runs periodically and checks to see if that page has updated, and, if so, retrieves and parses it and updates a little backing database of the necessary information, which your Sinatra code then accesses to serve the information. Don't use threads, just write a simple `cron`-based script or start the script when the server starts, have it check, then sleep an hour, check for a change, then sleep. – the Tin Man Feb 05 '20 at 20:18
  • According to my copy of Instant Nokogiri and the website cited above, the info is from 2013 and uses 1.8.7. If that information is inaccurate, I am not yet sure how to tell. The book does say that web scraping is not preferred and should only be done when there is no API. I will scrap this and search for a tutorial on the API method you described above. Thank you @theTinMan. – MetaG Feb 05 '20 at 22:51
  • 1
    Even two years can be a long time in the life of languages and services. In seven years Ruby has had some significant improvements and additions, and that breaks many books and tutorials badly. JSON feeds are _very_ common and easy to work with. XML and/or RSS are also common and not quite as friendly, but Nokogiri can ease that pain and there might be pre-built wheels. See https://developer.nytimes.com/ for their API information. – the Tin Man Feb 05 '20 at 23:01

1 Answer


See OpenURI's open documentation:

URI.open("http://www.ruby-lang.org/en/",
  "User-Agent" => "Ruby/#{RUBY_VERSION}",
  "From" => "foo@bar.invalid",
  "Referer" => "http://www.ruby-lang.org/") {|f|
  # ...
}

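The options are a Hash. You're passing a String, and OpenURI treats a String in that position as a file access mode (like `'r'`). Anything other than a read mode is rejected, which is why the error says "invalid access mode" and "resource is read only". Pass the header as a Hash entry instead: `"User-Agent" => browser`.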
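
Applied to the code in the question, the fix might look like the sketch below. It is a minimal illustration, not a tested drop-in: `BROWSER`, `user_agent_header`, and `fetch_page` are hypothetical names introduced here, and the only OpenURI call used is `URI.open`. The sketch also writes the User-Agent on one logical line, on the assumption that a header value containing a line break (as in the question's string) is likely to be rejected by the HTTP layer.

```ruby
require 'open-uri'

# The User-Agent from the question, joined into a single line so the
# header value contains no line break.
BROWSER = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_4) ' \
          'AppleWebKit/536.30.1 (KHTML, like Gecko) Version/6.0.5 Safari/536.30.1'

# Hypothetical helper (not part of OpenURI): collapse stray whitespace and
# wrap the value in the options Hash that OpenURI expects.
def user_agent_header(user_agent)
  { 'User-Agent' => user_agent.split.join(' ') }
end

# Hypothetical helper: fetch a page body using the given User-Agent.
# URI.open is used because Kernel#open no longer opens URLs in Ruby 3.
def fetch_page(url, user_agent = BROWSER)
  URI.open(url, user_agent_header(user_agent), &:read)
end

# Usage (performs a network request, so it is left commented out):
# html = fetch_page('https://www.nytimes.com')
# doc  = Nokogiri::HTML(html)
```

The returned string can be handed to `Nokogiri::HTML` exactly as in the question. The CSS selectors there depend on the site's current markup and may need updating.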

the Tin Man