Web page scraped with Nokogiri returns no data

Question

I am trying to scrape a project list from the British government's UK Oil Portal, but my code returns no data. I want to build an array of project titles.

class Entry
  def initialize(title)
    @title = title
  end
  attr_reader :title
end

def index
  @projects=Project.all
  require 'open-uri'
  require 'nokogiri'
  doc = Nokogiri::HTML(open("https://itportal.decc.gov.uk/pathfinder/currentprojectsindex.html"))

  entries = doc.css('.operator-container')
  @entries = []
  entries.each do |row|
    title = row.css('.setoutForm').text
    @entries << Entry.new(title)
  end
end

In `entries = doc.css('.operator-container')` you need to specify the content you want from the HTML. For example, to select a div that has the class container, the selector should be `doc.css('div.container')`. — Tushar Pal, Jun 26 '17 at 11:55
@fred if you solved this yourself, please consider posting your own answer — dcorking, Jun 28 '17 at 15:10
see also https://stackoverflow.com/questions/25436818/nokogiri-scraping-misses-html — dcorking, Jun 28 '17 at 15:22
Please read "[mcve]". When asking we need to see the minimum HTML necessary to demonstrate the problem _in the question itself_, along with the minimum code that uses that data that demonstrates the problem. Asking us to go to off-site links and parse through it wastes our time that could be used to help others, so help us help you. — the Tin Man, Jul 03 '17 at 20:34
Why did you override my edit? As @theTinMan said, it is important that the example should be complete, so please include samples of the HTML source that you scraped. — dcorking, Jul 18 '17 at 10:19
It's improper to add or modify code and, by extension, the data necessary to make the question sensible. At the same time, it's a requirement that we have that necessary data when asking about a problem involving the code. "[mcve]". Without that data the question is off-topic, so simply downvote it and vote to close. — the Tin Man, Jul 18 '17 at 21:02
@theTinMan https://meta.stackoverflow.com/questions/314190/when-to-add-code-to-a-users-question — dcorking, Jul 23 '17 at 10:09

score 4 · Accepted Answer · edited Jul 03 '17 at 20:27

The link you posted contains no data. The page you see is a frameset, with each frame created by its own URL. You want to parse the left frame, so you should edit your code to open the URL of the left frame:

  # On Ruby 3.0+, use URI.open from open-uri; Kernel#open no longer fetches URLs.
  doc = Nokogiri::HTML(URI.open('https://itportal.decc.gov.uk/eng/fox/path/PATH_REPORTS/current-projects-index'))

The individual projects are on separate pages, and you need to open each one. For example, the first one is:

# URI.open comes from open-uri; plain open no longer fetches URLs on Ruby 3.0+.
project_file = URI.open(entries.first.css('a').attribute('href').value)
project_doc = Nokogiri::HTML(project_file)

Selecting the "setoutForm" class returns a lot of text. For example:

> project_doc.css('.setoutForm').text
=> "\n            \n              Field Type\n              Location\n              Water De
pth (m)\n              First Production\n              Contact\n            \n            \n
              Oil\n              2/15\n              155m\n              Q3/2018\n          
    \n                John Gill\n                Business Development Manager\n             
   jgill@alphapetroleum.com\n                01483 307204\n              \n            \n   
       \n            \n              Project Summary\n            \n            \n          
    \n                The Cheviot discovery is located in blocks 2/10a, 2/15a and 3/11b. \n 
               \n                Reserves are approximately 46mmbbls oil.\n                \
n                A Field Development Plan has been submitted and technically approved. The c
oncept is for a leased FPSA with 18+ subsea wells. Oil export will be via tanker offloading.
\n                \n              \n            \n          "

However the title is not in that text. If you want the title, scrape this part of the page:

<div class="field-header" foxid="eu1KcH_d4qniAjiN">Cheviot</div>

Which you could do with this CSS selector:

> project_doc.css('.operator-container .field-header').text
=> "Cheviot"

Write this code step by step. It is hard to find out where your code goes wrong unless you single-step it. For example, I used Nokogiri's command line tool to open an interactive Ruby shell with:

nokogiri https://itportal.decc.gov.uk/eng/fox/path/PATH_REPORTS/current-projects-index

@fred If my answer helped, and is the quality you expect here, please consider up-voting and accepting it. — dcorking, Jun 27 '17 at 10:39
Don't use `css` and `text` unless you understand what is going to happen. Instead, use one of the `at` variants with `text` or `map(&:text)` on `search`, `css`, or `xpath`. See https://stackoverflow.com/q/43594656/128421. — the Tin Man, Jul 03 '17 at 20:31
