
I am trying to scrape a site, but the results returned for the links are different from what I see when I inspect the page in my browser.

In my browser I get normal links, but all the <a> href links become javascript:void(0); in Nokogiri.

Here is the site:

https://www.ctgoodjobs.hk/jobs/part-time

Here is my code:

url = "https://www.ctgoodjobs.hk/jobs/part-time"
response = open(url) rescue nil
next unless response
doc = Nokogiri::HTML(open(url))
links = doc.search('.job-title > a').text
the Tin Man

Kelvin
  • Please read "[mcve]". When asking about code you wrote, we expect the minimal code, and minimal input data, that demonstrates the problem, in the question itself. Not doing that forces us to work from huge HTML files and separate them into the usable and important parts, wasting our time, which slows our ability to help you and potentially distracts from the real problem. – the Tin Man Nov 09 '16 at 17:51

2 Answers


It's not that easy: the URLs are "obscured" by a JavaScript function, which is why you get javascript:void(0); when asking for the hrefs. Looking at the HTML, there are hidden inputs for each link, and there is a preview URL you can use to build each job's preview URL (if that's what you're looking for). Each result looks like this:

<div class="result-list-job current-view">
  <input type="hidden" name="job_id" value="04375145">
  <input type="hidden" name="each_job_title_url" value="barista-senior-barista-咖啡調配員">
  <h2 class="job-title"><a href="javascript:void(0);">Barista/ Senior Barista 咖 啡 調 配 員</a></h2>
  <h3 class="job-company"><a href="/company-jobs/pacific-coffee-company/00028652" target="_blank">PACIFIC COFFEE CO. LTD.</a></h3>
  <div class="job-description">
    <ul class="job-desc-list clearfix">
      <li class="job-desc-loc job-desc-small-icon">-</li>
      <li class="job-desc-work-exp">0-1 yr(s)</li>
      <li class="job-desc-salary job-desc-small-icon">-</li>
      <li class="job-desc-post-date">09/11/16</li>
    </ul>
  </div>
  <a class="job-save-btn" title="save this job" style="display: inline;"> </a>
  <div class="job-batch-apply"><span class="checkbox" style="background-position: 0px 0px;"></span><input type="checkbox" class="styled" name="job_checkbox" value="04375145"></div>
  <div class="job-cat job-cat-de"></div>
</div>

Then you can retrieve each job_id from those inputs, like:

inputs = doc.search('//input[@name="job_id"]')

and then build the URLs (I found the base URL in joblist_preview.js):

urls = inputs.map do |input|
  "https://www.ctgoodjobs.hk/english/jobdetails/details.asp?m_jobid=#{input['value']}&joblistmode=previewlist&ga_channel=ct"
end
mr_sudaca

Compare the output of a browser with that of a tool like wget, curl, or Nokogiri and you will find that the HTML the browser presents can differ drastically from the raw HTML.

Browsers these days process DHTML; Nokogiri doesn't. You can only retrieve the raw HTML using something that lets you see the content without the browser, like the above-mentioned tools, then compare that with what you see in a text editor or what Nokogiri shows you. Don't trust the browser - browsers are known to lie because they want to make you happy.

Here's a quick glimpse into what the raw HTML contains, generated using:

$ nokogiri "https://www.ctgoodjobs.hk/jobs/part-time"

Nokogiri dropped me into IRB:

Your document is stored in @doc...
Welcome to NOKOGIRI. You are using ruby 2.3.1p112 (2016-04-26 revision 54768) [x86_64-darwin15]. Have fun ;)

Counting the hits found by the selector returns:

>> @doc.search('.job-title > a').size
30

Displaying the text found shows:

>> @doc.search('.job-title > a').map(&:text)
[
  [ 0] "嬰 兒 奶 粉 沖 調 機 - 兼 職 產 品 推 廣 員 Part Time Promoter (時 薪 高 達 HK$90, 另 設 銷 售 佣 金 )",
...
  [29] "Customer Services Representative (Part-time)"
]

Looking at the actual href:

>> @doc.search('.job-title > a').map{ |n| n['href'] }
[
  [ 0] "javascript:void(0);",
...
  [29] "javascript:void(0);"
]

You can tell the HTML doesn't contain anything beyond what Nokogiri is telling you, so the browser is post-processing the HTML: it runs the DHTML and modifies the page before you see it. The short fix: don't trust the browser if you want to know what the server actually sends you.

This is why scraping isn't very reliable and you should use an API if at all possible. If you can't, then you're going to have to roll up your sleeves and dig into the JavaScript and manually interpret what it's doing, then retrieve the data and parse it into something useful.

Your code can be cleaned up and simplified. I'd write it much more simply as:

url = "https://www.ctgoodjobs.hk/jobs/part-time"
doc = Nokogiri::HTML(open(url))
links = doc.search('.job-title > a').map(&:text)

The use of search(...).text is a big mistake. text, when applied to a NodeSet, concatenates the text of every contained node, making it extremely difficult to recover the individual strings. Consider this:

require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<html>
  <body>
    <p>foo</p>
    <p>bar</p>
  </body>
</html>
EOT

doc.search('p').class # => Nokogiri::XML::NodeSet

doc.search('p').text # => "foobar"
doc.search('p').map(&:text) # => ["foo", "bar"]

The first result foobar would require being split apart to be useful, and unless you have special knowledge of the content, trying to figure out how to do it will be a major pain.

Instead, use map to iterate through the elements and apply &:text to each one, returning an array of each element's text.

See "How to avoid joining all text from Nodes when scraping" and "Taking apart a DHTML page" also.

the Tin Man