-2

This code works on some pages, such as klix.ba, but I can't figure out why it doesn't work on others.

There is no error message to explain what went wrong.

If `puts page` works, which means I can fetch and parse the page, why can't I get individual elements?

require 'nokogiri'
require 'open-uri'


url = 'http://www.olx.ba/'

user_agent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7"


page = Nokogiri::XML(open(url,'User-Agent' => user_agent), nil, "UTF-8")

# puts page   # this line works

puts page.xpath('a')
Nemilenko
  • You are parsing the page as XML. Why not parse it as HTML with `Nokogiri::HTML(open(url))`? – Cyzanfar Jan 19 '16 at 17:16
  • Welcome to Stack Overflow. Please read "[ask]" and "[mcve]". We need a better idea of the problem. What have you tried when debugging? What sites work, and what don't? – the Tin Man Jan 19 '16 at 21:38
  • Sorry, I didn't know what else to write. It was a strange problem without any error messages, as I mentioned above. It works just fine on one page but fails on another. Then @Phil M mentioned that calling XML was probably causing the problem, and he was right. – Nemilenko Jan 19 '16 at 22:54

2 Answers

1

First of all, why are you parsing it as XML? Since the page is an HTML document, the following should be correct:

page = Nokogiri::HTML(open(url,'User-Agent' => user_agent), nil, "UTF-8")

Furthermore, if you want to strip out all the links (a-tags), this is how:

page.css('a').each do |element|
   puts element
end
Philipp Meissner
  • That `each` block won't strip out links. It only iterates over them and prints them. You should change the wording or add the code to actually strip them. But, why even mention it as that wasn't part of the question. – the Tin Man Jan 19 '16 at 21:40
  • Hi. I understood the OP as wanting to get all the a-tags of a certain website (`puts page.xpath('a')`). That's why I showed how to address an a-tag through CSS (`page.css('a')`), which will give him all a-elements. To output each element (yes, the entire element, as in `Bar`) I used the `.each` loop. Hope that answers the question :) – Philipp Meissner Jan 20 '16 at 16:15
0

If you want to parse content from a web page, you need to do this:

require 'nokogiri'
require 'open-uri'


url = 'http://www.olx.ba/'

user_agent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7"


page = Nokogiri::HTML(open(url,'User-Agent' => user_agent), nil, "UTF-8")

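# puts page   # this line works

# '//a' matches <a> elements anywhere in the document;
# a bare 'a' only matches direct children of the document node.
puts page.xpath('//a')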

Take a look at the Nokogiri documentation.

One thing I would suggest is to set a debugger breakpoint in your code (probably right after assigning `page`). Look at the Pry-debugger gem.

So I would do something like this:

require 'nokogiri'
require 'open-uri'
require 'pry' # provides binding.pry


url = 'http://www.olx.ba/'

user_agent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7"


page = Nokogiri::HTML(open(url,'User-Agent' => user_agent), nil, "UTF-8")
binding.pry # pause execution here (breakpoint) so you can inspect page

# puts page   # this line works

# '//a' matches <a> elements anywhere in the document
puts page.xpath('//a')
Cyzanfar