I have been trying to do web-scraping using Nokogiri.

I want to get content that is loaded some time after the page itself, possibly by JavaScript. I have tried using sleep, but I don't know where I am going wrong.

Here is the snippet:

require 'nokogiri'
require "open-uri"
require 'json'

url = 'https://www.instagram.com/someuser/'
file = Nokogiri::HTML(open(url))
sleep 600
puts file
data = JSON.parse file
links = file.css('div.v1Nh3 a')
puts links

I am not getting any links.

the Tin Man
  • Nokogiri is not a JavaScript parser; it only works with HTML, XML, or XML-derived data. Use `wget`, `curl` or `nokogiri` at the command line to see the actual source that is being parsed. `sleep` won't help; you have to use one of the Watir-based tools or something similar. – the Tin Man Jul 05 '20 at 17:27
  • Also see https://stackoverflow.com/a/19714636/128421, which describes the process of cherry-picking data from DHTML. Sometimes that's all that is needed. – the Tin Man Jul 05 '20 at 17:33
  • Also, please take the [tour] and read "[ask]" and its linked pages. Grammar is important on SO. I'd recommend running a grammar checker as you write your question. – the Tin Man Jul 05 '20 at 17:42
  • Sure, I will look into it. Also, can you help me with one more question: how can I request all of the data from Instagram, not just 10, using Nokogiri? – Kirti Poddar Jul 06 '20 at 05:21
  • Instagram probably has an API, so, if they do, use it and don't try to scrape pages. Scraping is very error-prone and is VERY old school. An API is clean, efficient, and how you should do it. – the Tin Man Jul 16 '20 at 21:50

1 Answer

The content you are looking for is most likely loaded by JavaScript (AJAX) after the initial page load, and Nokogiri can't handle that: it only parses the HTML it is given and does not run JavaScript.

You should look at the "Watir" gem: use it to open the URL in a browser, then pass the rendered HTML to Nokogiri for parsing.

askprod
  • Isn't it possible with only Nokogiri, using sleep? – Kirti Poddar Jul 05 '20 at 15:10
  • No, it's not. The data doesn't exist in the initial HTML. The browser loads the page, runs the JavaScript, and then makes a second request to the server for that payload. `sleep` only pauses the script; it has nothing to do with parsing. I'd suggest studying how AJAX works: https://en.wikipedia.org/wiki/Ajax_(programming). Watir tells the browser to load the page, the browser processes the JavaScript, and then Watir asks the browser for the HTML of the page containing the final rendered information. That content can vary wildly from the initial HTML. – the Tin Man Jul 05 '20 at 17:36