I'm writing a website scraper. The site uses a form to change the current page.

This is how I'm submitting the form as a POST request, but it seems to fetch the same page over and over again.

Here is some sample code:

pages = {
 "total_pages" => 19,
 "p1" => '1234/1456/78990/123324345/12143343214345/231432143/12432412/435435/',
 "p2" => '1432424/123421421/345/435/6/65/5/34/3/2/21/1243',
..
..
..    
}


idx = 1
p_count = pages["total_pages"]

#set up the HTTP request to change pages to get all the auction results
uri = URI.parse("http://somerandomwebsite.com?listings")
http = Net::HTTP.new(uri.host, uri.port)
req = Net::HTTP::Post.new(uri.request_uri)

p_count.times do
  puts "On loop sequence: #{idx}"
  pg_num = "p#{idx}"
  pg_content = pages["#{pg_num}"]
  req.set_form_data({"page" => "#{pg_num}", "#{pg_num}" => "#{pg_content}"})

  response = http.request(req)
  page = Nokogiri::HTML(response.body)
  idx = idx + 1
end

It looks like page never changes. Is there a way to see what the full request looks like on each iteration? I want to make sure that the proper params are being passed. It seems virtually impossible to determine anything about req.

the Tin Man
Zack Herbert
  • Please read "[mcve]". Your code won't run, and we have to change it to test to identify the problem. That wastes our time. I'd recommend NOT using Net::HTTP, instead use one of the many HTTP clients that exist for Ruby. Net::HTTP is great if you're inventing a new server type, but it's very low-level for normal HTTP work, especially when you're just requesting pages. As far as seeing the request, http://httpbin.org can be very useful. – the Tin Man Feb 09 '17 at 19:50

1 Answer

A great way to debug HTTP is to take advantage of http://httpbin.org:

require 'net/http'
uri = URI('http://httpbin.org/post')
res = Net::HTTP.post_form(uri, 'q' => 'ruby', 'max' => '50')
puts res.body

Which returns:

# >> {
# >>   "args": {}, 
# >>   "data": "", 
# >>   "files": {}, 
# >>   "form": {
# >>     "max": "50", 
# >>     "q": "ruby"
# >>   }, 
# >>   "headers": {
# >>     "Accept": "*/*", 
# >>     "Accept-Encoding": "gzip;q=1.0,deflate;q=0.6,identity;q=0.3", 
# >>     "Content-Length": "13", 
# >>     "Content-Type": "application/x-www-form-urlencoded", 
# >>     "Host": "httpbin.org", 
# >>     "User-Agent": "Ruby"
# >>   }, 
# >>   "json": null, 
# >>   "origin": "216.69.191.1", 
# >>   "url": "http://httpbin.org/post"
# >> }
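
The request can also be inspected locally before anything goes over the wire, which speaks to the question of what req actually holds. This is a minimal sketch, assuming the same Net::HTTP::Post object as in the question; the build_request helper and the shortened page value are made up for illustration:

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: builds the POST the same way the question's loop does.
def build_request(uri, pg_num, pg_content)
  req = Net::HTTP::Post.new(uri.request_uri)
  req.set_form_data('page' => pg_num, pg_num => pg_content)
  req
end

uri = URI.parse('http://somerandomwebsite.com?listings')
req = build_request(uri, 'p1', '1234/1456/') # shortened sample value

puts req.method # HTTP verb
puts req.path   # path and query string that will be requested
puts req.body   # URL-encoded form body set by set_form_data
req.each_header { |name, value| puts "#{name}: #{value}" }

# To dump the raw traffic of a real request, enable debug output
# before calling http.request(req). Nothing is sent by these two lines.
http = Net::HTTP.new(uri.host, uri.port)
http.set_debug_output($stderr)
```

If req.body prints the same string on every pass through the loop, the problem is in how the params are built; if it changes as expected, the server is probably looking for something else in the request.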

That said, I'd recommend not using Net::HTTP. There are plenty of great HTTP clients for Ruby that will make it easier to write your code. For instance, here's the same thing using HTTPClient:

require 'httpclient'
clnt = HTTPClient.new
res = clnt.post('http://httpbin.org/post', 'q' => 'ruby', 'max' => '50')
puts res.body

# >> {
# >>   "args": {}, 
# >>   "data": "", 
# >>   "files": {}, 
# >>   "form": {
# >>     "max": "50", 
# >>     "q": "ruby"
# >>   }, 
# >>   "headers": {
# >>     "Accept": "*/*", 
# >>     "Content-Length": "13", 
# >>     "Content-Type": "application/x-www-form-urlencoded", 
# >>     "Date": "Thu, 09 Feb 2017 20:03:57 GMT", 
# >>     "Host": "httpbin.org", 
# >>     "User-Agent": "HTTPClient/1.0 (2.8.3, ruby 2.4.0 (2016-12-24))"
# >>   }, 
# >>   "json": null, 
# >>   "origin": "216.69.191.1", 
# >>   "url": "http://httpbin.org/post"
# >> }
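
Because httpbin echoes the request back as JSON, the check can be automated instead of read by eye. This is a minimal sketch, assuming the response body has the shape shown above; SAMPLE_BODY is an abbreviated copy of that output, and echoed_form is a hypothetical helper name:

```ruby
require 'json'

# Abbreviated copy of the httpbin output shown above.
SAMPLE_BODY = <<~BODY
  {
    "args": {},
    "form": {
      "max": "50",
      "q": "ruby"
    },
    "url": "http://httpbin.org/post"
  }
BODY

# Hypothetical helper: returns the form fields httpbin says it received.
def echoed_form(body)
  JSON.parse(body).fetch('form')
end

form = echoed_form(SAMPLE_BODY)
puts form.inspect
```

In real use, res.body from either client would take the place of SAMPLE_BODY.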

This is untested code because you didn't tell us nearly enough, but it's where I'd start for what you're doing:

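require 'httpclient'
require 'nokogiri'

BASE_URL = 'http://somerandomwebsite.com?listings'

# One entry per page, in page order.
PAGES = [
 '1234/1456/78990/123324345/12143343214345/231432143/12432412/435435/',
 '1432424/123421421/345/435/6/65/5/34/3/2/21/1243',
]

clnt = HTTPClient.new

PAGES.each.with_index(1) do |page, idx|
  puts "On loop sequence: #{idx}"

  # The question's form sends "page" => "p1" plus a "p1" => <content> field,
  # so build the same string key for each page.
  pg_num = "p#{idx}"
  response = clnt.post(BASE_URL, 'page' => pg_num, pg_num => page)

  doc = Nokogiri::HTML(response.body)
  # ...
end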
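
Before pointing a loop like this at the real site, a dry run that only prints the encoded body for each page can confirm that the params really change per iteration. This is a sketch under assumptions: the form_params helper is hypothetical, the page values are shortened, and the "p1"/"p2" key scheme is taken from the question's code:

```ruby
require 'uri'

# Shortened sample values standing in for the real page contents.
PAGE_CONTENTS = ['1234/1456/', '1432424/123421421'].freeze

# Hypothetical helper: the form fields for one page, keyed as in the question.
def form_params(idx, content)
  key = "p#{idx}"
  { 'page' => key, key => content }
end

# Encode each page's params exactly as a form POST body would be encoded.
encoded = PAGE_CONTENTS.each.with_index(1).map do |content, idx|
  URI.encode_www_form(form_params(idx, content))
end

puts encoded
```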
the Tin Man