I'm using the following code:

require 'rubygems'
require 'mechanize'
require 'nokogiri'  
require 'open-uri'  
require 'logger'
require 'json'
require 'slowweb'
SlowWeb.limit('linkedin.com', 1, 10)

# create agent
agent = Mechanize.new { |agent| 
  agent.user_agent_alias = 'Mac Firefox'
  agent.log = Logger.new "mech.log" 
}
agent.follow_meta_refresh = true
page = agent.get("https://ca.linkedin.com/")

# login
login_form = page.forms.first
login_form.session_key = "username"
login_form.session_password = "pass"

page = agent.submit(login_form, login_form.buttons.first)
search_page = agent.get("https://www.linkedin.com/vsearch/f?type=all&keywords=Recruiter+Boston")
results = search_page.body.scan(/\{"person"\:\{.*?\}\}/)
results.each do |person|
  json = JSON.parse(person)
  puts json['person']['firstName'] 
  puts json['person']['lastName']
end

This lists people who are my current connections, so I know I'm logged in. But when I run the same search manually in a browser, it lists Boston recruiters as expected.

I suspect my crawler is being recognized and served different results, but if you have any other ideas I'd love to hear them.
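As a sanity check, the scan-and-parse step can be exercised in isolation against a canned response body (the fragment below uses made-up names, not real LinkedIn output), which helps confirm the problem is the page being served rather than the extraction:

```ruby
require 'json'

# Hypothetical body containing two person fragments in the shape
# the regex expects (made-up data, not real LinkedIn markup).
body = 'junk {"person":{"firstName":"Jane","lastName":"Doe"}} more ' \
       'junk {"person":{"firstName":"John","lastName":"Smith"}} tail'

results = body.scan(/\{"person"\:\{.*?\}\}/)
names = results.map do |fragment|
  person = JSON.parse(fragment)['person']
  [person['firstName'], person['lastName']]
end

p names  # => [["Jane", "Doe"], ["John", "Smith"]]
```

If this prints the expected names but the live run returns your connections, the request is being answered with a different page, not mis-parsed.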

  • What have you done to prove/disprove your suspicion? When asking a question like this, we need a minimal HTML sample that works with your code, and your code should be runnable. As is you're asking us to have a LinkedIn account and set up code that can run your code. See "[ask]", including the links at the bottom of the page. I'd recommend using Nokogiri to parse the HTML as it'll help remove the chance of false-positives since regex are not good at handling markup. – the Tin Man Feb 04 '16 at 21:43
  • Code added. LinkedIn search results are spit out in JavaScript, so the JSON is needed vs Nokogiri. Prove/disprove: I've run the script and the names that come back are my current connections; a manual search with the same URL cut/pasted comes back as recruiters from Boston. – user1222303 Feb 04 '16 at 22:47
  • Much better. You might want to look at the real user-agent string for Mac OS Firefox. http://www.useragentstring.com/pages/Firefox/ *If* they're sniffing your user-agent, using the full string could help. Rather than scraping, have you tried using their API? Scraping is sure to be a violation of their TOS, whereas using their API will avoid these sort of problems. – the Tin Man Feb 04 '16 at 23:12
  • Were you able to resolve this? I am having the exact same issue. – Martin Sommer Jun 10 '16 at 04:01
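On the Nokogiri-vs-regex point raised in the comments: even for JSON fragments, the non-greedy pattern is fragile, because it stops at the first `}}` and will truncate any person record whose last field is itself a nested object. A quick demonstration (both fragments are hypothetical, not real LinkedIn output):

```ruby
require 'json'

pattern = /\{"person"\:\{.*?\}\}/

flat   = '{"person":{"firstName":"Jane","lastName":"Doe"}}'
nested = '{"person":{"firstName":"Jane","positions":{"count":2}}}'

# The flat record matches in full and parses cleanly.
JSON.parse(flat.scan(pattern).first)

# The nested record is cut off at the first "}}", dropping the
# final closing brace, so the captured fragment is invalid JSON.
truncated = nested.scan(pattern).first
begin
  JSON.parse(truncated)
rescue JSON::ParserError
  puts "truncated fragment is not valid JSON"
end
```

A tolerant parse loop (rescuing `JSON::ParserError` per fragment) would at least surface how many records the regex is silently mangling.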

0 Answers