
When I try to download a page from LinkedIn with the following command:

curl -I https://www.linkedin.com/company/google

I get a 999 status code:

HTTP/1.1 200 Connection established

HTTP/1.1 999 Request denied
Date: Tue, 30 Aug 2016 08:19:35 GMT
X-Li-Pop: prod-tln1-hybla
Content-Length: 1629
Content-Type: text/html

Since users with a browser can access LinkedIn pages, LinkedIn must be able to tell the difference between robots and regular users.

Otherwise, users would not be allowed to access LinkedIn pages at all, because of the following lines at the end of robots.txt:

User-agent: *
Disallow: /

So LinkedIn can tell the difference between requests coming from browsers and requests coming from anything else. How do they do that?
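For reference, you can fetch the robots.txt yourself with the same kind of curl call to see the full list of rules (the contents may have changed since I checked):

curl https://www.linkedin.com/robots.txt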

Gabsn
  • You can find a lot of details [here](https://techcrunch.com/2016/08/15/linkedin-sues-scrapers/). – paul trmbrth Aug 30 '16 at 08:49
  • Interesting, but my question is ***how*** they do that, not ***why*** they do that... – Gabsn Aug 30 '16 at 08:59
  • If you read the article, it refers to the document to have details on how. "The lawsuit details several of LinkedIn’s automated tools that prevent data harvesting. Dubbed FUSE, Quicksand and Sentinel, these tools monitor the web traffic of LinkedIn users and limit how many other profiles a user can view, and how quickly a user can view those profiles." – paul trmbrth Aug 30 '16 at 09:00
  • Also http://fraudengineering.com/linkedin-anti-scraping-techniques/ – paul trmbrth Aug 30 '16 at 09:06

1 Answer


In the particular case you presented, it is probably because you didn't specify a user agent.

When a browser makes a request, it sends information to the website such as a User-Agent header, cookies, accepted language and encoding, screen resolution, and so on.

In the absence of this information, the server can simply reject the connection.

To see what headers are sent to a particular website, open the network tab of any modern browser's developer tools while you connect to it.
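As a rough sketch, you could retry the original request with browser-like headers; the values below are just examples copied from a typical browser, and there is no guarantee LinkedIn will accept them:

curl -I https://www.linkedin.com/company/google \
  -H 'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36' \
  -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8' \
  -H 'Accept-Language: en-US,en;q=0.8'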

Another thing LinkedIn does is check, for a given IP, whether the AJAX requests for the other page elements are made when a page is requested. Since most scrapers can't interpret JavaScript, those requests never happen, which makes it easy to tell whether a request came from a browser or from a potential bot.

After that it's all about user behaviour: accessing pages that can't be reached directly (only by navigating), identifying behaviour patterns for the IP or the logged-in account, or even checking a user's network. The bigger an account's network, the less likely that account is being used for scraping.

PS. It's a really, really BAD idea to scrape LinkedIn, even if you manage to avoid all of their mechanisms.

Rafael Almeida