
I'm new to web crawling and I'm having trouble understanding a particular robots.txt file. This is what the website has:

```
User-agent: *
Allow: /
Sitemap: sitemapURLHere
```

So I looked up the / here and found that it matches any path. Does this mean that the website allows all pages to be crawled? However, when I try a basic crawl of the sitemap.xml (or another site URL) with scrapy, i.e.

```
scrapy shell siteURL
```

I get an HTTP 403 response, which I'm assuming (based on this link) means that the website doesn't want to be scraped... so what exactly does this site's robots.txt mean?

EDIT: The file I am talking about is here.
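For reference, Python's standard-library robot-file parser can confirm this reading of the rules. A minimal sketch, feeding in the exact directives quoted above (the example.com URLs are placeholders, not the site from the question):

```python
from urllib.robotparser import RobotFileParser

# Parse the rules from the question directly rather than fetching
# the real robots.txt over the network.
rp = RobotFileParser()
rp.parse("User-agent: *\nAllow: /".splitlines())

# "Allow: /" matches every path, so any URL is crawlable by any agent.
print(rp.can_fetch("*", "https://example.com/any/page"))      # True
print(rp.can_fetch("scrapy-bot", "https://example.com/xyz"))  # True
```

Both calls return True, which is exactly what `Allow: /` under `User-agent: *` means.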

ocean800
  • It probably is a page requiring authentication. A 403 would typically indicate that it shouldn't be scraped – OneCricketeer Jun 09 '17 at 00:03
  • @cricket_007 I see! I just printed out the site's `response.text` and realized that it was asking for a captcha, so that would be my problem, correct? – ocean800 Jun 09 '17 at 00:13
  • It would seem so – OneCricketeer Jun 09 '17 at 00:14
  • robots.txt is used by search crawlers, but the server can apply other restrictions of its own. For example, robots.txt may allow everything, yet the server still won't serve content without an acceptable user-agent. – Verz1Lka Jun 09 '17 at 06:04
  • If you can open the same URL in the browser, you could try 1) setting the user agent equal to the browser's or 2) verifying there is no JavaScript in your way (see the sketch just below these comments). – Djunzu Jun 10 '17 at 03:46
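Following up on the user-agent suggestion: a minimal sketch of the same request with a browser-like User-Agent header, using only the Python standard library (the URL and the header value are placeholders, not the actual site from the question):

```python
import urllib.request

# Placeholder URL; substitute the actual siteURL from the question.
url = "https://example.com/"

# Some servers answer 403 to default library user agents even when
# robots.txt allows everything; a browser-like header may get through.
req = urllib.request.Request(
    url,
    headers={"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"},
)
with urllib.request.urlopen(req) as resp:
    print(resp.status, resp.headers.get("Content-Type"))
```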

1 Answer


It means "any user agent (i.e., any bot) can access all content" and "there is a sitemap available at sitemapURLHere".

REM: a robots.txt file is only a set of indications, not a means of enforcing access restrictions. If you can't scrape, it is not because of the robots.txt itself.
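To make that distinction concrete, here is a hypothetical Scrapy spider sketch: ROBOTSTXT_OBEY makes the crawler honor robots.txt, but that is a choice made on the client side, and the server can still answer 403 for entirely unrelated reasons (the spider name, start URL, and user agent below are placeholders):

```python
import scrapy

class SitemapCheckSpider(scrapy.Spider):
    # Hypothetical spider name and start URL for illustration.
    name = "sitemap_check"
    start_urls = ["https://example.com/sitemap.xml"]

    custom_settings = {
        # Obeying robots.txt is the crawler's decision, not something
        # the file itself can enforce on the server's behalf.
        "ROBOTSTXT_OBEY": True,
        # A browser-like user agent sometimes avoids blanket 403s.
        "USER_AGENT": "Mozilla/5.0 (compatible; ExampleBot/1.0)",
    }

    def parse(self, response):
        self.logger.info("Fetched %s with status %s",
                         response.url, response.status)
```

Even with a fully permissive robots.txt, a captcha or server-side user-agent filtering can still produce a 403, which is what the comments above uncovered.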

Jérôme Verstrynge