
I'm new to web crawling and I'm having trouble understanding a particular robots.txt file. This is what the website has:

```
User-agent: *
Allow: /
Sitemap: sitemapURLHere
```

So I looked up the / here and found that it matches any path. Does this mean that the website allows all pages to be crawled? However, when I try a basic crawl of the sitemap.xml (or another site URL) with scrapy, i.e.

```
scrapy shell siteURL
```

I get an HTTP 403 response, which I'm assuming (based on this link) means that the website doesn't want to be scraped... so what exactly does this site's robots.txt mean?

EDIT: The file I am talking about is here.
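For reference, Python's standard-library robot-file parser can confirm this reading of the rules. A minimal sketch, feeding in the exact directives quoted above (the example.com URLs are placeholders, not the site from the question):

```python
from urllib.robotparser import RobotFileParser

# Parse the rules from the question directly rather than fetching
# the real robots.txt over the network.
rp = RobotFileParser()
rp.parse("User-agent: *\nAllow: /".splitlines())

# "Allow: /" matches every path, so any URL is crawlable by any agent.
print(rp.can_fetch("*", "https://example.com/any/page"))      # True
print(rp.can_fetch("scrapy-bot", "https://example.com/xyz"))  # True
```

Both calls return True, which is exactly what `Allow: /` under `User-agent: *` means.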

ocean800
  • It probably is a page requiring authentication. A 403 would typically indicate that it shouldn't be scraped – OneCricketeer Jun 09 '17 at 00:03
  • @cricket_007 I see! I just printed out the site's `response.text` and realized that it was asking for a captcha, so that would be my problem, correct? – ocean800 Jun 09 '17 at 00:13
  • It would seem so – OneCricketeer Jun 09 '17 at 00:14
  • robots.txt is used by search crawlers, but the server can apply other restrictions of its own. For example, robots.txt may allow everything, yet the server still won't serve content without an acceptable user-agent. – Verz1Lka Jun 09 '17 at 06:04
  • If you can open the same URL in the browser, you could try 1) setting the user agent equal to the browser's or 2) verifying there is no JavaScript in your way (see the sketch just below these comments). – Djunzu Jun 10 '17 at 03:46
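Following up on the user-agent suggestion: a minimal sketch of the same request with a browser-like User-Agent header, using only the Python standard library (the URL and the header value are placeholders, not the actual site from the question):

```python
import urllib.request

# Placeholder URL; substitute the actual siteURL from the question.
url = "https://example.com/"

# Some servers answer 403 to default library user agents even when
# robots.txt allows everything; a browser-like header may get through.
req = urllib.request.Request(
    url,
    headers={"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"},
)
with urllib.request.urlopen(req) as resp:
    print(resp.status, resp.headers.get("Content-Type"))
```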

1 Answer


It means "any user agent (i.e., any bot) can access all content" and "there is a sitemap available at sitemapURLHere".

REM: a robots.txt file is only a set of indications, not a means of enforcing access restrictions. If you can't scrape, it is not because of the robots.txt itself.
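To make that distinction concrete, here is a hypothetical Scrapy spider sketch: ROBOTSTXT_OBEY makes the crawler honor robots.txt, but that is a choice made on the client side, and the server can still answer 403 for entirely unrelated reasons (the spider name, start URL, and user agent below are placeholders):

```python
import scrapy

class SitemapCheckSpider(scrapy.Spider):
    # Hypothetical spider name and start URL for illustration.
    name = "sitemap_check"
    start_urls = ["https://example.com/sitemap.xml"]

    custom_settings = {
        # Obeying robots.txt is the crawler's decision, not something
        # the file itself can enforce on the server's behalf.
        "ROBOTSTXT_OBEY": True,
        # A browser-like user agent sometimes avoids blanket 403s.
        "USER_AGENT": "Mozilla/5.0 (compatible; ExampleBot/1.0)",
    }

    def parse(self, response):
        self.logger.info("Fetched %s with status %s",
                         response.url, response.status)
```

Even with a fully permissive robots.txt, a captcha or server-side user-agent filtering can still produce a 403, which is what the comments above uncovered.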

Jérôme Verstrynge