Questions tagged [robots.txt]

Robots.txt (the Robots Exclusion Protocol) is a text file placed in the root of a website's domain to give instructions to compliant web robots (such as search engine crawlers) about which pages to crawl and not to crawl, as well as other information such as a Sitemap location. In modern frameworks it can be useful to programmatically generate the file. General questions about search engine optimization are more appropriate on the Webmasters Stack Exchange site.
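
A minimal sketch of generating the file on the fly, assuming a Python/Flask app (the route and the rules below are illustrative, not from any particular question):

from flask import Flask, Response

app = Flask(__name__)

@app.route("/robots.txt")
def robots_txt():
    # Build the directives from whatever rules the application knows about.
    lines = [
        "User-agent: *",
        "Disallow: /admin/",
        "Sitemap: https://www.example.com/sitemap.xml",
    ]
    return Response("\n".join(lines), mimetype="text/plain")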

Website owners use the /robots.txt file to give instructions about their site to web robots; this is called The Robots Exclusion Protocol.

It works like this: a robot wants to visit a website URL, say http://www.example.com/welcome.html. Before it does so, it first checks for http://www.example.com/robots.txt, and finds:

User-agent: *
Disallow: /

The "User-agent: *" means this section applies to all robots. The "Disallow: /" tells the robot that it should not visit any pages on the site.

There are two important considerations when using /robots.txt:

  • Robots can ignore your /robots.txt. In particular, malware robots that scan the web for security vulnerabilities, and email address harvesters used by spammers, will pay no attention to it.
  • The /robots.txt file is publicly available. Anyone can see which sections of your server you don't want robots to visit, so don't try to use /robots.txt to hide information.

More information can be found at http://www.robotstxt.org/.

1426 questions
6
votes
2 answers

How to work with RobotsTxtMiddleware in Scrapy framework?

The Scrapy framework has RobotsTxtMiddleware, which makes sure Scrapy respects robots.txt. You need to set ROBOTSTXT_OBEY = True in the settings, and then Scrapy will respect robots.txt policies. I did that and ran the spider. In the debug output I have seen a request to…
Max
  • 315
  • 4
  • 17
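
With that flag enabled, Scrapy downloads each domain's robots.txt before the first request to that domain and drops disallowed requests. A minimal sketch of the relevant settings.py entry (RobotsTxtMiddleware ships in Scrapy's default downloader middlewares, so no other change should be needed):

# settings.py
ROBOTSTXT_OBEY = True  # let RobotsTxtMiddleware enforce robots.txt rules
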
6
votes
1 answer

Python requests vs. robots.txt

I have a script meant for personal use that scrapes some websites for information, and until recently it worked just fine. But it seems one of the websites has beefed up its security, and I can no longer access its contents. I'm using Python with…
Austin
  • 427
  • 2
  • 7
  • 16
6
votes
2 answers

Is the User-Agent line in robots.txt an exact match or a substring match?

When a crawler reads the User-Agent line of a robots.txt file, does it attempt to match it exactly against its own User-Agent, or does it attempt to match it as a substring of its User-Agent? Nothing I have read explicitly answers this…
josephdpurcell
  • 1,157
  • 3
  • 16
  • 34
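
For reference, the original robotstxt.org guidance recommends a case-insensitive substring match on the robot's name, ignoring version information; individual crawlers may differ. A minimal sketch of that interpretation (the function is illustrative, not any crawler's actual code):

def agent_line_matches(token, own_user_agent):
    # Case-insensitive substring match, as the original spec recommends.
    return token.strip().lower() in own_user_agent.lower()

print(agent_line_matches("Googlebot", "Mozilla/5.0 (compatible; Googlebot/2.1)"))  # True
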
6
votes
5 answers

Robotparser doesn't seem to parse correctly

I am writing a crawler, and for this I am implementing a robots.txt parser using the standard library robotparser. It seems that robotparser is not parsing correctly; I am debugging my crawler against Google's robots.txt. (The following examples are…
user689383
6
votes
1 answer

robots.txt URL format

According to this page, globbing and regular expressions are not supported in either the User-agent or Disallow lines. However, I noticed that the Stack Overflow robots.txt includes characters like * and ? in the URLs. Are these supported or…
Dónal
  • 185,044
  • 174
  • 569
  • 824
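
Those characters are extensions to the original protocol: the base spec has no pattern syntax, but major crawlers such as Googlebot and Bingbot honour * (matches any character sequence) and $ (anchors the pattern to the end of the URL). For example:

User-agent: *
Disallow: /*?tab=
Disallow: /*.pdf$
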
6
votes
1 answer

Can I use robots.txt to block certain URL parameters?

Before you tell me 'what have you tried' and 'test this yourself', I would like to note that robots.txt updates awfully slowly for any site on search engines, so if you could provide theoretical experience, that would be appreciated. For…
Lucas
  • 16,930
  • 31
  • 110
  • 182
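
In theory, for crawlers that support the * wildcard extension (Google and Bing document it), a single pattern blocks every URL containing a query string while leaving the parameter-free pages crawlable:

User-agent: *
Disallow: /*?
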
6
votes
1 answer

BingBot & BaiduSpider don't respect robots.txt

After my CPU usage suddenly went over 400% due to bots swamping my site, I created a robots.txt as follows and placed the file in my root, e.g. "www.example.com/": User-agent: * Disallow: / Now Google respects this file and there is no more…
Richard
  • 91
  • 1
  • 3
5
votes
1 answer

YQL "Redirected to a robots.txt restricted URL" Error for Google Domain

I am using the YQL Console and I want to return results from this link in Google Shopping. Using the following in YQL: select content from html where…
ToddN
  • 2,901
  • 14
  • 56
  • 96
5
votes
4 answers

How can I get robots.txt to block access to URLs on a site after the "?" character but index the page itself?

I have a small Magento site which consists of page URLs such as: http://www.example.com/contact-us.html http://www.example.com/customer/account/login/ However, I also have pages which include filters (e.g. price and colour), and two such examples…
5
votes
1 answer

robots.txt file pointing to a local sitemap

Within a robots.txt file, would it be possible to use a relative path instead of an absolute one for pointing to a Sitemap? That is, Sitemap: sitemap.xml instead of Sitemap: http://www.example.com/sitemap.xml. A curious note from the SO robots.txt: # # this technically…
GibboK
  • 71,848
  • 143
  • 435
  • 658
5
votes
3 answers

get "Property not in account? when checking robots.txt

I see many URLs with status Excluded in Google Search Console. When I click on "TEST ROBOTS.TXT BLOCKING" I get the following error: "Property not in account. You are verified to see sc-domain://, but it's not in your account." When I click on add…
Arash Rabiee
  • 1,019
  • 2
  • 16
  • 31
5
votes
2 answers

robots.txt content / selenium web scraping

I am trying to do web scraping using Selenium. What does this robots.txt content mean? User-Agent: * Disallow: /go/ Disallow: /launch-announcement/ Can I scrape all folders except go and launch-announcement?
Shabari nath k
  • 920
  • 1
  • 10
  • 23
5
votes
1 answer

How to stop bots from crawling or indexing an Angular app

I want to publish an Angular app for testing purposes, but I want to make sure that the site does not get crawled or indexed by bots. I assume (I might be way off!) that I would simply add my tags on my index.html page, and for good measure add a…
onmyway
  • 1,435
  • 3
  • 29
  • 53
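
The usual approach is a robots meta tag in index.html, or the equivalent X-Robots-Tag response header; note that a page disallowed in robots.txt is never fetched at all, so a noindex placed on it cannot be seen by the crawler. A sketch of both forms (directive values are illustrative):

<meta name="robots" content="noindex, nofollow">

X-Robots-Tag: noindex, nofollow
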
5
votes
2 answers

URL Blocking Bots

I have a client for whom I am trying to set up an SSL certificate via SSL For Free, as I have done 100 times before. I created the file structure under public_html: .well-known > pki-validation > I then tried to…
Joe
  • 4,877
  • 5
  • 30
  • 51
5
votes
1 answer

How to exclude all robots except Googlebot and Bingbot with both robots.txt and X-Robots-Tag

I have 2 questions regarding crawlers and robots. Background info: I only want Google and Bing to be excluded from the “disallow” and “noindex” limitations. In other words, I want ALL search engines except Google and Bing to follow the “disallow” and…
VinceJ
  • 71
  • 2
  • 5
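
A common robots.txt pattern for this is one group naming the allowed bots and a catch-all group for everyone else, as sketched below. (For the header side, Google documents an optional user-agent prefix on X-Robots-Tag, e.g. "X-Robots-Tag: otherbot: noindex".)

User-agent: Googlebot
User-agent: Bingbot
Disallow:

User-agent: *
Disallow: /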