Questions tagged [robots.txt]

Robots.txt (the file used by the Robots Exclusion Protocol) is a text file placed in the root of a web site domain to give instructions to compliant web robots (such as search engine crawlers) about which pages to crawl and which to skip, as well as other information such as a Sitemap location. In modern frameworks it can be useful to generate the file programmatically. General questions about Search Engine Optimization are more appropriate on the Webmasters StackExchange site.
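As a rough illustration of generating the file programmatically, here is a minimal Python sketch. The function name build_robots_txt, the /admin/ path, and the sitemap URL are all hypothetical and chosen only for this example; a real framework would typically serve the returned string from a route mapped to /robots.txt.

```python
from typing import Iterable, Optional


def build_robots_txt(
    disallow: Iterable[str],
    sitemap_url: Optional[str] = None,
    user_agent: str = "*",
) -> str:
    """Build the text of a simple robots.txt with one user-agent group."""
    lines = [f"User-agent: {user_agent}"]
    paths = list(disallow)
    if paths:
        lines.extend(f"Disallow: {path}" for path in paths)
    else:
        # An empty Disallow value means nothing is disallowed.
        lines.append("Disallow:")
    if sitemap_url:
        lines.append("")
        lines.append(f"Sitemap: {sitemap_url}")
    return "\n".join(lines) + "\n"


# Hypothetical values, for illustration only.
print(build_robots_txt(["/admin/"], "http://www.example.com/sitemap.xml"))
```

Keeping the generation in a pure function that returns a string makes it easy to test separately from whatever web framework ends up serving it.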

Website owners use the /robots.txt file to give instructions about their site to web robots; this is called The Robots Exclusion Protocol.

It works like this: a robot wants to visit a website URL, say http://www.example.com/welcome.html. Before it does so, it first checks for http://www.example.com/robots.txt, and finds:

User-agent: *
Disallow: /

The "User-agent: *" means this section applies to all robots. The "Disallow: /" tells the robot that it should not visit any pages on the site.
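The check a compliant robot performs can be sketched with Python's standard-library urllib.robotparser. This is only an illustration of the behaviour described above, not how any particular crawler is implemented; the user agent name "ExampleBot" is made up, and the rules are parsed from a string instead of being fetched over the network.

```python
from urllib.robotparser import RobotFileParser

# The two-line robots.txt from the example above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if user_agent may fetch url under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# "ExampleBot" is a hypothetical robot name.
print(is_allowed(ROBOTS_TXT, "ExampleBot", "http://www.example.com/welcome.html"))
# prints False
```

Because the group applies to every user agent and disallows the root path, every URL on the site is reported as off limits.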

There are two important considerations when using /robots.txt:

  • Robots can ignore your /robots.txt. In particular, malware robots that scan the web for security vulnerabilities and email address harvesters used by spammers will pay no attention to it.
  • The /robots.txt file is publicly available. Anyone can see which sections of your server you don't want robots to use, so don't try to use /robots.txt to hide information.

More information can be found at http://www.robotstxt.org/.

1426 questions
9
votes
4 answers

How can i fix "Googlebot can't access your site" issue?

I just keep getting a message about "Over the last 24 hours, Googlebot encountered 1 errors while attempting to access your robots.txt. To ensure that we didn't crawl any pages listed in that file, we postponed our crawl. Your site's overall…
Jason
  • 221
  • 2
  • 3
  • 7
9
votes
3 answers

How to add route to dynamic robots.txt in ASP.NET MVC?

I have a robots.txt that is not static but generated dynamically. My problem is creating a route from root/robots.txt to my controller action. This works: routes.MapRoute( name: "Robots", url: "robots", defaults: new { controller = "Home", action =…
JSS
  • 400
  • 2
  • 14
9
votes
1 answer

Any reason to not do a 301 on favicon.ico, apple-touch-icon, and robots.txt?

I would like to redirect requests for these resources to my CDN. Is there any reason to not do this?
John Bachir
  • 22,495
  • 29
  • 154
  • 227
8
votes
3 answers

block google robots for URLS containing a certain word

My client has a load of pages which they don't want indexed by Google. They are all called http://example.com/page-xxx, so they are /page-123 or /page-2 or /page-25 etc. Is there a way to stop Google indexing any page that starts with /page-xxx…
JorgeLuisBorges
  • 528
  • 3
  • 8
  • 21
8
votes
2 answers

Excluding testing subdomain from being crawled by search engines (w/ SVN Repository)

I have: domain.com testing.domain.com I want domain.com to be crawled and indexed by search engines, but not testing.domain.com The testing domain and main domain share the same SVN repository, so I'm not sure if separate robots.txt files would…
8
votes
3 answers

Anybody got any C# code to parse robots.txt and evaluate URLS against it

Short question: Has anybody got any C# code to parse robots.txt and then evaluate URLs against it to see if they would be excluded or not? Long question: I have been creating a sitemap for a new site yet to be released to Google. The sitemap has two…
Simon_Weaver
  • 140,023
  • 84
  • 646
  • 689
8
votes
1 answer

Regexp for robots.txt

I am trying to set up my robots.txt, but I am not sure about the regexps. I've got four different pages all available in three different languages. Instead of listing each page times 3, I figured I could use a regexp. nav.aspx page.aspx/changelang…
patad
  • 9,364
  • 11
  • 38
  • 44
8
votes
4 answers

Unable to map route for robots.txt in asp.net mvc

I am developing an ASP.NET MVC application. I am creating a robots.txt for my application to keep bots out, because my current site is getting many robot requests. So I found this link, Robots.txt file in MVC.NET 4, to create robots.txt. But I when…
Wai Yan Hein
  • 13,651
  • 35
  • 180
  • 372
8
votes
3 answers

What does "Allow: /$" mean in robots.txt

When digging through a Google robots.txt file I noticed a line that I was not familiar with. What does the below code mean in the context of a robots.txt file? Allow: /$ Does the '$' change the meaning any from simply saying Allow: /
Kyle Piira
  • 596
  • 6
  • 8
8
votes
4 answers

Googlebots Ignoring robots.txt?

I have a site with the following robots.txt in the root: User-agent: * Disabled: / User-agent: Googlebot Disabled: / User-agent: Googlebot-Image Disallow: / And pages within this site are getting scanned by Googlebots all day long. Is there…
Tim Scott
  • 15,106
  • 9
  • 65
  • 79
8
votes
1 answer

Disallow certain page directories but NOT that page itself

Let's say I have a dynamic page that creates URLs from user inputs. For example: www.XXXXXXX.com/browse <-------- (Browse being the page) Every time a user enters some query, it generates more pages. For example: www.XXXXXXX.com/browse/abcd…
Raj Sandhu
  • 83
  • 4
8
votes
4 answers

Where to put robots.txt file?

Where should I put robots.txt? domainname.com/robots.txt or domainname/public_html/robots.txt? I placed the file in domainname.com/robots.txt, but it's not opening when I type this in the browser.…
Jitendra Vyas
  • 148,487
  • 229
  • 573
  • 852
8
votes
3 answers

robots.txt allow all except few sub-directories

I want my site to be indexed in search engines except for a few sub-directories. Following are my robots.txt settings: robots.txt in the root directory User-agent: * Allow: / Separate robots.txt in the sub-directory (to be excluded) User-agent:…
Kunwarbir S.
  • 281
  • 1
  • 3
  • 13
8
votes
3 answers

Need to block subdomain using robots.txt which is on same directory level

I have one problem: I have two domain names, for example www.testing.com and new.testing.com, and I do not want new.testing.com to be displayed in any search engine. I have added a robots.txt to new.testing.com. Both sites have the same parent…
Jalpesh Patel
  • 3,150
  • 10
  • 44
  • 68
8
votes
2 answers

Block a site from search engine - DuckDuckGo

I have a development site https://text-domain.com (not a real site). When I go to https://duckduckgo.com and search for text-domain.com, it does return results. What I have tried so far: Created a robots.txt file with the following code (put it in my root…
Vimalnath
  • 6,373
  • 2
  • 26
  • 47