Questions tagged [robots.txt]

Convention for telling web crawlers which parts of your website not to crawl.

If a site owner wishes to give instructions to web robots, they must place a text file called robots.txt in the root of the web site hierarchy (e.g. www.example.com/robots.txt). This text file contains the instructions in a specific format (see examples below). Robots that choose to follow the instructions try to fetch this file and read the instructions before fetching any other file from the web site. If this file doesn't exist, web robots assume that the site owner wishes to provide no specific instructions.

A robots.txt file on a website will function as a request that specified robots ignore specified files or directories when crawling a site. This might be, for example, out of a preference for privacy from search engine results, or the belief that the content of the selected directories might be misleading or irrelevant to the categorization of the site as a whole, or out of a desire that an application only operate on certain data. Links to pages listed in robots.txt can still appear in search results if they are linked to from a page that is crawled.

For websites with multiple subdomains, each subdomain must have its own robots.txt file. If example.com had a robots.txt file but a.example.com did not, the rules that would apply for example.com would not apply to a.example.com.
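
A minimal illustrative robots.txt (the directory names and the crawler name BadBot are hypothetical):

```
# Allow all crawlers everywhere except two directories
User-agent: *
Disallow: /private/
Disallow: /tmp/

# Block one specific crawler entirely
User-agent: BadBot
Disallow: /
```

A crawler obeys the most specific User-agent group that matches it, so BadBot here follows only the second group and ignores the first.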

Source: Wikipedia

86 questions
1
vote
2 answers

Is there a way to disallow robots crawling through IIS Management Console for entire site

Can I achieve the same effect as robots.txt through IIS settings? That is, specify User-agent: * Disallow: / in a host header or through web.config?
jpkeisala
  • 166
  • 8
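
IIS has no setting that literally generates a robots.txt, but a related sketch: web.config can add an X-Robots-Tag response header site-wide. Note that this asks engines not to index pages, which is not the same as Disallow (which stops crawling), so treat it as a complement to robots.txt rather than a replacement:

```xml
<!-- Sketch only: send X-Robots-Tag on every response from this site.
     "noindex, nofollow" discourages indexing; it does not block crawling. -->
<configuration>
  <system.webServer>
    <httpProtocol>
      <customHeaders>
        <add name="X-Robots-Tag" value="noindex, nofollow" />
      </customHeaders>
    </httpProtocol>
  </system.webServer>
</configuration>
```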
1
vote
2 answers

Weird entry in access.log on Apache 2.2

I'm running Apache 2.2, and my server runs well. I noticed this weird anomaly in my access.log file; how should I prevent it? robots.txt doesn't seem to be working. 127.0.0.1 - - [17/Apr/2011:12:17:00 +0100] "GET / HTTP/1.1" 200 3022 "-" "msnbot/1.1…
1
vote
3 answers

Should I ban spiders?

A Rails template script that I've been looking at automatically adds User-Agent: and Disallow: entries to robots.txt, thereby banning all spiders from the site. What are the benefits of banning spiders, and why would you want to?
marflar
  • 397
  • 1
  • 2
  • 9
1
vote
2 answers

robots.txt file with more restrictive rules for certain user agents

I'm a bit vague on the precise syntax of robots.txt, but what I'm trying to achieve is: Tell all user agents not to crawl certain pages Tell certain user agents not to crawl anything (basically, some pages with enormous amounts of data should…
Carson63000
  • 111
  • 3
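
A sketch of what such a file might look like (paths and the crawler name HungryBot are assumptions, not from the question). Because a crawler follows only the most specific User-agent group that matches it, the group for a named bot must repeat, or replace, anything it should also obey from the * group:

```
# All crawlers: stay out of the heavy data pages
User-agent: *
Disallow: /big-data/

# One specific crawler: crawl nothing at all
User-agent: HungryBot
Disallow: /
```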
1
vote
3 answers

Does GoogleBot respect User-agent: *

I blocked a page in robots.txt under User-agent: *, and tried to do a manual removal of that URL from Google's cache in the webmasters tools. Google said it wasn't being blocked in my robots.txt, so I then blocked it specifically under User-agent:…
user40696
  • 113
  • 5
1
vote
7 answers

Robots.txt command

I have a bunch of files at www.example.com/A/B/C/NAME (A, B, C change around; NAME is static), and I basically want to add a command in robots.txt so crawlers don't follow any such links that have NAME at the end. What's the best command to use in…
Mike F
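
Wildcards are not part of the original robots.txt convention, but major crawlers such as Googlebot and Bingbot honor * (match any characters) and $ (anchor to the end of the URL) in Disallow rules. A hedged sketch for this layout, keeping the question's NAME placeholder as-is:

```
User-agent: *
# Block any URL whose path ends in NAME, at any depth
# (only honored by crawlers that support * and $ extensions)
Disallow: /*NAME$
```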
1
vote
0 answers

What are they trying to get with "GET /public-projects"

I haven't even shared my website with anyone yet, and I have already started seeing attempts to GET /public-projects. However, I couldn't find any information about it; what are they trying to get? The bots are from Google, Ahrefs, Semrush, etc. Am I…
fersarr
  • 111
  • 1
  • 4
1
vote
1 answer

Tons of Access from Google Proxy

I frequently get a lot of requests from a Google proxy. It says it is the Google Favicon bot, and I've checked that with the host command. The user agent is like the following. "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.75…
sasa
  • 11
  • 1
0
votes
1 answer

Wordpress - Yoast SEO - robots.txt

I was just reading up, on the Yoast website here, on how the correct robots.txt file should look with the latest SEO practices. In their example Yoast uses the following: User-Agent: * Disallow: /suggest/?* What exactly does the line Disallow:…
0
votes
2 answers

How to serve robots.txt for all my own subdomains but not other hosts on Apache?

We develop websites, and we host the QA environment on the same server as the production environment. I want to serve a specific robots.txt for all QA sites but not for the production sites. We have a lot of sites, so I do not want to manually update…
Sander Marechal
  • 289
  • 4
  • 11
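
One possible approach, sketched under assumptions (that QA hosts follow a naming pattern like *.qa.example.com, and that the file paths shown exist): a server-wide mod_rewrite rule that serves a blocking robots.txt whenever the Host header matches the QA pattern, leaving production hosts untouched.

```apache
# Global server config sketch (inherited by vhosts), not per-site config.
# /var/www/qa-robots.txt is assumed to contain:
#   User-agent: *
#   Disallow: /
Alias /qa-robots.txt /var/www/qa-robots.txt
RewriteEngine On
RewriteCond %{HTTP_HOST} \.qa\.example\.com$ [NC]
# [PT] passes the rewritten path back through URL mapping so the Alias applies
RewriteRule ^/robots\.txt$ /qa-robots.txt [PT]
```

The Alias target directory would also need to be readable by Apache (a matching Directory section with Require all granted on 2.4).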
0
votes
1 answer

Robots.txt with several VirtualHosts

My web server (Apache 2.4.10) is running different virtual hosts for the following domain names: foo.example.com bar.example.com www.example.com example.com Here is the configuration file for my vhosts: DocumentRoot…
Kiwi387
  • 3
  • 2
0
votes
1 answer

Robots file behaviour

I've noticed something odd today. If I go to http://www.google.com/robots.txt, IE11 shows me the contents of Google's robots file. However, if I go to my site (still in development) using the same browser and point it to robots.txt, IE asks if I want…
537mfb
  • 167
  • 1
  • 11
0
votes
2 answers

How to write a RewriteCond for a URL with a subdomain

I am trying to ban some bots by writing a RewriteCond rule in my .htaccess file. Is the following ruleset correct? ## block traffic from particular referrers RewriteEngine On RewriteCond %{HTTP_REFERER}…
developer
  • 555
  • 2
  • 8
  • 16
0
votes
0 answers

Allocating percent resources for all robots

Like many who have posted here, my Apache server is getting brutally hammered nearly to death by robots, most of which are good robots. There is no way to change their crawl rate. Does anyone have a suggestion for how to solve this problem? I thought perhaps…
Ray S.
  • 101
  • 3
0
votes
2 answers

Ideal robots.txt for WordPress?

I browsed the web trying to find the ideal robots.txt content for a hosted WordPress blog. I found several options, for example here and here. I thought this would be a good question for ServerFault: for a "simple" blog over WordPress, what would be…
Roee Adler
  • 266
  • 1
  • 2
  • 10