Questions tagged [robots.txt]

A convention for telling web crawlers which parts of a website they should not crawl.

If a site owner wishes to give instructions to web robots, they must place a text file called robots.txt in the root of the web site hierarchy (e.g. www.example.com/robots.txt). This text file contains the instructions in a specific format (see examples below). Robots that choose to follow the instructions try to fetch this file and read the instructions before fetching any other file from the web site. If this file doesn't exist, web robots assume that the site owner wishes to give no specific instructions.

A robots.txt file on a website functions as a request that specified robots ignore specified files or directories when crawling the site. A site owner might want this, for example, out of a preference for privacy from search engine results, because the content of the selected directories is misleading or irrelevant to the categorization of the site as a whole, or out of a desire that an application only operate on certain data. Note that pages listed in robots.txt can still appear in search results if they are linked to from a page that is crawled.

For websites with multiple subdomains, each subdomain must have its own robots.txt file. If example.com had a robots.txt file but a.example.com did not, the rules that would apply for example.com would not apply to a.example.com.
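The format itself is plain text: each record names one or more user agents, followed by the paths those agents should not fetch. A minimal sketch (the directory paths and the bot name are illustrative, not from any real site):

```
# Applies to all crawlers: keep them out of two directories
User-agent: *
Disallow: /private/
Disallow: /tmp/

# Applies to one specific (hypothetical) crawler: block the whole site
User-agent: BadBot
Disallow: /
```

An empty `Disallow:` line means nothing is disallowed, while `Disallow: /` blocks the entire site; some crawlers also understand non-standard extensions such as `Allow:`, wildcard patterns, and `Sitemap:` lines.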

Source: Wikipedia

86 questions
3 votes, 2 answers

Meaning of Disallow: /*? in robots.txt

Yahoo's robots.txt contains: User-agent: * Disallow: /p/ Disallow: /r/ Disallow: /*? What does the last line mean? ("Disallow: /*?")
hussain
2 votes, 2 answers

Rewrite robots.txt based on host with htaccess

I'm trying to rewrite a filename based on the server's domain. The code below is wrong and not working, but illustrates the desired effect. RewriteRule "^/robots\.txt$" "robots-staging.txt" …
Jay • 157
2 votes, 1 answer

What's with random-character queries coming from googlebot, e.g., vvytnoxvontwusz.html?

One of my sites has been getting queries from googlebot, on the order of: example-log:66.249.79.216 - - [06/Apr/2016:15:36:56 -0700] "GET /vvytnoxvontwusz.html HTTP/1.1" 404 15136 "-" "Mozilla/5.0 (compatible; Googlebot/2.1;…
Jim Miller • 713
2 votes, 1 answer

Remove ?=collcc from URL

Google Webmaster Tools has notified me about too many duplicated URLs. Some parameters have been added that I don't know about, and I need to remove them, for…
user994461 • 133
2 votes, 3 answers

robots.txt and other .txt returning 404 on IIS?

We have an IIS site running DotNetNuke that we took over from another group. We have added a robots.txt file to the root, but it returns a 404. Actually, any .txt file in the root seems to return 404. I can't seem to spot where they may have blocked…
schooner2000 • 201
2 votes, 1 answer

Google-Bot fell in love with my 404-page

Every day my access log looks something like this: 66.249.78.140 - - [21/Oct/2013:14:37:00 +0200] "GET /robots.txt HTTP/1.1" 200 112 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 66.249.78.140 - - [21/Oct/2013:14:37:01…
32bitfloat • 253
2 votes, 1 answer

Ideal robots.txt for a gitweb installation?

I host a few git repositories at git.nomeata.de using gitweb (and gitolite). Occasionally, a search engine spider comes along and begins to hammer the interface. While I generally do want my git repositories to show up in search engines, I do not…
Joachim Breitner • 3,779
2 votes, 3 answers

Block Offline Browsers

Is there a way to block offline browsers (like Teleport Pro, Webzip, etc.) that show up in the logs as "Mozilla"? Example: Webzip shows up in my site logs as "Mozilla/4.0 (compatible; MSIE 8.0; Win32)". Teleport Pro shows up in my site logs as…
Alex • 21
2 votes, 2 answers

IIS Spikes in anonymous users - crippling my server

I have a server running Windows Server 2008 R2. Recently my websites have become unresponsive at least once a day, seemingly at random intervals. I have installed some monitoring software and noticed that the anonymous user count spikes when this…
Paul Hinett • 1,205
2 votes, 1 answer

Use robots.txt to prevent crawlers from getting old versions of Trac pages

Looking at my Apache access.log, I see that crawlers tend to get old versions of pages and documents, like: 119.63.196.86 - - [10/Jun/2011:10:36:31 +0200] "GET /wiki/News?version=14 HTTP/1.1" 200 6073 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0;…
2 votes, 3 answers

How much HDD space would I need to cache the web while respecting robots.txt files?

I want to experiment with creating a web crawler. I'll start by indexing a few medium-sized websites like Stack Overflow or Smashing Magazine. If it works, I'd like to start crawling the entire web. I'll respect robots.txt files. I save all HTML, PDF,…
user42235
1 vote, 0 answers

How to block a bad URL path that is not part of my site from showing in Google search?

I have a site running on Node.js (Express) and Apache httpd. Hundreds of requests are coming in from malicious IPs, which I'm proactively blocking. (I have a script that looks at the logs, and if it sees malicious terms, it blocks…
xDG • 123
1 vote, 1 answer

robots.txt route requires a backslash when behind an Application Load Balancer

I have a Rails site using an AWS ALB, and all routes appear to work except one: robots.txt. I am getting the error "ERR_TOO_MANY_REDIRECTS"; link to example: https://www.mamapedia.com/robots.txt. After some research I found many places that said the…
1 vote, 1 answer

Redirect "robots.txt" on specific domain

I want to redirect all requests for "robots.txt" if the domain contains ".our-internal-devel-domain.de". It should be server-wide, because when we develop a website and publish it over our test domain, I don't want to have it on Google, so I want to…
chmod777 • 11
1 vote, 1 answer

High number of hits from the Facebook crawler on server

There are about 3,000 or more 404 hits daily from the Facebook crawler. The log looks like: X.X.X.X Y.Y.Y.Y - - [24/May/2017:03:43:35 +0000] "GET /health-and-medicine/trumps-2018-budget-cuts-funding-for-cancer-mental-health-and-hiv-research/ HTTP/1.1" 404 292…
YATIN GUPTA • 203