Questions tagged [robots.txt]

A convention for telling web crawlers which parts of a website they should not crawl.

If a site owner wishes to give instructions to web robots, they must place a text file called robots.txt in the root of the web site hierarchy (e.g. www.example.com/robots.txt). This text file contains the instructions in a specific format (see examples below). Robots that choose to follow the instructions try to fetch this file and read the instructions before fetching any other file from the web site. If this file doesn't exist, web robots assume that the site owner wishes to give no specific instructions.

A robots.txt file on a website functions as a request that specified robots ignore specified files or directories when crawling the site. A site owner might want this, for example, out of a preference for privacy from search engine results, because the content of the selected directories is misleading or irrelevant to the categorization of the site as a whole, or out of a desire that an application only operate on certain data. Note that pages listed in robots.txt can still appear in search results if they are linked to from a page that is crawled.

For websites with multiple subdomains, each subdomain must have its own robots.txt file. If example.com had a robots.txt file but a.example.com did not, the rules that would apply for example.com would not apply to a.example.com.
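The format itself is plain text: each record names one or more user agents, followed by the paths those agents should not fetch. A minimal sketch (the directory paths and the bot name are illustrative, not from any real site):

```
# Applies to all crawlers: keep them out of two directories
User-agent: *
Disallow: /private/
Disallow: /tmp/

# Applies to one specific (hypothetical) crawler: block the whole site
User-agent: BadBot
Disallow: /
```

An empty `Disallow:` line means nothing is disallowed, while `Disallow: /` blocks the entire site; some crawlers also understand non-standard extensions such as `Allow:`, wildcard patterns, and `Sitemap:` lines.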

Source: Wikipedia

86 questions
3 votes, 2 answers

Meaning of Disallow: /*? in robots.txt

Yahoo's robots.txt contains: User-agent: * Disallow: /p/ Disallow: /r/ Disallow: /*? What does the last line mean? ("Disallow: /*?")
hussain
2 votes, 2 answers

Rewrite robots.txt based on host with htaccess

I'm trying to rewrite a filename based on the server's domain. The code below is wrong and not working, but illustrates the desired effect. RewriteRule "^/robots\.txt$" "robots-staging.txt" …
Jay • 157
2 votes, 1 answer

What's with random-character queries coming from googlebot, e.g., vvytnoxvontwusz.html?

One of my sites has been getting queries from googlebot, on the order of: example-log:66.249.79.216 - - [06/Apr/2016:15:36:56 -0700] "GET /vvytnoxvontwusz.html HTTP/1.1" 404 15136 "-" "Mozilla/5.0 (compatible; Googlebot/2.1;…
Jim Miller • 713
2 votes, 1 answer

Remove ?=collcc from URL

Google Webmaster Tools has notified me about too many duplicated URLs. Some parameters have been added that I don't know about, and I need to remove them, for…
user994461 • 133
2 votes, 3 answers

robots.txt and other .txt returning 404 on IIS?

We have an IIS site running DotNetNuke that we took over from another group. We have added a robots.txt file to the root, but it returns a 404. Actually, any .txt file in the root seems to return 404. I can't seem to spot where they may have blocked…
schooner2000 • 201
2 votes, 1 answer

Google-Bot fell in love with my 404-page

Every day my access log looks something like this: 66.249.78.140 - - [21/Oct/2013:14:37:00 +0200] "GET /robots.txt HTTP/1.1" 200 112 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 66.249.78.140 - - [21/Oct/2013:14:37:01…
32bitfloat • 253
2 votes, 1 answer

Ideal robots.txt for a gitweb installation?

I host a few git repositories at git.nomeata.de using gitweb (and gitolite). Occasionally, a search engine spider comes along and begins to hammer the interface. While I generally do want my git repositories to show up in search engines, I do not…
Joachim Breitner • 3,779
2 votes, 3 answers

Block Offline Browsers

Is there a way to block offline browsers (like Teleport Pro, Webzip, etc.) that show up in the logs as "Mozilla"? Example: Webzip shows up in my site logs as "Mozilla/4.0 (compatible; MSIE 8.0; Win32)". Teleport Pro shows up in my site logs as…
Alex • 21
2 votes, 2 answers

IIS Spikes in anonymous users - crippling my server

I have a server running Windows Server 2008 R2. Recently my websites have become unresponsive at least once a day, seemingly at random intervals. I have installed some monitoring software and noticed that the anonymous user count spikes when this…
Paul Hinett • 1,205
2 votes, 1 answer

Use robots.txt to prevent crawlers from getting old versions of Trac pages

Looking at my Apache access.log, I see that crawlers tend to get old versions of pages and documents, like: 119.63.196.86 - - [10/Jun/2011:10:36:31 +0200] "GET /wiki/News?version=14 HTTP/1.1" 200 6073 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0;…
2 votes, 3 answers

How much HDD space would I need to cache the web while respecting robots.txt files?

I want to experiment with creating a web crawler. I'll start by indexing a few medium-sized websites like Stack Overflow or Smashing Magazine. If it works, I'd like to start crawling the entire web. I'll respect robots.txt files. I save all HTML, PDF,…
user42235
1 vote, 0 answers

How to block a bad URL path that is not part of my site from showing in Google search?

I have a site running on Node.js (Express) and Apache httpd. Hundreds of requests are coming in from malicious IPs, which I'm proactively blocking. (I have a script that looks at the logs, and if it sees malicious terms, it blocks…
xDG • 123
1 vote, 1 answer

robots.txt route requires a backslash when behind an Application Load Balancer

I have a Rails site using an AWS ALB, and all routes appear to work except one: robots.txt. I am getting the error "ERR_TOO_MANY_REDIRECTS"; link to example: https://www.mamapedia.com/robots.txt. After some research I found many places that said the…
1 vote, 1 answer

Redirect "robots.txt" on specific domain

I want to redirect all requests for "robots.txt" if the domain contains ".our-internal-devel-domain.de". It should be server-wide, because when we develop a website and publish it over our test domain, I don't want to have it on Google, so I want to…
chmod777 • 11
1 vote, 1 answer

High number of hits from the Facebook crawler on server

There are about 3,000 or more 404 hits daily from the Facebook crawler. The log looks like: X.X.X.X Y.Y.Y.Y - - [24/May/2017:03:43:35 +0000] "GET /health-and-medicine/trumps-2018-budget-cuts-funding-for-cancer-mental-health-and-hiv-research/ HTTP/1.1" 404 292…
YATIN GUPTA • 203