Questions tagged [robots.txt]

A convention for asking web crawlers not to crawl some or all of a website.

If a site owner wishes to give instructions to web robots, they must place a text file called robots.txt in the root of the web site hierarchy (e.g. www.example.com/robots.txt). This text file should contain the instructions in a specific format (see the example below). Robots that choose to follow the instructions try to fetch this file and read the instructions before fetching any other file from the web site. If this file doesn't exist, web robots assume that the site owner wishes to provide no specific instructions.
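
For example, a minimal robots.txt placed in the site root might look like this (the paths are only illustrative):

    User-agent: *
    Disallow: /private/
    Disallow: /tmp/

    User-agent: BadBot
    Disallow: /

Each User-agent line starts a group naming a crawler (or * for all crawlers), and the Disallow lines that follow list the path prefixes that crawler is asked not to fetch.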

A robots.txt file on a website functions as a request that specified robots ignore specified files or directories when crawling the site. This might be done, for example, out of a preference for privacy from search engine results, a belief that the content of the selected directories might be misleading or irrelevant to the categorization of the site as a whole, or a desire that an application operate only on certain data. Links to pages listed in robots.txt can still appear in search results if they are linked to from a page that is crawled.

For websites with multiple subdomains, each subdomain must have its own robots.txt file. If example.com had a robots.txt file but a.example.com did not, the rules that would apply for example.com would not apply to a.example.com.

Source: Wikipedia

86 questions
1
vote
1 answer

Custom robots.txt being overwritten in Azure IIS 8 by something

We have a custom robots.txt in the root of our Azure cloud service website (IIS) that does not display correctly when navigating to www.oursite.com/robots.txt. A “different” robots.txt file is displayed instead, containing: User-agent: LinkChecker Allow: / …
Brian
  • 11
  • 2
1
vote
0 answers

How to block fake google spider and fake web browser access?

Recently I found that some people are trying to mirror my website. They are doing this in two ways: pretending to be Google spiders. The access logs look like the following: 89.85.93.235 - - [05/May/2015:20:23:16 +0800] "GET /robots.txt HTTP/1.0" 444 0…
Meteor
  • 151
  • 1
  • 6
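
A common way to separate a genuine Googlebot from an impostor, relevant to the question above, is a reverse DNS lookup on the logged IP followed by a forward lookup to confirm it; real Googlebot addresses resolve to hostnames under googlebot.com or google.com. A rough sketch using the IP from the log excerpt:

    host 89.85.93.235
    # then forward-confirm whatever hostname the reverse lookup returned, e.g.:
    host crawl-66-249-66-1.googlebot.com

Requests that fail this check can then be blocked at the web server or firewall level.
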
1
vote
1 answer

apache robots.txt with SSL

I have an .htaccess file with a rewrite rule that redirects every HTTP request to HTTPS. But now I have the problem that my robots.txt is not recognized by some online checkers. If I remove the redirect from the .htaccess file, the robots.txt is…
user224013
  • 13
  • 4
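
If the checker in question fetches robots.txt over plain HTTP and does not follow redirects, one common workaround, sketched here for an .htaccess-based setup like the one described, is to exempt robots.txt from the HTTP-to-HTTPS rewrite:

    RewriteEngine On
    RewriteCond %{HTTPS} off
    RewriteCond %{REQUEST_URI} !^/robots\.txt$
    RewriteRule ^ https://%{HTTP_HOST}%{REQUEST_URI} [R=301,L]

Every other HTTP request is still redirected to HTTPS, while robots.txt stays reachable on both schemes.
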
1
vote
1 answer

Google-bot trips on a perfectly normal robots.txt, then on a nonexistent robots.txt

I have two domain names pointing to the same virtual server. One of them, http://ilarikaila.com, is a working brochure website I made for a friend. I used the other one, http://teemuleisti.com, to test-drive the site before making it public – in…
Teemu Leisti
  • 123
  • 8
1
vote
1 answer

Why is my robots.txt not working?

I have this robots.txt: User-Agent: * Disallow: /files/ User-Agent: ia_archiver Allow: / User-agent: Googlebot Disallow: User-agent: googlebot-image Disallow: User-agent: googlebot-mobile Disallow: I am finding that PDF files in the…
MB34
  • 167
  • 2
  • 10
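
Worth noting for the question above: most crawlers obey only the single most specific User-agent group that matches them, so a Googlebot group with an empty Disallow effectively cancels the Disallow: /files/ rule of the * group for Googlebot. If the intent is to keep /files/ blocked for Googlebot as well, the rule has to be repeated inside its group, roughly:

    User-agent: *
    Disallow: /files/

    User-agent: Googlebot
    Disallow: /files/
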
1
vote
2 answers

Thousands of robots.txt 404 errors from bots trying to crawl old multisite

The current situation is that we are getting thousands and thousands of 404 errors from bots looking for robots.txt in different places on our site due to domain redirects. Our old website was a labyrinthine multisite powered by DotNetNuke with multiple…
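
If the 404s come from bots requesting robots.txt under old sub-site paths (for example /oldsite/robots.txt), one low-effort option, sketched here for Apache with mod_rewrite in a root .htaccess (the current server is not stated in the question), is to redirect any such request to the root file:

    RewriteEngine On
    RewriteRule ^(.+)/robots\.txt$ /robots.txt [R=301,L]

In per-directory context the leading slash is stripped before matching, so this catches /anything/robots.txt but not /robots.txt itself, which avoids a redirect loop.
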
1
vote
1 answer

Googlebot can't access my site; Webmaster Tools replies "Unreachable robots.txt"

When I try to fetch my site as Googlebot in Webmaster Tools, it returns "Unreachable robots.txt". After investigating, I understood that Googlebot can see my server: tcpdump | grep google shows that Google can access my server from IP aa.bb.cc.xx or…
1
vote
3 answers

Dynamic robots.txt based on hostname

Is there a way to swap out a robots.txt file in nginx based on hostname? I currently have www.domain.com and backup.domain.com pointing at the same nginx server, but I don't want Google indexing backup.domain.com.
Noodles
  • 1,386
  • 3
  • 18
  • 29
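
One way to do this in nginx, assuming both hostnames are handled by the same server block, is to map the Host header to a file name and alias /robots.txt to it (file names and paths are illustrative):

    map $host $robots_file {
        default              robots-allow.txt;
        backup.domain.com    robots-disallow.txt;
    }

    server {
        server_name www.domain.com backup.domain.com;

        location = /robots.txt {
            alias /var/www/conf/$robots_file;
        }
    }

Here robots-disallow.txt would contain just "User-agent: *" and "Disallow: /". If the two hostnames already have separate server blocks, it is simpler to add a location = /robots.txt with its own file to the backup block only.
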
1
vote
2 answers

Blocking bad bots

I found this script and was wondering whether it is just overkill and even worth using. Would it be better for me to just use mod_security? # Generated using http://solidshellsecurity.com services # Begin block Bad-Robots from robots.txt User-agent:…
Tiffany Walker
  • 6,681
  • 14
  • 56
  • 82
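
As context for the question above: entries in a generated block list like this, e.g. (the bot name is a placeholder)

    User-agent: SomeBadBot
    Disallow: /

are purely advisory and only affect bots that choose to honour robots.txt. Scrapers that ignore the file have to be blocked server-side (mod_security, User-Agent matching in mod_rewrite, firewall rules), so such a list complements rather than replaces those tools.
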
1
vote
0 answers

How to block requests for robots.txt in lighttpd?

We have a dedicated development server which runs only test PHP applications on a public network. We have set up session-based authentication for the site. The issue we have is that there are lots of 404s for robots.txt logged in the access log. So, we want…
Vishnu Kumar
  • 131
  • 5
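
For reference, one way to make lighttpd answer these requests itself instead of logging 404s is a URL conditional that denies access to that path (a sketch, assuming lighttpd 1.4.x):

    $HTTP["url"] =~ "^/robots\.txt$" {
        url.access-deny = ( "" )
    }

This returns 403 for every robots.txt request; alternatively, serving a real disallow-all robots.txt keeps well-behaved crawlers away without producing an error status at all.
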
1
vote
2 answers

How to create a global robots.txt that gets appended to each domain's own robots.txt on Apache?

I know I can create ONE robots.txt file for all domains on an Apache server*, but I want to append to each domain's (pre-existing) robots.txt. I want some general rules in place for all domains, but I need to allow different domains to have their…
Gaia
  • 1,855
  • 5
  • 34
  • 60
1
vote
1 answer

How to Disallow Particular Path in robots.txt

I want to disallow /path but also want to allow /path/another-path in robots.txt. I already tried: Disallow: /path or: Disallow: /path$ but neither works; I mean, it blocked /path/another-path too. Is it possible to do that? Any help would be…
Martin
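
For crawlers that support the Allow directive with longest-match precedence (Googlebot and most major bots do; the original 1994 standard defined only Disallow), the combination asked about above can be written as:

    User-agent: *
    Allow: /path/another-path
    Disallow: /path

The longer, more specific Allow rule wins for URLs under /path/another-path, while everything else under /path stays blocked.
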
1
vote
1 answer

If I deny crawlers access to a directory via robots.txt, will a file in that directory still be indexed if I link to it directly?

I am denying indexing to a folder called pdf via robots.txt. However, I do direct link to a few files that exist in that directory. Will search engines such as Google index those files, or ignore them because they reside in the pdf folder?
kylex
  • 1,421
  • 5
  • 14
  • 18
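
A related point for the question above: robots.txt only prevents crawling, not indexing, so a directly linked file in a blocked directory can still appear in results as a URL-only entry. If the goal is to keep such files out of the index entirely, a common approach is an X-Robots-Tag response header, sketched here for Apache with mod_headers (the file pattern is illustrative); note that the directory must then be crawlable so the header can actually be seen:

    <FilesMatch "\.pdf$">
        Header set X-Robots-Tag "noindex, nofollow"
    </FilesMatch>
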
1
vote
1 answer

Does a forward web proxy exist that checks and obeys robots.txt on remote domains?

Does there exist a forward proxy server that will look up and obey robots.txt files on remote internet domains and enforce them on behalf of requesters going via the proxy? For example, imagine a website at www.example.com that has a robots.txt file that…
wodow
  • 590
  • 1
  • 6
  • 18
1
vote
2 answers

How to block Spiders / Scrapers by REMOTE hostname / domain for all Virtual Hosts in Apache?

I've seen plenty of robots.txt material, and some mod_rewrite solutions that looked promising… but I haven't been able to find a simple solution to block spiders / scrapers / whoever I want to block… I'd rather do this by hostname / domain, as…
mralexgray
  • 1,353
  • 3
  • 12
  • 29
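
For the question above, Apache 2.4's mod_authz_host can match on the client's double-reverse-resolved hostname; placed in the global server configuration it applies to every virtual host (the domain name is a placeholder):

    <Location "/">
        <RequireAll>
            Require all granted
            Require not host badcrawler.example
        </RequireAll>
    </Location>

Host-based rules force a DNS lookup on each request and only work when the client's reverse DNS is properly set up, which is part of why blocking by IP range or User-Agent is more common.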