Questions tagged [robots.txt]

Robots.txt (the Robots Exclusion Protocol) is a text file placed in the root of a website's domain to give instructions to compliant web robots (such as search engine crawlers) about which pages to crawl and not crawl, as well as other information such as a Sitemap location. In modern frameworks it can be useful to programmatically generate the file. General questions about Search Engine Optimization are more appropriate on the Webmasters Stack Exchange site.

Website owners use the /robots.txt file to give instructions about their site to web robots; this is called The Robots Exclusion Protocol.

It works like this: a robot wants to visit a website URL, say http://www.example.com/welcome.html. Before it does so, it first checks for http://www.example.com/robots.txt, and finds:

User-agent: *
Disallow: /

The "User-agent: *" means this section applies to all robots. The "Disallow: /" tells the robot that it should not visit any pages on the site.

There are two important considerations when using /robots.txt:

  • Robots can ignore your /robots.txt. In particular, malware robots that scan the web for security vulnerabilities, and email-address harvesters used by spammers, will pay no attention.
  • The /robots.txt file is publicly available. Anyone can see which sections of your server you don't want robots to use, so don't try to use /robots.txt to hide information.

More information can be found at http://www.robotstxt.org/.
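As the description above notes, modern frameworks often generate robots.txt programmatically instead of serving a static file. A minimal sketch using FastAPI (one of the frameworks asked about below); the route content and path rules are illustrative assumptions, not recommendations:

from fastapi import FastAPI
from fastapi.responses import PlainTextResponse

app = FastAPI()

@app.get("/robots.txt", response_class=PlainTextResponse)
def robots() -> str:
    # Building the file in code lets the rules vary per environment,
    # e.g. a staging deployment could return "Disallow: /" instead.
    return (
        "User-agent: *\n"
        "Disallow: /private/\n"
        "Sitemap: https://www.example.com/sitemap.xml\n"
    )

The same pattern (a route that returns text/plain) exists in Django, ASP.NET MVC, and the other frameworks that come up in the questions below.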

1426 questions
8
votes
2 answers

Multiple User Agents in Robots.txt

In my robots.txt file I have the following sections: User-Agent: Bot1 Disallow: /A User-Agent: Bot2 Disallow: /B User-Agent: * Disallow: /C Will the statement Disallow: /C be visible to Bot1 & Bot2?
GoodSp33d
  • 6,252
  • 4
  • 35
  • 67
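The crux of the question above is the group-selection rule: a compliant robot obeys only the single group whose User-agent line best matches it, not the union of all groups. So Bot1 follows only Disallow: /A, Bot2 only Disallow: /B, and the * group applies only to robots matched by neither. If /C should be off-limits to the named bots as well, it has to be repeated in their groups, roughly:

User-agent: Bot1
Disallow: /A
Disallow: /C

User-agent: Bot2
Disallow: /B
Disallow: /C

User-agent: *
Disallow: /C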
7
votes
1 answer

FastAPI, robots.txt and noindex

Does FastAPI need robots.txt and the noindex tag? I am creating a business API app which shouldn't be called anonymously. So I wonder whether I have to prepare robots.txt and the noindex tag in order to avoid any crawler's action or not. I made…
tomo
  • 71
  • 2
7
votes
1 answer

Java robots.txt parser with wildcard support

I'm looking for a robots.txt parser in Java which supports the same pattern-matching rules as the Googlebot. I've found some libraries to parse robots.txt files, but none of them supports Googlebot-style pattern matching: Heritrix (there is an…
clement
  • 81
  • 5
7
votes
1 answer

Should sitemap be disallowed in robots.txt? And robots.txt itself?

This is a very basic question, but I can't find a direct answer anywhere online. When searching for my website on Google, sitemap.xml and robots.txt are returned as search results (amongst more useful results). To prevent this, should I add the…
RLJ
  • 135
  • 2
  • 6
  • 10
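Disallowing robots.txt inside robots.txt is self-defeating (a robot would have to fetch the file to learn it may not fetch the file), and disallowing sitemap.xml would stop crawlers from reading the sitemap at all. The usual way to keep such files out of search results is to have the web server send an X-Robots-Tag response header for them, which removes a URL from the index without blocking crawling:

X-Robots-Tag: noindex

How that header gets attached to the two files depends on the server (Apache, nginx, IIS, etc.).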
7
votes
1 answer

"Lighthouse was unable to download a robots.txt file" despite the file being accessible

I have a NodeJS/NextJS app running at http://www.schandillia.com. The project has a robots.txt file accessible at http://www.schandillia.com/robots.txt. As of now, the file is bare-bones for testing purposes: User-agent: * Allow: / However, when I…
TheLearner
  • 2,813
  • 5
  • 46
  • 94
7
votes
1 answer

React Router v4: serve static file (robots.txt)

How can I put my robots.txt file at the path www.domain.com/robots.txt? No server is used; it's only a frontend with React Router. robots.txt --> in root folder ./ app.js --> in src folder ./src/ (...) export class App extends React.Component { …
7
votes
2 answers

Robots.txt: Disallow subdirectory but allow directory

I want to allow crawling of files in: /directory/ but not crawling of files in: /directory/subdirectory/ Is the correct robots.txt instruction: User-agent: * Disallow: /subdirectory/ I'm afraid that if I disallowed /directory/subdirectory/ that I…
user523521
  • 121
  • 1
  • 8
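For the question above: Disallow rules match URL paths by prefix, starting from the root. A rule that excludes only the subdirectory leaves its parent crawlable:

User-agent: *
Disallow: /directory/subdirectory/

The rule quoted in the question, Disallow: /subdirectory/, would match almost nothing here, because no URL on the site actually begins with /subdirectory/.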
7
votes
2 answers

Twitter meta image is not rendering on Twitter because it "may be restricted by the site's robots.txt file"

So this is the link. When I try it on Twitter the image somehow doesn't work, while it works for Facebook. It is working for Facebook only, but for Twitter I am getting this issue: WARN: The image URL…
ujwal dhakal
  • 2,289
  • 2
  • 30
  • 50
7
votes
1 answer

Robots.txt: Is this wildcard rule valid?

Simple question. I want to add: Disallow */*details-print/ Basically, blocking rules in the form of /foo/bar/dynamic-details-print --- foo and bar in this example can also be totally dynamic. I thought this would be simple, but then on…
Bartek
  • 15,269
  • 2
  • 58
  • 65
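On the wildcard question above: the original standard has no wildcards, but major crawlers such as Googlebot additionally support * (match any characters) and $ (match end of URL) in path rules. In that dialect the rule should begin with a slash, roughly:

User-agent: *
Disallow: /*details-print/

Robots that don't implement wildcards treat the value as a literal path prefix, so for them this line is effectively a no-op rather than an over-broad block.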
7
votes
1 answer

Stop google indexing subdomain

I have a subdomain "klient" for testing our clients' websites, and I don't want it to be indexed. I have set this in robots.txt (in the root of our web): User-agent: * disallow: /subdom/klient/* But I'm not sure if it really works, because I…
stepik21
  • 2,610
  • 3
  • 22
  • 32
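On the subdomain question above: robots.txt is fetched per host, so rules in the main domain's file don't govern what crawlers do on the subdomain. The test subdomain needs its own file at its own root; to keep it out entirely, a file served at http://klient.example.com/robots.txt (hostname hypothetical) could simply read:

User-agent: *
Disallow: /

Note that Disallow only blocks crawling; pages discovered through external links can still be indexed without content, so a noindex mechanism is the stronger tool for de-indexing.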
7
votes
5 answers

BOT/Spider Trap Ideas

I have a client whose domain seems to be getting hit pretty hard by what appears to be a DDoS. In the logs it's normal-looking user agents with random IPs, but they're flipping through pages too fast to be human. They also don't appear to be…
Mikey1980
  • 971
  • 4
  • 15
  • 24
7
votes
3 answers

Best practice to create robots.txt file inside my asp.net mvc web site

I want to create a robots.txt for my ASP.NET MVC 5 web site. I found this link, which talks about achieving this task: http://rehansaeed.com/dynamically-generating-robots-txt-using-asp-net-mvc/ In this link they are creating a separate…
user1404577
7
votes
3 answers

robots.txt parser java

I want to know how to parse robots.txt in Java. Is there already any code?
zahir hussain
  • 3,711
  • 10
  • 29
  • 36
7
votes
1 answer

Django - Loading Robots.txt through generic views

I have uploaded robots.txt into my templates directory on my production server. I am using generic views; from django.views.generic import TemplateView (r'^robots\.txt$', TemplateView.as_view(template_name='robots.txt',…
Uma
  • 689
  • 2
  • 12
  • 35
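The Django excerpt above is cut off mid-snippet; a complete urls.py in the same TemplateView style, written in current Django syntax and with a content_type added so the file isn't served as text/html, might look like this sketch:

from django.urls import re_path
from django.views.generic import TemplateView

urlpatterns = [
    # Render templates/robots.txt as plain text at /robots.txt
    re_path(r'^robots\.txt$', TemplateView.as_view(
        template_name='robots.txt',
        content_type='text/plain',
    )),
]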
7
votes
2 answers

Azure domain being indexed by google

I have a website that has a domain 'example.azurewebsites.net'. I also have a custom domain configured for it 'www.example.com'. Google is indexing my 'example.azurewebsites.net' website and I want it to stop and only index it has…
user342706