Questions tagged [robots.txt]

Robots.txt (the Robots Exclusion Protocol) is a text file placed in the root of a website domain to give instructions to compliant web robots (such as search engine crawlers) about which pages to crawl and not to crawl, as well as other information such as a Sitemap location. In modern frameworks it can be useful to generate the file programmatically. General questions about search engine optimization are more appropriate on the Webmasters Stack Exchange site.

Website owners use the /robots.txt file to give instructions about their site to web robots; this is called The Robots Exclusion Protocol.

It works like this: a robot wants to visit a website URL, say http://www.example.com/welcome.html. Before it does so, it first checks for http://www.example.com/robots.txt, and finds:

User-agent: *
Disallow: /

The "User-agent: *" means this section applies to all robots. The "Disallow: /" tells the robot that it should not visit any pages on the site.

There are two important considerations when using /robots.txt:

  • robots can ignore your /robots.txt. In particular, malware robots that scan the web for security vulnerabilities, and email address harvesters used by spammers, will pay no attention to it.
  • the /robots.txt file is a publicly available file. Anyone can see what sections of your server you don't want robots to use, so don't try to use /robots.txt to hide information.

More information can be found at http://www.robotstxt.org/.

1426 questions
-1
votes
2 answers

Cloud Function robots.txt 404 page not found issue

I have deployed a simple app written in Go on Cloud Functions. A robots.txt file is also included when deploying the app. Normally the app shows the image below, but it shows 404 page not found even though the…
-1
votes
1 answer

How to write robots.txt rules to stop indexing a few webpages

I just want to remove some pages of my website from indexing. For example, when I add case studies or blogs, I don't want all the blogs on my website https://snapvisibility.com/ to be indexed. Here is my existing robots code '''User-agent: * Disallow:…
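A minimal sketch for this kind of question, with hypothetical paths; note that robots.txt only discourages crawling, so a noindex meta tag or header is usually recommended for keeping pages out of the index:

User-agent: *
Disallow: /case-studies/
Disallow: /blog/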
-1
votes
1 answer

Does google bot follow pages labeled as "noindex" "follow"?

Recently I've come across this question: what happens if you label a page as "noindex" + "follow"? I know that "noindex" is used to tell the search engine: "I don't want you to index my page", but what happens if you set "follow" instead of "no…
Nexussim Lements
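For reference, the combination asked about is written as a robots meta tag in the page's <head>; "noindex" asks the engine not to index the page itself, while "follow" still lets it follow the page's outgoing links:

<meta name="robots" content="noindex, follow">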
-1
votes
1 answer

Why is robots.txt important? Is it safe to have a website without robots.txt?

While scraping the web, robots.txt matters and even regulates behavior. But for a Node.js website, is it necessary to have a robots.txt? Further, what is a sitemap and why is it needed, as I found in the example below? User-Agent: * User-agent:…
hemant kumar
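On the sitemap part of the question: a Sitemap line in robots.txt simply points crawlers at an XML file listing the site's URLs, and it is optional. A minimal sketch with a placeholder URL:

User-agent: *
Disallow:

Sitemap: https://www.example.com/sitemap.xml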
-1
votes
1 answer

What's the difference between /somedir/ and /somedir/* in robots.txt?

I want to disallow a specific folder and all of its files and subdirectories, but I don't know the difference between Disallow: /somedir/ and Disallow: /somedir/*. Which one of these lines should I use? By the way, what does Disallow: /somedir? mean?…
user6931342
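In short: Disallow: /somedir/ and Disallow: /somedir/* behave the same for crawlers that support the * wildcard (a common extension, e.g. Googlebot), since both match everything under the directory. The bare prefix form matches more. A sketch:

User-agent: *
# Blocks /somedir/ and everything beneath it
Disallow: /somedir/
# Equivalent, for crawlers that support the * wildcard
Disallow: /somedir/*
# Plain prefix match: also blocks /somedir.html and /somedir-archive/
Disallow: /somedir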
-1
votes
1 answer

Allow access only to Googlebot - robots.txt

I want to allow access to my website for a single crawler - the Googlebot one. In addition, I want Googlebot to crawl and index my site according to the sitemap only. Is this the right code? I know that only "good" bots follow the robots.txt…
dan
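The usual pattern for this, with a placeholder sitemap URL: an empty Disallow lets Googlebot fetch everything, while the catch-all group blocks all other compliant robots (non-compliant robots will simply ignore the file):

User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /

Sitemap: https://www.example.com/sitemap.xml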
-1
votes
1 answer

Search engines and robots.txt file

I have a WordPress site and don't want Google to crawl the documents in a specific folder of my media. I have created a robots.txt file which disallowed that path, but I just found out that Google indexed all the info in those documents. What else…
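Because robots.txt only blocks crawling, documents that were already indexed, or are linked from elsewhere, can still show up in results. A commonly suggested alternative is an X-Robots-Tag response header; a sketch for Apache with mod_headers enabled (note the files must stay crawlable so Google can see the header, so remove the Disallow for them):

<FilesMatch "\.pdf$">
  Header set X-Robots-Tag "noindex"
</FilesMatch>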
-1
votes
1 answer

How can I block search engine indexing of files and subdirectories other than root directory .php and .html files without listing the directory names?

I would like to make it so that search engines only index .html and .php files in my site's root directory and no subdirectories. I want to do this without explicitly listing the directory names in the robots.txt file so it's not easy for…
user3000992
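For crawlers that honor wildcards and the $ end-anchor (a Google extension, not part of the original protocol), a sketch in this direction follows. Caveat: * also matches slashes, so .html and .php files inside subdirectories would still be allowed; robots.txt has no way to say "no further / after this point", which is what makes this question awkward:

User-agent: *
Allow: /$
Allow: /*.html$
Allow: /*.php$
Disallow: /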
-1
votes
2 answers

how to parse the meta tag in the webpage

Possible Duplicate: CodeIgniter: A Class/Library to help get meta tags from a web page? Can anybody write a simple program that reports found or not found for meta tags, all tags, and the robots.txt file?
Neelesh
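The linked duplicate is about CodeIgniter, but the idea is language-independent. A minimal sketch with Python's standard library that reports whether a robots meta tag is present (the URL is a placeholder):

from html.parser import HTMLParser
from urllib.request import urlopen

class MetaRobotsParser(HTMLParser):
    """Records the content of a <meta name="robots"> tag, if any."""
    def __init__(self):
        super().__init__()
        self.robots_content = None

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attrs = dict(attrs)
            if (attrs.get("name") or "").lower() == "robots":
                self.robots_content = attrs.get("content")

html = urlopen("https://www.example.com/").read().decode("utf-8", "replace")
parser = MetaRobotsParser()
parser.feed(html)
print(parser.robots_content or "not found")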
-1
votes
2 answers

How to make a robots.txt on Django

I've seen other answers here, but they aren't really helpful, which is why I am asking. I tried the django-robots framework as well, but it gives me an error when I just put 'robots' in my INSTALLED_APPS: INSTALLED_APPS =…
reydript
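One frequently suggested alternative to django-robots is to serve robots.txt from a plain Django view, so no extra app is needed. A minimal sketch (the rules are placeholders):

# urls.py
from django.http import HttpResponse
from django.urls import path

def robots_txt(request):
    # Placeholder rules; adjust for your site
    lines = [
        "User-agent: *",
        "Disallow: /admin/",
    ]
    return HttpResponse("\n".join(lines), content_type="text/plain")

urlpatterns = [
    path("robots.txt", robots_txt),
]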
-1
votes
5 answers

How to prevent robots.txt passing from staging env to production?

It has happened in the past that one of our IT specialists accidentally moved the robots.txt from staging to production, blocking Google and others from indexing our customers' site in production. Is there a good way of managing this…
Geo
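One way to avoid shipping the wrong file at all is to stop keeping a static robots.txt per environment and to generate it from configuration instead, failing closed unless the environment is explicitly production. A sketch continuing the Django example above; the ENVIRONMENT variable is hypothetical:

import os

from django.http import HttpResponse

def robots_txt(request):
    # Hypothetical env var; anything that is not explicitly
    # "production" gets the lock-everything-out file
    if os.environ.get("ENVIRONMENT") == "production":
        body = "User-agent: *\nDisallow:\n"
    else:
        body = "User-agent: *\nDisallow: /\n"
    return HttpResponse(body, content_type="text/plain")

The same idea works in any framework: the safe, restrictive file is the default, and only production opts out of it.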
-1
votes
1 answer

Errors in robots.txt keep piling up even though it is fixed

Somebody messed up our robots.txt by accidentally adding \n after our entire allow: /products/ rule, which covers about 30,000 pages in total. The errors appear on multiple language sites. This is one of our Search Consoles. I quickly noticed the error and…
Kevin Tad
-1
votes
1 answer

Stop changing the source file name with destination file name

I found this very cool VBA macro; it does what it says, but as I observed, it keeps replacing the source file name with the destination file name. Can anyone please provide an alternate line of code to stop it altering the source file? What this macro actually does is,…
Directionsky
-1
votes
1 answer

Submitted URL blocked by robots.txt

In the last few weeks Google has been reporting an error in the Search Console. More and more of my pages are not allowed to be crawled - the Coverage report says: Submitted URL blocked by robots.txt. As you see, my robots.txt is ultra simple, so why for about…
-1
votes
1 answer

How to block specific urls by using robots.txt?

How can I block specific URLs by using robots.txt? We do not want Google to crawl certain URLs on our site. How can I define a disallow rule for those URLs in a robots.txt file?
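A minimal sketch with hypothetical paths; each Disallow line is a URL-path prefix (there is no "tag" syntax in robots.txt itself), and blocking crawling does not by itself remove already-indexed pages:

User-agent: Googlebot
Disallow: /private-page.html
Disallow: /checkout/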