Questions tagged [robots.txt]

Robots.txt (the Robots Exclusion Protocol) is a text file placed in the root of a website domain to give instructions to compliant web robots (such as search engine crawlers) about which pages to crawl and which not to crawl, as well as other information such as a Sitemap location. In modern frameworks it can be useful to generate the file programmatically. General questions about search engine optimization are better suited to the Webmasters Stack Exchange site.
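
As an illustration of programmatic generation, here is a minimal sketch of serving robots.txt from a Django view (the view name, URL pattern, and rules are illustrative assumptions, not a required setup):

from django.http import HttpResponse
from django.urls import path

def robots_txt(request):
    # Assemble the rules dynamically; these particular lines are just an example
    lines = [
        "User-agent: *",
        "Disallow: /private/",
        "Sitemap: https://www.example.com/sitemap.xml",
    ]
    return HttpResponse("\n".join(lines), content_type="text/plain")

urlpatterns = [
    path("robots.txt", robots_txt),
]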

Website owners use the /robots.txt file to give instructions about their site to web robots; this is called The Robots Exclusion Protocol.

It works like this: a robot wants to visit a website URL, say http://www.example.com/welcome.html. Before it does so, it first checks for http://www.example.com/robots.txt, and finds:

User-agent: *
Disallow: /

The "User-agent: *" means this section applies to all robots. The "Disallow: /" tells the robot that it should not visit any pages on the site.

There are two important considerations when using /robots.txt:

  • robots can ignore your /robots.txt. In particular, malware robots that scan the web for security vulnerabilities and email address harvesters used by spammers will pay no attention to it.
  • the /robots.txt file is a publicly available file. Anyone can see what sections of your server you don't want robots to use, so don't try to use /robots.txt to hide information.

More information can be found at http://www.robotstxt.org/.

1426 questions
-2 votes, 3 answers

Can we completely prevent robots from accessing our web application?

As far as I know, if we want to prevent robots from accessing our websites, we have to parse the 'User-Agent' header in the HTTP request and check whether the request comes from a robot or a browser. I think we cannot completely prevent robots from accessing our websites…
LHA
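
For context, User-Agent sniffing usually looks something like this minimal Python sketch (the token list is an illustrative assumption); since a client can send any User-Agent it likes, this can never be a complete block:

# Hypothetical substrings that suggest a robot; real lists are much longer
KNOWN_BOT_TOKENS = ("bot", "crawler", "spider", "slurp")

def looks_like_robot(user_agent: str) -> bool:
    # Best-effort guess from the User-Agent header; trivially spoofable
    ua = user_agent.lower()
    return any(token in ua for token in KNOWN_BOT_TOKENS)

print(looks_like_robot("Mozilla/5.0 (compatible; Googlebot/2.1)"))    # True
print(looks_like_robot("Mozilla/5.0 (Windows NT 10.0) Firefox/115"))  # False
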
-2 votes, 5 answers

Robots.txt in my project root

I've seen tutorials/articles discussing the use of robots.txt. Is this still a necessary practice? Do we still need to use this technique?
id.ot
-3 votes, 1 answer

Robots.txt file and Googlebot crawlability

Will this robots.txt allow Googlebot to crawl my site or not?

Disallow: /
User-agent: Robozilla
Disallow: /
User-agent: *
Disallow:
Disallow: /cgi-bin/
Sitemap: https://koyal.pk/sitemap/sitemap.xml
-3 votes, 1 answer

Use of Robots.txt

Why would I want a robots.txt file on my website? I know how to use it; I want to know why it is used.
-3 votes, 1 answer

How does LinkedIn tell the difference between user requests and crawler requests?

When I try to download one page from LinkedIn with the following command: curl -I https://www.linkedin.com/company/google I get a 999 status code:

HTTP/1.1 200 Connection established
HTTP/1.1 999 Request denied
Date: Tue, 30 Aug 2016 08:19:35…
-3 votes, 1 answer

Website Description not showing in Google search engine

My website description is not showing in the Google search engine. I wrote the description meta tag inside the head tag, and my robots.txt file is as below:

User-agent: *
Disallow: /

When I search in Google I get the below message: A description for…
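
For comparison, a robots.txt that permits crawling (so Google can fetch the page and display its description) is the opposite of the file quoted above; a minimal sketch:

User-agent: *
Disallow:

An empty Disallow value allows everything to be crawled.
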
-3 votes, 1 answer

Kentico v6 questions - Regarding 301 redirect, robots file & XML Sitemap

I have the redirect non-www to www code ready, but where do I apply it in Kentico v6? How do I update the robots.txt file in Kentico v6? How can I add an XML sitemap in Kentico v6?
-3 votes, 1 answer

How to insert Robots.txt file

I made a website with a single page, but I want to know where to put the robots.txt file and also where to insert the Google Analytics code.
Tarun
-3 votes, 1 answer

How to block a spider if it's disobeying the rules of robots.txt

Is there any way to block crawler/spider bots if they're not obeying the rules written in the robots.txt file? If yes, where can I find more info about it? I would prefer a .htaccess rule; if not, then PHP.
dotzzy
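
Since robots.txt is purely advisory, blocking a disobedient bot has to happen server-side. One common pattern is an Apache mod_rewrite rule in .htaccess keyed on the User-Agent; a minimal sketch, where "BadBot" is a hypothetical user-agent substring:

# Return 403 Forbidden to any client whose User-Agent contains "BadBot"
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} BadBot [NC]
RewriteRule .* - [F,L]
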
-3 votes, 1 answer

How can I tell Google not to crawl a set of URLs

How do I stop Google from crawling certain URLs in my application? For example, I want Google to stop crawling all the URLs that start with http://www.myhost-test.com/. What should I add to my robots.txt?
bsae
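
Assuming the goal is to keep Google off the whole test host, the usual approach is to serve a robots.txt at http://www.myhost-test.com/robots.txt that disallows everything for Googlebot (a minimal sketch):

User-agent: Googlebot
Disallow: /
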
-4 votes, 1 answer

Django: Heroku "App crashed" method=GET path="/robots.txt"

My app works fine on localhost, but whenever I push it to Heroku it gives the error below and the app crashes
Gaurav
-4 votes, 1 answer

Mqtt data package size when simulating image data mining from a robotics device

Visual data mining is the process of interaction and analytical reasoning with one or more visual representations of abstract data. The process may lead to the visual discovery of robust patterns in these data or provide some guidance for the…
Tharusha
-4 votes, 1 answer

Facebook User Agent Crawler "Facebot/1.0"

Today our server got hit by a large number of requests from Facebook IPs in the range 66.220.159.XXX. The user agent given is "Facebot/1.0". I can't find any information on this on the Facebook site; it seems it's not one of the regular Facebook user agents…
araresight
-5 votes, 3 answers

Using Robots.txt for Finding Username and Password

How can I find the username and password of a website via robots.txt?
Samdig
-5 votes, 2 answers

Disallow In-Page URL Crawls

I want to disallow all bots from crawling a specific type of page. I know this can be done via robots.txt as well as .htaccess. However, these pages are generated from the database in response to the user's request. I have searched the internet and could not…
Abhimanyu Saharan
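
When dynamically generated pages share a common URL prefix, one robots.txt rule covers them all; a minimal sketch, assuming a hypothetical /generated/ prefix:

User-agent: *
Disallow: /generated/

For pages without a common prefix, sending an X-Robots-Tag: noindex response header from the application is another common way to keep compliant bots from indexing them.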