Questions tagged [robots.txt]

Robots.txt (the Robots Exclusion Protocol) is a text file placed in the root of a website domain to give instructions to compliant web robots (such as search engine crawlers) about which pages to crawl and which not to crawl, as well as other information such as a Sitemap location. In modern frameworks it can be useful to generate the file programmatically. General questions about search engine optimization are better suited to the Webmasters Stack Exchange site.
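
As an illustration of programmatic generation, here is a minimal sketch of serving robots.txt from a Django view (the view name, URL pattern, and rules are illustrative assumptions, not a required setup):

from django.http import HttpResponse
from django.urls import path

def robots_txt(request):
    # Assemble the rules dynamically; these particular lines are just an example
    lines = [
        "User-agent: *",
        "Disallow: /private/",
        "Sitemap: https://www.example.com/sitemap.xml",
    ]
    return HttpResponse("\n".join(lines), content_type="text/plain")

urlpatterns = [
    path("robots.txt", robots_txt),
]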

Website owners use the /robots.txt file to give instructions about their site to web robots; this is called The Robots Exclusion Protocol.

It works like this: a robot wants to visit a website URL, say http://www.example.com/welcome.html. Before it does so, it first checks for http://www.example.com/robots.txt, and finds:

User-agent: *
Disallow: /

The "User-agent: *" means this section applies to all robots. The "Disallow: /" tells the robot that it should not visit any pages on the site.

There are two important considerations when using /robots.txt:

  • robots can ignore your /robots.txt. In particular, malware robots that scan the web for security vulnerabilities and email address harvesters used by spammers will pay no attention to it.
  • the /robots.txt file is a publicly available file. Anyone can see what sections of your server you don't want robots to use, so don't try to use /robots.txt to hide information.

More information can be found at http://www.robotstxt.org/.

1426 questions
-2 votes, 3 answers

Can we completely prevent robots from accessing our web application?

As far as I know, if we want to prevent robots from accessing our websites, we have to parse the 'User-Agent' header in the HTTP request and check whether the request comes from a robot or a browser. I think we cannot completely prevent robots from accessing our websites…
LHA
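
For context, User-Agent sniffing usually looks something like this minimal Python sketch (the token list is an illustrative assumption); since a client can send any User-Agent it likes, this can never be a complete block:

# Hypothetical substrings that suggest a robot; real lists are much longer
KNOWN_BOT_TOKENS = ("bot", "crawler", "spider", "slurp")

def looks_like_robot(user_agent: str) -> bool:
    # Best-effort guess from the User-Agent header; trivially spoofable
    ua = user_agent.lower()
    return any(token in ua for token in KNOWN_BOT_TOKENS)

print(looks_like_robot("Mozilla/5.0 (compatible; Googlebot/2.1)"))    # True
print(looks_like_robot("Mozilla/5.0 (Windows NT 10.0) Firefox/115"))  # False
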
-2 votes, 5 answers

Robots.txt in my project root

I've seen tutorials/articles discussing the use of robots.txt. Is this still a necessary practice? Do we still need to use this technique?
id.ot
-3 votes, 1 answer

Robots.txt file and Googlebot crawlability

Will this robots.txt allow Googlebot to crawl my site or not?

Disallow: /
User-agent: Robozilla
Disallow: /
User-agent: *
Disallow:
Disallow: /cgi-bin/
Sitemap: https://koyal.pk/sitemap/sitemap.xml
-3 votes, 1 answer

Use of Robots.txt

Why would I want a robots.txt file on my website? I know how to use it; I want to know why it is used.
-3 votes, 1 answer

How does LinkedIn tell the difference between user requests and crawler requests?

When I try to download one page from LinkedIn with the following command: curl -I https://www.linkedin.com/company/google I get a 999 status code:

HTTP/1.1 200 Connection established
HTTP/1.1 999 Request denied
Date: Tue, 30 Aug 2016 08:19:35…
-3 votes, 1 answer

Website Description not showing in Google search engine

My website description is not showing in the Google search engine. I wrote the description meta tag inside the head tag, and my robots.txt file is as below:

User-agent: *
Disallow: /

When I search in Google I get the below message: A description for…
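
For comparison, a robots.txt that permits crawling (so Google can fetch the page and display its description) is the opposite of the file quoted above; a minimal sketch:

User-agent: *
Disallow:

An empty Disallow value allows everything to be crawled.
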
-3 votes, 1 answer

Kentico v6 questions - Regarding 301 redirect, robots file & XML Sitemap

I have the redirect non-www to www code ready, but where do I apply it in Kentico v6? How do I update the robots.txt file in Kentico v6? How can I add an XML sitemap in Kentico v6?
-3 votes, 1 answer

How to insert Robots.txt file

I made a website with a single page, but I want to know where to put the robots.txt file and also where to insert the Google Analytics code.
Tarun
-3 votes, 1 answer

How to block a spider if it's disobeying the rules of robots.txt

Is there any way to block crawler/spider bots if they're not obeying the rules written in the robots.txt file? If yes, where can I find more info about it? I would prefer a .htaccess rule; if not, then PHP.
dotzzy
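
Since robots.txt is purely advisory, blocking a disobedient bot has to happen server-side. One common pattern is an Apache mod_rewrite rule in .htaccess keyed on the User-Agent; a minimal sketch, where "BadBot" is a hypothetical user-agent substring:

# Return 403 Forbidden to any client whose User-Agent contains "BadBot"
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} BadBot [NC]
RewriteRule .* - [F,L]
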
-3 votes, 1 answer

How can I tell Google not to crawl a set of URLs

How do I stop Google from crawling certain URLs in my application? For example, I want Google to stop crawling all the URLs that start with http://www.myhost-test.com/. What should I add to my robots.txt?
bsae
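
Assuming the goal is to keep Google off the whole test host, the usual approach is to serve a robots.txt at http://www.myhost-test.com/robots.txt that disallows everything for Googlebot (a minimal sketch):

User-agent: Googlebot
Disallow: /
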
-4 votes, 1 answer

Django: Heroku "App crashed" method=GET path="/robots.txt"

My app works fine on localhost, but whenever I push it to Heroku it gives the error below and the app crashes
Gaurav
-4 votes, 1 answer

Mqtt data package size when simulating image data mining from a robotics device

Visual data mining is the process of interaction and analytical reasoning with one or more visual representations of abstract data. The process may lead to the visual discovery of robust patterns in these data or provide some guidance for the…
Tharusha
-4 votes, 1 answer

Facebook User Agent Crawler "Facebot/1.0"

Today our server got hit by a large number of requests from Facebook IPs in the range 66.220.159.XXX. The user agent given is "Facebot/1.0". I can't find any information on this on the Facebook site; it seems it's not one of the regular Facebook user agents…
araresight
-5 votes, 3 answers

Using Robots.txt for Finding Username and Password

How can I find the username and password of a website via robots.txt?
Samdig
-5 votes, 2 answers

Disallow In-Page URL Crawls

I want to disallow all bots from crawling a specific type of page. I know this can be done via robots.txt as well as .htaccess. However, these pages are generated from the database in response to the user's request. I have searched the internet and could not…
Abhimanyu Saharan
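
When dynamically generated pages share a common URL prefix, one robots.txt rule covers them all; a minimal sketch, assuming a hypothetical /generated/ prefix:

User-agent: *
Disallow: /generated/

For pages without a common prefix, sending an X-Robots-Tag: noindex response header from the application is another common way to keep compliant bots from indexing them.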