Questions tagged [robots.txt]

Robots.txt (the Robots Exclusion Protocol) is a text file placed in the root of a website's domain to give instructions to compliant web robots (such as search engine crawlers) about which pages to crawl and not crawl, as well as other information such as a Sitemap location. In modern frameworks it can be useful to programmatically generate the file. General questions about search engine optimization are more appropriate on the Webmasters Stack Exchange site.
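For instance, a minimal sketch of generating the file on the fly in Python with Flask (the framework, route, and rules here are illustrative assumptions, not a fixed recipe):

from flask import Flask, Response

app = Flask(__name__)

@app.route("/robots.txt")
def robots_txt():
    # Build the rules at request time, e.g. to vary them per environment
    lines = [
        "User-agent: *",
        "Disallow: /staging/",
        "Sitemap: https://example.com/sitemap.xml",
    ]
    return Response("\n".join(lines) + "\n", mimetype="text/plain")

Serving the file from a route lets the rules vary per deployment instead of shipping a static file.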

Website owners use the /robots.txt file to give instructions about their site to web robots; this is called The Robots Exclusion Protocol.

It works like this: a robot wants to visit a website URL, say http://www.example.com/welcome.html. Before it does so, it first checks for http://www.example.com/robots.txt, and finds:

User-agent: *
Disallow: /

The "User-agent: *" means this section applies to all robots. The "Disallow: /" tells the robot that it should not visit any pages on the site.

There are two important considerations when using /robots.txt:

  • robots can ignore your /robots.txt. In particular, malware robots that scan the web for security vulnerabilities and email address harvesters used by spammers will pay no attention.
  • the /robots.txt file is a publicly available file. Anyone can see what sections of your server you don't want robots to use, so don't try to use /robots.txt to hide information.

More information can be found at http://www.robotstxt.org/.

1426 questions
0 votes, 1 answer

Blocking files in robots.txt with [possibly] more than one file extension

Is this correct syntax? Disallow: /file_name.* If not, is there a way to accomplish this without listing each file twice [multiple times]?
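In the original exclusion standard, Disallow values are plain path prefixes with no wildcard support, so a single prefix rule (the file name is the asker's hypothetical) already covers every extension:

User-agent: *
Disallow: /file_name.

Crawlers that do implement wildcards, such as Googlebot, treat Disallow: /file_name.* the same way.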
DaedBaet • 429 • 1 • 5 • 17
0 votes, 2 answers

How to let Google crawl PDF files but not index them?

If I understand it right, you can only tell Google to crawl or not crawl PDF files via robots.txt. I want Google to crawl the files, but not list them on the search results pages. Is this possible?
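robots.txt only controls crawling; indexing is usually controlled with an X-Robots-Tag response header on the files themselves. A sketch for Apache, assuming mod_headers is enabled:

<FilesMatch "\.pdf$">
  Header set X-Robots-Tag "noindex"
</FilesMatch>

This works precisely because crawling stays allowed: the crawler has to fetch the PDF to see the header.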
mostwanted • 1,549 • 3 • 13 • 21
0 votes, 1 answer

Why is this wrong in the robots.txt file even after specifying the URL?

In the robots.txt file I have put the URL /custompages/*, and Googlebot should not crawl pages that match "/custompages/". But when I looked into Webmaster Tools, I can still see the error messages from those links. User-agent: * …
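Since Disallow matches by prefix, the trailing wildcard is redundant; a rule like the following blocks everything under the directory, though errors already recorded in Webmaster Tools can take time to clear:

User-agent: *
Disallow: /custompages/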
Manojkumar • 1,351 • 5 • 35 • 63
0 votes, 1 answer

How and why do they redirect their robots.txt file to their site's homepage?

A robots.txt file is usually just a text file under your site root directory. For example, you can view www.amazon.com/robots.txt. But today, I found a website with a strange robots.txt file. If you just type http://xli.bugs3.com/robots.txt it…
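One way such a redirect could be produced is a server-side rewrite; a hypothetical Apache sketch, not necessarily what that site does:

RewriteEngine on
RewriteRule ^robots\.txt$ / [R=301,L]

A crawler that follows the redirect ends up parsing the homepage HTML as robots.txt, finds no valid rules, and typically treats the site as fully allowed.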
fanchyna • 2,623 • 7 • 36 • 38
0 votes, 1 answer

Prevent search engines from indexing script URLs

How can I prevent search engines from indexing script URLs like domain.tld/?[whatever is here]? In robots.txt, User-agent: * Disallow: /? does seem to work, but I still want to allow indexing of the main page.
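With the wildcard syntax the major engines support, a query-string-only rule leaves the bare main page crawlable:

User-agent: *
Disallow: /*?

The main page / contains no ?, so it stays allowed.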
Alex G • 3,048 • 10 • 39 • 78
0 votes, 1 answer

Presence of .htaccess File in Root Directory When Using a WordPress Blog in a Subdirectory

I have an .htaccess file in the root directory of all my sites which handles canonical rewrites. The entire content of the .htaccess file is as follows:

Options +Indexes
Options +FollowSymLinks
RewriteEngine on
RewriteCond %{HTTP_HOST}…
0 votes, 1 answer

Disallow dynamic pages in robots.txt

How would I disallow all dynamic pages within my robots.txt? E.g. page.php?hello=there, page.php?hello=everyone, page.php?thank=you. I would like page.php AND all possible dynamic versions to be disallowed. At the moment I have User-Agent: * Disallow:…
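Because Disallow matches by prefix, disallowing the script path also covers every query-string variant:

User-agent: *
Disallow: /page.php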
Hey There • 25 • 5
0 votes, 2 answers

Robots.txt to allow a few specific crawlers and deny all others

I have been getting a lot of CPU spikes recently on my server and somehow I believe it's not the real traffic or some part of it isn't real. So I want to only allow Google bots, MSN and Yahoo for now. Please guide me if the following robots.txt file…
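Such a whitelist gives the named bots an empty Disallow and blocks everyone else; the agent tokens below (Googlebot, msnbot, Slurp) are the commonly used names for these crawlers and worth verifying:

User-agent: Googlebot
Disallow:

User-agent: msnbot
Disallow:

User-agent: Slurp
Disallow:

User-agent: *
Disallow: /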
0 votes, 1 answer

Accessing a robots.txt file in Java

I am new to Java. I want to make a simple web crawler. How do I access a website's robots.txt file in Java? Actually, I don't know much about robots.txt. Please help me out.
Toukir Naim • 161 • 1 • 7
-1 votes, 1 answer

Using robots.txt to exclude one specific user-agent and allowing all others?

It sounds like a simple question. Exclude the Wayback Machine crawler (ia_archiver) and allow all other user agents. So I set up the robots.txt as follows: User-agent: * Sitemap: https://www.example.com/sitemap.xml User-agent: ia_archiver Disallow:…
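A grouping like the following keeps the intent unambiguous: compliant parsers pick the most specific matching User-agent record regardless of order, and Sitemap is a standalone directive that belongs to no record:

User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow:

Sitemap: https://www.example.com/sitemap.xml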
Avatar • 14,622 • 9 • 119 • 198
-1 votes, 1 answer

SemrushBot cannot be stopped

In the last few days I was monitoring my website logs and saw a bot that is scanning me a lot. It scans very frequently, once every 5-10 seconds. I tried to block the bot by writing the following rules into robots.txt, but after 1…
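SemrushBot states that it obeys rules like the following, though it can take a while to re-read robots.txt; if a bot keeps ignoring the file, a server-level block on the User-Agent header is the fallback. A sketch of both (the Apache part is hypothetical and assumes mod_rewrite):

User-agent: SemrushBot
Disallow: /

# Apache .htaccess fallback:
RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} SemrushBot [NC]
RewriteRule . - [F,L]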
kevx • 79 • 9
-1 votes, 1 answer

Stop some subdomains (xxx.xxx.com) of a single website from being indexed by search engines

I need to stop the indexing of some subdomains of the same site, for example: aaa.xxx.com (no indexing), bbb.xxx.com (no indexing), www.xxx.com (it should be indexed). All the subdomains are under the same domain. How can we achieve that?
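robots.txt is per host, so each subdomain serves its own file at its own root; the blocked hosts get a deny-all file while www keeps a permissive one:

# served at aaa.xxx.com/robots.txt and bbb.xxx.com/robots.txt
User-agent: *
Disallow: /

# served at www.xxx.com/robots.txt
User-agent: *
Disallow:

Note that this stops crawling; to remove pages that are already indexed, a noindex header or meta tag on those subdomains is the stronger tool.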
-1 votes, 1 answer

stop my "google sites CMS" website from being indexed on google?

I used Google Sites just for fun a while ago; unfortunately, the website became viewable on Google when I search my name. What do I want? I want to make it no longer visible when someone searches, but I don't want to delete it 100%, by…
-1 votes, 1 answer

Want to disallow a few URLs with robots.txt

I want to block a few URLs in robots.txt, but I really don't know how to do this. Below I have mentioned the URLs. How should I disallow the dynamic URLs? I would really appreciate it if you help me get rid of these…
-1 votes, 1 answer

What does this robots.txt mean?

There's a website that I need to crawl; I have no financial purpose, just study. I checked the robots.txt and it was as follows: User-agent: * Allow: / Disallow: /*.notfound.html Can I crawl this website using requests and BeautifulSoup? I…
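The quoted rules can be checked programmatically before crawling; a minimal sketch with Python's standard urllib.robotparser, feeding it the rules directly (the test URL is hypothetical):

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Allow: /",
    "Disallow: /*.notfound.html",
])
# Allowed by "Allow: /":
print(rp.can_fetch("*", "https://example.com/somepage.html"))  # True

One caveat: urllib.robotparser matches paths as plain prefixes and does not expand * wildcards, so the .notfound.html rule needs checking by hand.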
lilak0110 • 3 • 1