
I am trying to block our job board from being crawled. Can a specific URL be blocked with "Disallow" in a robot.txt file? And what would that look like for this URL? I don't want to just Disallow HTML, only the URL jobs.example.com:

Disallow: https://jobs.example.com/
  • The robots.txt is a voluntarily read file. Google's spider respects it; not many others will. – mplungjan Aug 31 '23 at 13:45
  • Do you want to simply indicate (suggest) to crawlers that they *shouldn't* access that page? Or do you want to actually *prevent* clients from accessing that page without some kind of authorization? These are two very different things. – David Aug 31 '23 at 13:48
  • Outgoing links from your domain that shouldn't be followed should have a `rel="nofollow"` attribute on the a tag; [see the MDN docs for nofollow](https://developer.mozilla.org/en-US/docs/Web/HTML/Attributes/rel#nofollow). Why can't `jobs.lrshealthcare.com` simply have a robots.txt of its own? Oh, and it's `robots.txt`, not `robot.txt`; maybe also correct that. – Peter Krebs Aug 31 '23 at 13:52

2 Answers


You can't put full URLs into robots.txt disallow rules. Your proposed rule won't work as written:

# INCORRECT
Disallow: https://jobs.example.com/

It looks like you might be trying to disallow crawling on the jobs subdomain. That is possible, but each subdomain gets its own robots.txt file. You would have to configure your server to serve different content for each of these URLs (a sketch of one way to do that follows the list):

  • https://example.com/robots.txt
  • https://jobs.example.com/robots.txt
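
A minimal sketch of one way to do that, assuming you can run a small Python handler in front of both hosts; the hostnames, port, and file bodies here are placeholders, and in practice the same logic is usually expressed in your web server's virtual host configuration instead:

from http.server import BaseHTTPRequestHandler, HTTPServer

# Per-host robots.txt bodies; jobs.example.com blocks all crawling.
ROBOTS = {
    "jobs.example.com": "User-Agent: *\nDisallow: /\n",
}
DEFAULT = "User-Agent: *\nDisallow:\n"  # everything else: allow all

class RobotsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/robots.txt":
            self.send_error(404)
            return
        # Pick the body based on the Host header, ignoring any port suffix.
        host = (self.headers.get("Host") or "").split(":")[0]
        body = ROBOTS.get(host, DEFAULT).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("", 8080), RobotsHandler).serve_forever()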

Then your jobs robots.txt should disallow all crawling on that subdomain:

User-Agent: *
Disallow: /

If you are trying to disallow just the home page for that subdomain, you would have to use syntax that only the major search engines understand. A $ means "ends with", and the major search engines will interpret it correctly:

User-Agent: *
Disallow: /$
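
To see why the $ matters, here is a simplified illustration (an assumption for illustration, not any engine's actual code) of how a Googlebot-style matcher treats the two rules; real parsers also handle * wildcards:

def rule_matches(rule: str, path: str) -> bool:
    # "$" anchors the rule to the end of the path ("ends with" semantics).
    if rule.endswith("$"):
        return path == rule[:-1]
    # Default robots.txt matching is a simple prefix match.
    return path.startswith(rule)

assert rule_matches("/$", "/") is True           # home page blocked
assert rule_matches("/$", "/openings") is False  # deeper pages still crawlable
assert rule_matches("/", "/openings") is True    # plain "/" blocks everything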

– Stephen Ostermiller

In order to disallow web crawlers from crawling one specific page (the rule is a prefix match, so anything under that path is also blocked), you can do so with the following lines:

User-agent: *
Disallow: /path/to/page/

Or, to disallow the entire website:

User-agent: *
Disallow: /
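
If you want to sanity-check rules like these, Python's built-in urllib.robotparser evaluates simple prefix rules such as the ones above (the example.com URLs are placeholders):

from urllib.robotparser import RobotFileParser

# Parse the rules directly instead of fetching them from a server.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /path/to/page/",
])

print(rp.can_fetch("*", "https://example.com/path/to/page/"))  # False
print(rp.can_fetch("*", "https://example.com/other/"))         # True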

Note that not all search engines/crawlers will respect that file; robots.txt is advisory, not access control.

– marcobiedermann