
I want to disallow crawling of the directory /acct in robots.txt. Which rule should I use?

`Disallow: /acct` or `Disallow: /acct/`

/acct contains both sub-directories and files. What is the effect of the trailing slash?

– marcg

1 Answer


Since robots.txt rules are all "starts with" rules, both of your proposed rules would disallow the following:

  • https://example.com/acct/
  • https://example.com/acct/foo
  • https://example.com/acct/bar

However, the following would only be disallowed by the rule without the trailing slash:

  • https://example.com/acct
  • https://example.com/acct.html
  • https://example.com/acctbar
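You can verify this "starts with" behavior yourself. Here is a minimal sketch using Python's standard-library `urllib.robotparser`, which implements the original wildcard-free spec (the `blocked` helper and the example.com URLs are just for illustration):

```python
# Minimal sketch: check which URLs each rule disallows using Python's
# standard-library parser, which implements "starts with" matching.
from urllib.robotparser import RobotFileParser

def blocked(rule, url):
    """True if `url` is disallowed by a robots.txt containing only `rule`."""
    parser = RobotFileParser()
    parser.parse(["User-agent: *", rule])
    return not parser.can_fetch("*", url)

for url in ["https://example.com/acct",
            "https://example.com/acct/",
            "https://example.com/acct/foo",
            "https://example.com/acct.html",
            "https://example.com/acctbar"]:
    print(url.ljust(32),
          "Disallow: /acct ->", blocked("Disallow: /acct", url),
          "| Disallow: /acct/ ->", blocked("Disallow: /acct/", url))
```

The rule without the slash disallows all five URLs; the rule with the slash disallows only the two under the directory.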

`Disallow: /acct/` is usually better because there is no risk of disallowing unexpected URLs. However, it does NOT prevent crawling of `/acct` itself.

In most cases, web servers redirect directory URLs without a trailing slash to the version with the slash. It is likely that on your server, https://example.com/acct redirects to https://example.com/acct/. If that is the case, it is usually fine to allow bots to crawl /acct with no trailing slash and see the redirect; they would still be blocked from crawling the target of that redirect.
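If you want to confirm what your own server does, one quick check (a sketch only; example.com stands in for your host) is to request the slash-less URL without following redirects and inspect the Location header:

```python
# Sketch: see whether /acct answers directly or redirects to /acct/.
# example.com is a placeholder for your own host.
import urllib.error
import urllib.request

class NoRedirect(urllib.request.HTTPRedirectHandler):
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # returning None makes urllib raise instead of following

opener = urllib.request.build_opener(NoRedirect)
try:
    response = opener.open("https://example.com/acct")
    print(response.getcode())  # served directly, no redirect
except urllib.error.HTTPError as err:
    # an unfollowed 301/302 surfaces here as an HTTPError
    print(err.code, "->", err.headers.get("Location"))
```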

– Stephen Ostermiller
  • I have an Angular application. /acct shows the Dashboard page, which is sort of the index.html. From there the user can go to /acct/page1, /acct/page2, and so on. Is it important to have the URL as /acct/, or is /acct fine from an SEO perspective? The days when a directory held a bunch of static HTML pages are long gone. Also, if all my REST calls begin with /api and I want to prevent robots from crawling them, would Disallow: /api be enough, or /api/*? Are /api/ and /api/* the same? – marcg May 04 '22 at 16:11
  • 1
    Using either `/acct/` or `/acct` is fine as long as you are consistent about it. One should redirect to the other. It's traditional to use the one with the slash, but that isn't a hard and fast rule. – Stephen Ostermiller May 04 '22 at 16:17
  • 1
    `Disallow /api/` is better than `Disallow: /api/*` By default rules are "starts with" rules. The original spec didn't allow wild cards at all in `robots.txt`. Most clients still don't process wildcards. Only more advanced clients like search engine spiders can handle those. At best a wildcard at the end is redundant. At worst it will prevent the rule from being understood. – Stephen Ostermiller May 04 '22 at 16:20
  • Just for anyone who stumbles upon this: to remove the trailing slash site-wide, I used the nginx rule `location ~ ^(.+)/$ { return 301 $1$is_args$args; }` from https://serverfault.com/questions/597302/removing-the-trailing-slash-from-a-url-with-nginx – marcg May 04 '22 at 16:58
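To illustrate the wildcard caveat from the comments: the same standard-library parser used above follows the original wildcard-free spec, so it takes `*` in a path literally. A small sketch (with `/api` as a placeholder path):

```python
# Sketch: a parser without wildcard support treats "*" as a literal
# character, so a trailing wildcard can make the rule match nothing.
from urllib.robotparser import RobotFileParser

def blocked(rule, url):
    parser = RobotFileParser()
    parser.parse(["User-agent: *", rule])
    return not parser.can_fetch("*", url)

print(blocked("Disallow: /api/", "https://example.com/api/users"))   # True
print(blocked("Disallow: /api/*", "https://example.com/api/users"))  # False: "*" taken literally
```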