
I host a few git repositories at git.nomeata.de using gitweb (and gitolite). Occasionally, a search engine spider comes along and begins to hammer the interface. Since I generally do want my git repositories to show up in search engines, I do not want to block spiders completely, but they should not be able to invoke expensive operations such as snapshotting the archive, searching, or generating diffs.

What is the “best” robots.txt file for such an installation?

Joachim Breitner

1 Answer


I guess this makes a good community wiki. Please extend this robots.txt if you think it can be improved:

User-agent: * 
Disallow: /*a=search*
Disallow: /*/search/*
Disallow: /*a=blobdiff*
Disallow: /*/blobdiff/*
Disallow: /*a=commitdiff*
Disallow: /*/commitdiff/*
Disallow: /*a=snapshot*
Disallow: /*/snapshot/*
Disallow: /*a=blame*
Disallow: /*/blame/*
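
To sanity-check the patterns, here is a rough Python sketch (not authoritative: it only approximates how crawlers that support the `*` wildcard extension, such as Googlebot, match these rules, and the sample URLs are made-up illustrations of gitweb's query-string and path-based link styles):

import re

disallow = [
    "/*a=search*",     "/*/search/*",
    "/*a=blobdiff*",   "/*/blobdiff/*",
    "/*a=commitdiff*", "/*/commitdiff/*",
    "/*a=snapshot*",   "/*/snapshot/*",
    "/*a=blame*",      "/*/blame/*",
]

def pattern_to_regex(pattern):
    # '*' matches any character sequence; everything else is taken literally.
    return re.compile(".*".join(re.escape(part) for part in pattern.split("*")))

rules = [pattern_to_regex(p) for p in disallow]

def blocked(path_and_query):
    # Crawlers match Disallow rules against the path plus the query string.
    return any(rule.match(path_and_query) for rule in rules)

# Expensive operations should be blocked...
assert blocked("/?p=repo.git;a=snapshot;h=HEAD;sf=tgz")
assert blocked("/repo.git/commitdiff/abc123")
# ...while cheap pages such as the summary or the log stay crawlable.
assert not blocked("/?p=repo.git;a=summary")
assert not blocked("/repo.git/log/")
print("all pattern checks passed")

Both URL styles can occur because gitweb generates either classic query-string links or path-style links, depending on whether the pathinfo feature is enabled, which is why the list above contains a `/*/xxx/*` entry for every `/*a=xxx*` one.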
Joachim Breitner
  • I've found that, depending on how gitweb is configured, these URLs are not enough; e.g. in my installation I needed to add, for each `/*a=xxx*` entry, one of the form `/*/xxx/*`. – iustin Mar 14 '15 at 12:43