What is the easiest way to clone only files matching a certain pattern, e.g. Java files, from a repository? I can do a git clone followed by a git checkout with some pattern, but this will result in downloading unnecessary files, potentially a large number of them, which I want to avoid.

Context: I am building a tool that downloads code from a large number of repositories to train a machine learning model. I plan to download thousands or tens of thousands of repositories, so I want to download only the files I need, both to speed up the process and to reduce the likelihood of being throttled by GitHub.

Thanks, Rafid

Rafid
  • You could try a shallow clone, but then every commit is a snapshot of the entire repo. – evolutionxbox Apr 21 '22 at 07:12
  • Yeah, I thought of shallow clone, but still, that would include a large number of unnecessary files. I basically want to be able to tell git to clone, say, all Java files from the latest commit of the main branch of repository X. – Rafid Apr 21 '22 at 07:14
  • Git has a new (and not yet very user-friendly or efficient) feature called a *partial clone*, where you tell Git not to bother copying some objects *yet*. Git instead records a placeholder for each such object (in your case, the "blob" objects that hold files, and perhaps the "tree" objects that hold the files' names as well) and only fetches the underlying objects later, if and when something calls for them. Making a filtered partial depth-1 clone and then individual (or sparse) checkouts gets you there ... except [continued] – torek Apr 21 '22 at 07:53
  • [cont'd] as currently implemented, Git winds up making one connection per filtered object, which is incredibly inefficient and will probably cause GitHub to throttle you. So this is aimed at what you need, but you'll need to work on Git to make it more batch-y in its fetching. (Someone on the Git mailing list is actively working on this now, so you could coordinate with them.) – torek Apr 21 '22 at 07:54
  • Interesting. I am curious why fetching only some of the objects (blobs or trees) is more likely to be throttled than fetching all of them. Do non-partial clones compress the entire thing, or something? – Rafid Apr 23 '22 at 06:44
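
For reference, the combination described in the comments (a filtered, depth-1 partial clone followed by a sparse checkout) could be sketched roughly as below. This is only a sketch under assumptions: the function names `build_sparse_clone_commands` and `sparse_clone` are made up for illustration, and it assumes a recent Git that supports `git sparse-checkout set --no-cone`. It does not address the per-object fetching concern raised above; how many requests Git ends up making for the missing blobs depends on the Git version.

```python
import subprocess
from typing import List


def build_sparse_clone_commands(
    repo_url: str, dest: str, patterns: List[str]
) -> List[List[str]]:
    """Build the git invocations for a filtered, depth-1, sparse clone.

    Hypothetical helper: it only assembles argument lists and runs nothing.
    """
    if not patterns:
        raise ValueError("at least one pattern is required")
    return [
        # Latest commit only, no file contents (blobs) yet, no working tree yet.
        ["git", "clone", "--depth=1", "--filter=blob:none", "--no-checkout",
         repo_url, dest],
        # Restrict the working tree to paths matching gitignore-style patterns.
        ["git", "-C", dest, "sparse-checkout", "set", "--no-cone", *patterns],
        # Populate the working tree; missing blobs are requested at this point.
        ["git", "-C", dest, "checkout"],
    ]


def sparse_clone(repo_url: str, dest: str, patterns: List[str]) -> None:
    """Run the commands in order, stopping at the first failure."""
    for cmd in build_sparse_clone_commands(repo_url, dest, patterns):
        subprocess.run(cmd, check=True)
```

Calling `sparse_clone(url, "some-dir", ["*.java"])` with a real repository URL would then attempt to check out only the Java files from the tip of the default branch.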

0 Answers