The context here is that I'm trying to use TruffleHog to scan a list of repos in my GitHub Enterprise organisation for hardcoded secrets. I'm using the GitHub CLI to retrieve the list of all repositories in the organisation by running:

gh repo list [github-internal-org-name]

TruffleHog can scan a repo locally on the filesystem, but it is much faster to authenticate to GitHub Enterprise and scan the repo directly on the remote server. I iterate through the list of repositories I got with gh above and simply point TruffleHog at each individual repo, instead of doing "clone + scan + delete repo" at each iteration of the loop.

Now, the points to keep in mind are:

  • I'm automating this process: a pipeline/cron job runs this script, say, once a month
  • Each month the list of repos is built again (repos might have been added or deleted, or new commits pushed, in the meantime)

Here's the code if it helps to visualise better what I'm doing so far:

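# Rebuild the repo list from scratch on every run ('>' overwrites, '>>' would append duplicates)
gh repo list -L 9999 "${gh_org}" > "$repo_list"
echo "Scan started at $(date '+%Y/%m/%d %H:%M')"
while IFS= read -r line
do
    # The first column of the gh output is the repository name (org/repo)
    repo_url=$(echo "$line" | awk '{print $1}')
    full_repo_url="https://${gh_org_domain}/${repo_url}"
    echo "Scanning repository ${repo_url}..."

    # Scan the remote repo directly: findings go to the raw scan file, errors to a separate log
    trufflehog github --json --token "$GH_PIPELINE_SA_PAT" --endpoint "https://${gh_org_domain}" --repo "${full_repo_url}" >> "$raw_scan_file" 2>> "$th_errors_file"
done < "$repo_list"
echo "Scan finished at $(date '+%Y/%m/%d %H:%M')"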

All good so far. Now I want to optimise things and avoid scanning each repo in its entirety every month. The first time is OK, but after that I only want to look at the commit delta, right? That is, start from the latest commit at the time of last month's scan and go up to today. TruffleHog allows me to do this via the --since-commit=SHA option.

Here's the main question: how could I get that SHA? How can I retrieve, for each repo in my list, the first commit on the date of my previous scan? (I know this could result in a bit of overlap at the beginning of that day, but that's fine.) Considerations:

  • Don't want to clone - defeats the purpose of speed (otherwise I could use git log --after=[date] and awk on the cloned repo to get that first SHA)
  • I know git ls-remote could output commits & SHAs from a remote repo but it doesn't support specifying a date
  • The first idea was a "database" of sorts, during each iteration save the latest commit SHA next to its corresponding repo URL inside the $repo_list file; the next month, start from there & then replace the SHA with the "new" latest, and so on.
  • However, how would I now keep the repo file up to date with added/removed repos in the last month? So I do need to re-build it from scratch each time

Appreciate this might have turned into a program design question, but any suggestions on an approach would be amazing!

Emilian Cebuc
Note: there are custom patterns which can be scanned on GHE side. See [Secret scanning emits audit log events for custom pattern push protection enablement](https://github.blog/changelog/2023-01-05-secret-scanning-emits-audit-log-events-for-custom-pattern-push-protection-enablement/) – VonC Jan 17 '23 at 08:29

1 Answer

I know git ls-remote could output commits & SHAs from a remote repo but it doesn't support specifying a date

That is why your database-like idea has merit: there is no need for dates. As long as git ls-remote reports different SHA1s, something has changed in the repository, and a scan is needed.
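A minimal sketch of that comparison, not a definitive implementation. It assumes a hypothetical state file with one "<org/repo> <sha>" pair per line, where a missing entry or a "-" placeholder means the repository was never scanned. The function names are made up for illustration. remote_head_sha only looks at HEAD (the default branch) and assumes git can already authenticate against your GitHub Enterprise host.

```shell
#!/usr/bin/env bash
# Hypothetical state file format: "<org/repo> <sha>" per line.

# Print the SHA recorded for a repo, or nothing if the repo is unknown.
stored_sha() {
    local state_file=$1 repo=$2
    awk -v r="$repo" '$1 == r { print $2; exit }' "$state_file" 2>/dev/null
}

# Print the SHA that HEAD currently points to on the remote (no clone needed).
remote_head_sha() {
    local repo_url=$1
    git ls-remote "$repo_url" HEAD | awk '{ print $1; exit }'
}

# Decide what to do with a repo, given the SHA currently reported by the remote:
#   "full"        -> repo has no recorded SHA yet (missing entry or "-")
#   "skip"        -> recorded SHA equals the current one, nothing changed
#   "since <sha>" -> something changed; <sha> is the previously recorded commit
scan_plan() {
    local state_file=$1 repo=$2 current_sha=$3 previous_sha
    previous_sha=$(stored_sha "$state_file" "$repo")
    if [ -z "$previous_sha" ] || [ "$previous_sha" = "-" ]; then
        echo "full"
    elif [ "$previous_sha" = "$current_sha" ]; then
        echo "skip"
    else
        echo "since $previous_sha"
    fi
}

# Possible use inside the existing loop ($state_file is an assumed variable):
#   current=$(remote_head_sha "$full_repo_url")
#   plan=$(scan_plan "$state_file" "$repo_url" "$current")
```

When the plan is "since <sha>", that SHA is the value you would hand to the --since-commit option mentioned in the question. If commits on non-default branches matter too, comparing the whole output of git ls-remote --heads instead of the single HEAD line would be one way to detect them.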

how would I now keep the repo file up to date with added/removed repos in the last month? So I do need to re-build it from scratch each time

You would read $repo_list on one hand and, on the other, the list of repositories recorded in your "database" (a simple text file with each repository name and its ls-remote result).

You would then rebuild that text file after reconciling the repository names: remove the ones no longer present in $repo_list and add the new ones.
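The reconciliation could look roughly like this. It is a sketch under assumptions: the "database" is a hypothetical text file with "<org/repo> <sha>" per line, the fresh list has the repository name in its first column (as in the output of gh repo list), and reconcile_state is a name invented for illustration.

```shell
#!/usr/bin/env bash
# Rebuild the state from the fresh repo list.
#   $1: fresh list, repository name in the first column
#   $2: previous state file, "<org/repo> <sha>" per line (may not exist yet)
# Prints the new state on stdout:
#   - repos still in the fresh list keep their recorded SHA
#   - repos new to the list get "-" as a "never scanned" placeholder
#   - repos no longer in the fresh list are dropped
reconcile_state() {
    local fresh_list=$1 old_state=$2
    # On the very first run there is no state yet: read an empty file instead.
    [ -f "$old_state" ] || old_state=/dev/null
    awk '
        # First file (old state): remember the recorded SHA per repo.
        FILENAME == ARGV[1] { sha[$1] = $2; next }
        # Second file (fresh list): emit one state line per non-empty line.
        NF { print $1, (($1 in sha) ? sha[$1] : "-") }
    ' "$old_state" "$fresh_list"
}

# Possible use at the start of the monthly run ($state_file is an assumed variable):
#   reconcile_state "$repo_list" "$state_file" > "$state_file.new" \
#       && mv "$state_file.new" "$state_file"
```

Writing to a temporary file and renaming it afterwards avoids truncating the state file while it is still being read. After each successful scan you would replace the repo's SHA in that file with the one git ls-remote reported for this run.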

VonC