For context, I'm using TruffleHog to scan the repos in my GitHub Enterprise organisation for hardcoded secrets. I use the GitHub CLI to retrieve the list of all repositories in the organisation by running:
gh repo list [github-internal-org-name]
TruffleHog can scan a repo cloned to the local filesystem, but it is much faster to authenticate to GitHub Enterprise and scan the repo directly on the remote server. So I iterate through the list of repositories I got with gh above and simply point TruffleHog at each individual repo, instead of doing "clone + scan + delete repo" at each iteration of the loop.
Now, the points to keep in mind are:
- I'm automating this process: a pipeline/cron job runs this script, say once a month
- Each month the list of repos is built again (repos might have been added or deleted, or new commits pushed, in the meantime)
Here's the code, if it helps to visualise what I'm doing so far:
gh repo list -L 9999 ${gh_org} >> $repo_list
echo "Scan started at $(date '+%Y/%m/%d %H:%M')"
while IFS= read -r line
do
    repo_url=$(echo "$line" | awk '{print $1}')
    full_repo_url="https://${gh_org_domain}/${repo_url}"
    echo "Scanning repository ${repo_url}..."
    trufflehog github --json --token "$GH_PIPELINE_SA_PAT" --endpoint "https://${gh_org_domain}" --repo "${full_repo_url}" >> "$raw_scan_file" 2>> "$th_errors_file"
done < "$repo_list"
echo "Scan finished at $(date '+%Y/%m/%d %H:%M')"
All good so far. Now I want to optimise things and avoid scanning each repo in its entirety every month. The first full scan is fine, but after that I only want to look at the commit delta: start from the latest commit at the time of last month's scan and go until today. TruffleHog allows me to do this via the --since-commit=SHA option.
Here's the main question: how can I get that SHA? Given the date of my previous scan, how can I retrieve, for each repo in my list, the first commit on that date? (I know this could result in a bit of overlap at the beginning of the day, but that's fine.) Considerations:
- I don't want to clone, as that defeats the purpose of speed (otherwise I could use git log --after=[date] and awk on the cloned repo to get that first SHA)
- I know git ls-remote can output refs & SHAs from a remote repo, but it doesn't support specifying a date
- My first idea was a "database" of sorts: during each iteration, save the latest commit SHA next to its corresponding repo URL inside the $repo_list file; the next month, start from there, then replace the SHA with the "new" latest, and so on
- However, how would I then keep the repo file up to date with repos added or removed in the last month? It seems I do need to re-build it from scratch each time
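To make the "database" idea above more concrete, here is a minimal sketch, assuming the SHAs are kept in a separate state file rather than inside $repo_list. That way the repo list can still be rebuilt from scratch each month, and the state is only consulted per repo. The helper names (lookup_sha, since_arg, record_sha, head_sha) and the "<owner/repo> <sha>" file layout are hypothetical, made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical state file layout: one "<owner/repo> <sha>" pair per line.

# Print the SHA recorded for a repo, or nothing if the repo is new
# (or if no state file exists yet, e.g. on the very first run).
lookup_sha() {
    local repo="$1" state_file="$2"
    [ -f "$state_file" ] || return 0
    awk -v r="$repo" '$1 == r { print $2; exit }' "$state_file"
}

# Print "--since-commit=<sha>" for a known repo, or nothing for a new one,
# so that new repos fall back to a full scan.
since_arg() {
    local sha
    sha=$(lookup_sha "$1" "$2")
    if [ -n "$sha" ]; then
        printf -- '--since-commit=%s' "$sha"
    fi
}

# Append "<repo> <sha>" to the state file being built for the next run.
record_sha() {
    local repo="$1" sha="$2" next_state_file="$3"
    printf '%s %s\n' "$repo" "$sha" >> "$next_state_file"
}

# Current HEAD SHA of a remote repo without cloning it.
# Assumes git can already authenticate against the remote.
head_sha() {
    git ls-remote "$1" HEAD | awk '{ print $1 }'
}
```

Inside the existing loop, this might be used by appending $(since_arg "$repo_url" "$state_file") to the trufflehog command (assuming the subcommand in use accepts --since-commit), calling record_sha "$repo_url" "$(head_sha "$full_repo_url")" "$state_file.new" for each repo, and finishing with mv "$state_file.new" "$state_file" after the loop. Because the new state file is written only for repos in the freshly built list, removed repos drop out on their own and added repos simply get a full first scan.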
I appreciate this might have turned into a program design question, but any suggestions on an approach would be amazing!