I want to download all files with a specific extension from GitHub, and I am using the GitHub archive on BigQuery to do this.
With some help, I already have this query, which mostly works:
SELECT
  f.repo_name,
  f.path,
  content.copies,
  content.size,
  content.content,
  lic.license,
  lang_table.language_name
FROM
  `bigquery-public-data.github_repos.files` AS f
JOIN
  `bigquery-public-data.github_repos.contents` AS content
ON
  f.id = content.id
JOIN
  `bigquery-public-data.github_repos.licenses` AS lic
ON
  f.repo_name = lic.repo_name
JOIN (
  SELECT
    repo_name,
    lang.name AS language_name
  FROM
    `bigquery-public-data.github_repos.languages` AS lang_table,
    UNNEST(language) AS lang
) lang_table
ON
  f.repo_name = lang_table.repo_name
WHERE
  NOT content.binary
  AND (
    (f.path LIKE '%.po') OR (f.path LIKE '%.pot') OR (f.path LIKE '%.POT') OR (f.path LIKE '%.PO')
  )
Unfortunately, I think it returns many duplicates of each file, because each project has many forks. I would like to ignore all forks (or at least the ones with no activity) so that each file is returned only once.
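One direction I have considered, as a rough sketch rather than a confirmed solution: as far as I can tell, the `github_repos` tables do not have a column that marks a repository as a fork, so instead of filtering forks this keeps a single row per blob `id`. It assumes that a fork's copy of a file is byte-identical to the original and therefore has the same `id`; files that were modified in a fork would still show up as separate rows. The `rn` column is a name I made up for the illustration, and I left out the license and language joins to keep it short (the `UNNEST` over languages also produces one row per language per repository, which multiplies the rows further).

```sql
-- Sketch: keep one row per distinct blob id so that identical copies
-- of a file in forked repositories collapse into a single result row.
-- Assumption: forked copies are byte-identical and share the same `id`.
SELECT
  repo_name,
  path,
  copies,
  size,
  content
FROM (
  SELECT
    f.repo_name,
    f.path,
    content.copies,
    content.size,
    content.content,
    -- Number the rows within each blob id; which repo gets rn = 1 is
    -- arbitrary here (alphabetically first repo_name, then path).
    ROW_NUMBER() OVER (PARTITION BY f.id ORDER BY f.repo_name, f.path) AS rn
  FROM
    `bigquery-public-data.github_repos.files` AS f
  JOIN
    `bigquery-public-data.github_repos.contents` AS content
  ON
    f.id = content.id
  WHERE
    NOT content.binary
    AND (
      (f.path LIKE '%.po') OR (f.path LIKE '%.pot') OR (f.path LIKE '%.POT') OR (f.path LIKE '%.PO')
    )
)
WHERE
  rn = 1
```

This does not tell me which repository is the original and which is the fork, so it only addresses the duplicates, not the "ignore inactive forks" part.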
I would also be interested in any other way to get all of these files. (There is also an open machine learning training dataset containing GitHub data, called "The Pile", but unfortunately the file extensions in it are corrupted by a bug.)