I want to download all files with a specific extension from GitHub, and I am using the GitHub archive on BigQuery to do this.

With some help, I already have this query, which mostly works:

SELECT
  f.repo_name, f.path, content.copies, content.size, content.content, lic.license, lang_table.language_name
FROM
  `bigquery-public-data.github_repos.files` AS f
JOIN
  `bigquery-public-data.github_repos.contents` AS content
ON
  f.id = content.id
JOIN
  `bigquery-public-data.github_repos.licenses` AS lic
ON
  f.repo_name = lic.repo_name
JOIN
  -- One row per (repository, language) pair
  (SELECT l.repo_name, lang.name AS language_name
   FROM `bigquery-public-data.github_repos.languages` AS l, UNNEST(l.language) AS lang) AS lang_table
ON
  f.repo_name = lang_table.repo_name
WHERE
  NOT content.binary
  AND (
    f.path LIKE '%.po' OR f.path LIKE '%.pot' OR f.path LIKE '%.POT' OR f.path LIKE '%.PO'
  )

Unfortunately, I think it returns many duplicates of each file, because most projects have many forks. I would like to ignore all forks (or at least those with no activity) so that I do not get these duplicates.
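To make the goal concrete, here is a rough sketch of one possible workaround. It does not detect forks as such; it only keeps a single row per distinct `id`, under the assumption that rows in `files` sharing an `id` point to byte-identical content. The CTE name `po_files` is made up for illustration, the license and language joins are left out for brevity, and the case-insensitive regex is slightly broader than the four `LIKE` patterns above (it would also match mixed case such as `.Po`).

```sql
-- Sketch only: collapse identical copies by keeping one (repo_name, path) per id.
-- Assumption: the same `id` in `files` always means the same file content.
WITH po_files AS (
  SELECT id, repo_name, path
  FROM `bigquery-public-data.github_repos.files`
  -- (?i) makes the match case-insensitive; matches paths ending in .po or .pot
  WHERE REGEXP_CONTAINS(path, r'(?i)\.pot?$')
  -- Keep one representative row per id, chosen deterministically
  QUALIFY ROW_NUMBER() OVER (PARTITION BY id ORDER BY repo_name, path) = 1
)
SELECT
  f.repo_name, f.path, content.copies, content.size, content.content
FROM
  po_files AS f
JOIN
  `bigquery-public-data.github_repos.contents` AS content
ON
  f.id = content.id
WHERE
  NOT content.binary
```

Note that the language join in the original query returns one row per language in a repository, so it would multiply rows again if added back without aggregating the languages first. Files that were edited in a fork get a different `id` and would still appear separately.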

I would also be interested in any other way to get all of these files. (There is an open machine learning training dataset called "The Pile" that contains GitHub data, but unfortunately the file extensions in it are corrupted by a bug.)
