
I am trying to analyze monthly Wikimedia pageview statistics. The daily dumps are fine, but monthly reports like the one for June 2021 (https://dumps.wikimedia.org/other/pageview_complete/monthly/2021/2021-06/pageviews-202106-user.bz2) seem to be broken:

[radim@sandbox2 pageviews]$ bzip2 -t pageviews-202106-user.bz2 
bzip2: pageviews-202106-user.bz2: bad magic number (file not created by bzip2)

You can use the `bzip2recover' program to attempt to recover
data from undamaged sections of corrupted files.

[radim@sandbox2 pageviews]$ file pageviews-202106-user.bz2 
pageviews-202106-user.bz2: Par archive data

Any idea how to extract the data? What format is this? Could it be a Parquet file from their Hive analytics cluster?
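(A quick way to confirm what the container actually is, in case it helps: Parquet files begin with the four ASCII bytes PAR1, while bzip2 streams begin with BZh, so something like

$ head -c 4 pageviews-202106-user.bz2 | xxd

should settle it. I have not shown the output here.)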

Radim
  • These dumps were generated as Parquet files instead of bz2 files. It is a bug that has been reported and will be fixed soon: https://phabricator.wikimedia.org/T287684 – Wences Aug 26 '21 at 07:33

1 Answer


These files are not bzip2 archives; they are Parquet files. The parquet-cli tools can be used to inspect them:

$ java -cp 'target/*:target/dependency/*' org.apache.parquet.cli.Main schema /tmp/pageviews-202106-user.bz2 2>/dev/null 
{
  "type" : "record",
  "name" : "hive_schema",
  "fields" : [ {
    "name" : "line",
    "type" : [ "null", "string" ],
    "default" : null
  } ]
}
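To extract the actual data rather than just the schema, the same CLI can dump the records. A sketch assuming the same classpath as above and that jq is available (parquet-cli's cat subcommand prints each record as a JSON object, so pulling out the single line field recovers the original plain-text pageview lines; treat the exact output shape as an assumption on my part):

$ java -cp 'target/*:target/dependency/*' org.apache.parquet.cli.Main cat /tmp/pageviews-202106-user.bz2 2>/dev/null \
    | jq -r '.line' > pageviews-202106-user.txt

Any other Parquet reader (Spark, pyarrow, pandas.read_parquet) should also open the file despite the misleading .bz2 extension.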
Radim