Loading a TAR file and extracting its *bz2 contents to sdout with bzcat

Question

Following this question, I'm attempting to load a 40 GB TAR file with bz2-compressed json files into PostgreSQL in an effecient manner.

As per the answer mentioned above, I'm trying to separate the process and use external tools to create the following flow.

Open & Extract the files to SDOUT with TAR (bsdtar in this case, as TAR does not include extracting in its Windows build), only *.bz2 files.
Extract the *BZ2 files with a call to bzcat (export to sdout)
Open this in my python script 'file_handling', which maps each incoming line to a tweet and outputs this as a csv to stdout
Pipe this to PSQL to load it into one COPY command.

I'm currently getting an error when arriving at bzcat, this is what I have to build line that executes the above:

pipeline = [filename[1:3] + " && ",  # Change drive to H so that TAR can find the file without a drive name (doesn't like absolute paths, apparently).
            '"C:\\Tools\\GnuWin32\\gnuwin32\\bin\\bsdtar" vxOf ' + filename_nodrive + ' "*.bz2"',  # Call to tar, outputs to stdin
            " | C:\\Tools\\GnuWin32\\gnuwin32\\bin\\bzcat.exe"#,  # Forward its output to bzcat
            ' | python "D:\Cloud\Dropbox\Coding\GitHub\pyTwitter\pyTwitter_filehandling.py"', # Extract Tweets
            ' | "C:\Program Files\PostgreSQL\9.4\bin\psql.exe" -1f copy.sql ' + secret_login_d
           ]
module_call = "".join(pipeline)
module_call = "H: && "C:\Tools\GnuWin32\gnuwin32\bin\bsdtar" vxOf "Twitter datastream/Sourcefiles/archiveteam-twitter-stream-2013-01.tar" "*.bz2" | C:\Tools\GnuWin32\gnuwin32\bin\bzcat.exe | python "D:\Cloud\Dropbox\Coding\GitHub\pyTwitter\pyTwitter_filehandling.py" | "C:\Program Files\PostgreSQL\9.4in\psql.exe" -1f copy.sql "user=xxx password=xxx host=localhost port=5432 dbname=xxxxxx""

When executing the code for TAR, the TAR file is outputted to the CMD prompt, hinting me that all is well. However, the bzcat line brings an error:

x 01/29/06/39.json.bz2
bzcat.exe: Data integrity error when decompressing.
    Input file = (stdin), output file = (stdout)

It is possible that the compressed file(s) have become corrupted.
You can use the -tvv option to test integrity of such files.

Running -tvv gives me:

huff+mtf data integrity (CRC) error in data

I've tried to extract the same archive with 7-zip (GUI): this still works. Any help on how to troubleshoot this would be greatly appreciated. I'm running Windows 8.1 with GNUWin32.

Thanks @MattV for this post! Very useful for my work. Could you please share the python code of "file_handling" so that I can try this method on the archive? Perhaps via github. Thanks! — Huanfa Chen, Dec 20 '16 at 13:24
This was a long time ago, hope this still makes sense: https://gist.github.com/MVersteeg/bb4a754823e7609bd204acbdb479d37d — MattV, Dec 20 '16 at 17:18
Thanks @MattV for the great Python code. Just one more question: do you have the copy.sql file? I have tried to write the copy.sql for a long time, but always get the result of "COPY 0". — Huanfa Chen, Jan 05 '17 at 15:49

score 2 · Accepted Answer · answered Jan 14 '15 at 04:21

bsdtar.exe is translating newline bytes in the file data into the DOS CRLF sequence resulting in a corrupted bzip2 output stream.

GNU tar worked when using relative paths but it does not handle absolute paths in Windows.

Your best bet is to use 7-zip instead:

7z.exe x -so -ir!*.json.bz2 archive.tar | bzcat | ...

Loading a TAR file and extracting its *bz2 contents to sdout with bzcat

1 Answers1