2

Following this question, I'm attempting to load a 40 GB TAR file with bz2-compressed json files into PostgreSQL in an effecient manner.

As per the answer mentioned above, I'm trying to separate the process and use external tools to create the following flow.

  • Open & Extract the files to SDOUT with TAR (bsdtar in this case, as TAR does not include extracting in its Windows build), only *.bz2 files.
  • Extract the *BZ2 files with a call to bzcat (export to sdout)
  • Open this in my python script 'file_handling', which maps each incoming line to a tweet and outputs this as a csv to stdout
  • Pipe this to PSQL to load it into one COPY command.

I'm currently getting an error when arriving at bzcat, this is what I have to build line that executes the above:

pipeline = [filename[1:3] + " && ",  # Change drive to H so that TAR can find the file without a drive name (doesn't like absolute paths, apparently).
            '"C:\\Tools\\GnuWin32\\gnuwin32\\bin\\bsdtar" vxOf ' + filename_nodrive + ' "*.bz2"',  # Call to tar, outputs to stdin
            " | C:\\Tools\\GnuWin32\\gnuwin32\\bin\\bzcat.exe"#,  # Forward its output to bzcat
            ' | python "D:\Cloud\Dropbox\Coding\GitHub\pyTwitter\pyTwitter_filehandling.py"', # Extract Tweets
            ' | "C:\Program Files\PostgreSQL\9.4\bin\psql.exe" -1f copy.sql ' + secret_login_d
           ]
module_call = "".join(pipeline)
module_call = "H: && "C:\Tools\GnuWin32\gnuwin32\bin\bsdtar" vxOf "Twitter datastream/Sourcefiles/archiveteam-twitter-stream-2013-01.tar" "*.bz2" | C:\Tools\GnuWin32\gnuwin32\bin\bzcat.exe | python "D:\Cloud\Dropbox\Coding\GitHub\pyTwitter\pyTwitter_filehandling.py" | "C:\Program Files\PostgreSQL\9.4in\psql.exe" -1f copy.sql "user=xxx password=xxx host=localhost port=5432 dbname=xxxxxx""

When executing the code for TAR, the TAR file is outputted to the CMD prompt, hinting me that all is well. However, the bzcat line brings an error:

x 01/29/06/39.json.bz2
bzcat.exe: Data integrity error when decompressing.
    Input file = (stdin), output file = (stdout)

It is possible that the compressed file(s) have become corrupted.
You can use the -tvv option to test integrity of such files.

Running -tvv gives me:

huff+mtf data integrity (CRC) error in data

I've tried to extract the same archive with 7-zip (GUI): this still works. Any help on how to troubleshoot this would be greatly appreciated. I'm running Windows 8.1 with GNUWin32.

Community
  • 1
  • 1
MattV
  • 1,353
  • 18
  • 42
  • Thanks @MattV for this post! Very useful for my work. Could you please share the python code of "file_handling" so that I can try this method on the archive? Perhaps via github. Thanks! – Huanfa Chen Dec 20 '16 at 13:24
  • This was a long time ago, hope this still makes sense: https://gist.github.com/MVersteeg/bb4a754823e7609bd204acbdb479d37d – MattV Dec 20 '16 at 17:18
  • Thanks @MattV for the great Python code. Just one more question: do you have the copy.sql file? I have tried to write the copy.sql for a long time, but always get the result of "COPY 0". – Huanfa Chen Jan 05 '17 at 15:49

1 Answers1

2

bsdtar.exe is translating newline bytes in the file data into the DOS CRLF sequence resulting in a corrupted bzip2 output stream.

GNU tar worked when using relative paths but it does not handle absolute paths in Windows.

Your best bet is to use 7-zip instead:

7z.exe x -so -ir!*.json.bz2 archive.tar | bzcat | ...
user2313838
  • 270
  • 1
  • 4