Following this question, I'm attempting to load a 40 GB TAR file with bz2-compressed json files into PostgreSQL in an effecient manner.
As per the answer mentioned above, I'm trying to separate the process and use external tools to create the following flow.
- Open & Extract the files to SDOUT with TAR (bsdtar in this case, as TAR does not include extracting in its Windows build), only *.bz2 files.
- Extract the *BZ2 files with a call to bzcat (export to sdout)
- Open this in my python script 'file_handling', which maps each incoming line to a tweet and outputs this as a csv to stdout
- Pipe this to PSQL to load it into one COPY command.
I'm currently getting an error when arriving at bzcat, this is what I have to build line that executes the above:
pipeline = [filename[1:3] + " && ", # Change drive to H so that TAR can find the file without a drive name (doesn't like absolute paths, apparently).
'"C:\\Tools\\GnuWin32\\gnuwin32\\bin\\bsdtar" vxOf ' + filename_nodrive + ' "*.bz2"', # Call to tar, outputs to stdin
" | C:\\Tools\\GnuWin32\\gnuwin32\\bin\\bzcat.exe"#, # Forward its output to bzcat
' | python "D:\Cloud\Dropbox\Coding\GitHub\pyTwitter\pyTwitter_filehandling.py"', # Extract Tweets
' | "C:\Program Files\PostgreSQL\9.4\bin\psql.exe" -1f copy.sql ' + secret_login_d
]
module_call = "".join(pipeline)
module_call = "H: && "C:\Tools\GnuWin32\gnuwin32\bin\bsdtar" vxOf "Twitter datastream/Sourcefiles/archiveteam-twitter-stream-2013-01.tar" "*.bz2" | C:\Tools\GnuWin32\gnuwin32\bin\bzcat.exe | python "D:\Cloud\Dropbox\Coding\GitHub\pyTwitter\pyTwitter_filehandling.py" | "C:\Program Files\PostgreSQL\9.4in\psql.exe" -1f copy.sql "user=xxx password=xxx host=localhost port=5432 dbname=xxxxxx""
When executing the code for TAR, the TAR file is outputted to the CMD prompt, hinting me that all is well. However, the bzcat line brings an error:
x 01/29/06/39.json.bz2
bzcat.exe: Data integrity error when decompressing.
Input file = (stdin), output file = (stdout)
It is possible that the compressed file(s) have become corrupted.
You can use the -tvv option to test integrity of such files.
Running -tvv gives me:
huff+mtf data integrity (CRC) error in data
I've tried to extract the same archive with 7-zip (GUI): this still works. Any help on how to troubleshoot this would be greatly appreciated. I'm running Windows 8.1 with GNUWin32.