
I am just starting to learn Python and have a question.

How do I create a script to do the following? (I will write how I do it in bash.)

  1. Copy <file>.gz from remote server1 to local storage.

    cp /dumps/server1/file1.gz /local/

  2. Then extract that file locally.

    gunzip /local/file1.gz

  3. Then copy the extracted file to remote server2 (for archiving and deduplication purposes).

    cp /local/file1.dump /dedupmount

  4. Delete the local copy of the .gz file to free space on the "temporary" storage.

    rm -rf /local/file1.gz

I need to run all that in a loop for all files. All files and directories are NFS-mounted on the same server.

A for loop goes through the /dump/ folder and looks for .gz files. Each .gz file will first be copied to the /local directory and then extracted there. Once extracted, the unzipped .dmp file will be copied to the /dedupmount folder for archiving.

I'm just banging my head against the wall trying to figure out how to write this.

bflance

2 Answers


Python Solution

While the shell code might be shorter, the whole process can be done natively in Python. The key points of the Python solution are:

  • With the gzip module, gzipped files are as easy to read as normal files.

  • To obtain the list of source files, the glob module is used. It is modeled after the shell glob feature.

  • To manipulate paths, use Python's os.path module. It provides an OS-independent interface to the file system.

Here is sample code:

import gzip
import glob
import os.path
source_dir = "/dumps/server1"
dest_dir = "/dedupmount"

for src_name in glob.glob(os.path.join(source_dir, '*.gz')):
    # build the destination name by dropping the directory and the ".gz" suffix
    base = os.path.basename(src_name)
    dest_name = os.path.join(dest_dir, base[:-3])
    # stream-decompress straight from the server1 mount to the server2 mount
    with gzip.open(src_name, 'rb') as infile:
        with open(dest_name, 'wb') as outfile:
            for line in infile:
                outfile.write(line)

This code reads from the remote server1 and writes to the remote server2. There is no need for a local copy unless you want one.

In this code, all decompression is done by the CPU on the local machine.

Shell Code

For comparison, here is the equivalent shell code:

for src in /dumps/server1/*.gz
do
    base=${src##*/}
    dest="/dedupmount/${base%.gz}"
    zcat "$src" >"$dest"
done

Three-Step Python Code

This slightly more complex approach implements the OP's three-step algorithm, which uses a temporary file on the local machine:

import gzip
import glob
import os.path
import shutil

source_dir = "/dumps/server1"
dest_dir = "/dedupmount"
tmpfile = "/tmp/delete.me"

for src_name in glob.glob(os.path.join(source_dir, '*.gz')):
    base = os.path.basename(src_name)
    dest_name = os.path.join(dest_dir, base[:-3])
    # step 1: copy the .gz file from server1 to local temporary storage
    shutil.copyfile(src_name, tmpfile)
    # steps 2 and 3: decompress the local copy and write the result to server2
    with gzip.open(tmpfile, 'rb') as infile:
        with open(dest_name, 'wb') as outfile:
            for line in infile:
                outfile.write(line)

This copies the source file to a temporary file on the local machine, tmpfile, and then gunzips it from there to the destination file. tmpfile is overwritten on every pass through the loop.

Temporary files can be a security issue. To avoid this, place the temporary file in a directory that is writable only by the user who runs this script.
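For example, here is an untested sketch of the same three-step loop that uses Python's tempfile module so that each pass gets its own private scratch file on the local disks (the /local scratch directory is assumed from the question) and removes it afterwards, which also covers the OP's step 4:

import glob
import gzip
import os
import os.path
import shutil
import tempfile

source_dir = "/dumps/server1"
dest_dir = "/dedupmount"
scratch_dir = "/local"    # assumed mount point of the fast local disks

for src_name in glob.glob(os.path.join(source_dir, '*.gz')):
    base = os.path.basename(src_name)
    dest_name = os.path.join(dest_dir, base[:-3])
    # mkstemp creates a private (mode 0600) scratch file and returns an open
    # descriptor plus its path; close the descriptor since shutil.copyfile
    # reopens the path itself
    tmp_fd, tmp_path = tempfile.mkstemp(suffix='.gz', dir=scratch_dir)
    os.close(tmp_fd)
    try:
        shutil.copyfile(src_name, tmp_path)          # step 1: copy to local storage
        with gzip.open(tmp_path, 'rb') as infile:    # step 2: extract locally
            with open(dest_name, 'wb') as outfile:   # step 3: write to the dedup mount
                shutil.copyfileobj(infile, outfile)
    finally:
        os.remove(tmp_path)                          # step 4: free the local space

The try/finally ensures the local copy is removed even if the decompression fails partway through.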

John1024
  • Thanks a lot for that thorough explanation :) The thing is, I do need that "local" dir for extractions, as the files are very big (around 100GB zipped and over 1TB unzipped in size), so I want all the I/O to go to local server disks, and once extracted, to copy the unzipped .dmp files to the dedup mount on the remote deduplication storage. So that's why it's a 3-step action: 1. copy from the source server to local disk; 2. extract locally; 3. copy the extracted files to the remote dedup storage :) – bflance Oct 27 '14 at 21:10
  • @bflance Actually, in the current answer, `zcat`/gzip does run locally. Consequently, _the extraction is done locally_. The only code running on server1 is NFS, and it is just doing the bare minimum to read the source file. Likewise, the only code running on server2 is NFS, which is writing the file to disk. This is possible, in large part, because the gzip format was designed for use in pipelines. – John1024 Oct 27 '14 at 21:31
  • Yes, but since it's extraction through a pipe, it will keep "piping" that .gz file from the remote server until it's extracted, and that is a long time and a lot of I/O (unless I am missing something). These files are huge and we want to limit I/O on the remote servers as much as possible to prevent performance impact as well as the chance of "stalling" the NFS mounts. That's why we have a "dedicated" server just for extraction of .gz files to its fast local drives in a RAID0 array. Extracting straight from the remote server would generate too much load, I guess. – bflance Oct 28 '14 at 08:38
  • This is the design plan: https://dl.dropboxusercontent.com/u/38751572/dedup-design.png – bflance Oct 28 '14 at 08:44
  • @bflance Although it will not change the total I/O from server1, the three-step plan is simple enough: see the updated answer. – John1024 Oct 28 '14 at 20:52
  • I got this error: File "decompress_gz.py", line 24, in outfile.write(line) TypeError: write() argument must be str, not bytes – BhishanPoudel Jul 18 '16 at 15:40
  • @BhishanPoudel OK. It looks like you were using Python3. I have updated the code to use `open(dest_name, 'wb')` and this should make it work with Python3 also. – John1024 Jul 18 '16 at 16:32

You can use the urllib module:

import urllib
# urlretrieve will save the file to the local drive
# (in Python 3 this lives at urllib.request.urlretrieve)
urllib.urlretrieve(url, file_name_to_save)

Now you can use the gunzip utility to extract it, via os.system.
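A rough sketch of that approach under Python 2 (where urlretrieve lives directly in urllib); the URL and paths below are placeholders, and the source would need to be reachable as a URL (for example via file:// on the NFS mount):

import os
import urllib

# placeholder names; adjust the URL and the local path to your environment
url = "file:///dumps/server1/file1.gz"
local_gz = "/local/file1.gz"

urllib.urlretrieve(url, local_gz)   # save the remote file to the local drive
os.system("gunzip " + local_gz)     # shell out to gunzip; leaves /local/file1

Note that os.system does not raise an error when gunzip fails, so checking its return value (or using subprocess.call) is worthwhile; the extracted file can then be copied to /dedupmount with shutil.copy.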

Hackaholic