
I have a large list of dict objects that I would like to store in a tar file so it can be exchanged remotely. I have done that successfully by writing a json.dumps() string to a tarfile object opened in 'w:gz' mode.
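
For reference, that working 'w:gz' version was roughly along these lines (the archive and member names here are just placeholders):

from json import dumps
from io import BytesIO
import tarfile

payload = dumps(data).encode('utf-8')         # data is the full list of dicts
info = tarfile.TarInfo(name='data.json')      # placeholder member name
info.size = len(payload)
with tarfile.open('output.tar.gz', 'w:gz') as tar_file:
    tar_file.addfile(info, BytesIO(payload))  # the JSON becomes a member of the archive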

I am trying for a piped implementation, opening the tarfile object in 'w|gz' mode. Here is my code so far:

from json import dump
from io import StringIO
import tarfile

with StringIO() as out_stream, tarfile.open(filename, 'w|gz', out_stream) as tar_file:
    for packet in json_io_format(data):
        dump(packet, out_stream)

This code is in a function 'write_data'. 'json_io_format' is a generator that yields one dict object at a time from the dataset (so packet is a dict).

Here is my error:

Traceback (most recent call last):
  File "pdml_parser.py", line 35, in write_data
    dump(packet, out_stream)
  File "/.../anaconda3/lib/python3.5/tarfile.py", line 2397, in __exit__
    self.close()
  File "/.../anaconda3/lib/python3.5/tarfile.py", line 1733, in close
    self.fileobj.close()
  File "/.../anaconda3/lib/python3.5/tarfile.py", line 459, in close
    self.fileobj.write(self.buf)
TypeError: string argument expected, got 'bytes'

After some troubleshooting with help from the comments, the error is raised when the 'with' statement exits and calls the context manager's __exit__. I BELIEVE that this in turn calls TarFile.close(). If I move the tarfile.open() call out of the 'with' statement and purposefully leave out the TarFile.close(), I get this code:

with StringIO() as out_stream:
    tar_file = tarfile.open(filename, 'w|gz', out_stream)
    for packet in json_io_format(data):
        dump(packet, out_stream)

This version of the program runs to completion, but does not produce the output file 'filename', and yields this error:

Exception ignored in: <bound method _Stream.__del__ of <tarfile._Stream object at 0x7fca7a352b00>>
Traceback (most recent call last):
  File "/.../anaconda3/lib/python3.5/tarfile.py", line 411, in __del__
    self.close()
  File "/.../anaconda3/lib/python3.5/tarfile.py", line 459, in close
    self.fileobj.write(self.buf)
TypeError: string argument expected, got 'bytes'

I believe that is caused by the garbage collector. Something is preventing the TarFile object from closing.

Can anyone help me figure out what is going on here?

kingledion
    Your exception isn't happening during the loop, but rather at the end of the `with` block (which is after the end of the loop). The `__close__` call to the `tarfile` context manager is having problems with the data in some way that I don't entirely understand (thus this being a comment rather than an answer). To simplify debugging, you could perhaps test by just `dump`ing one value without a loop. – Blckknght Aug 23 '16 at 19:31
  • I re-wrote the function to remove the tarfile declaration from the with statement and got the same error. I removed the tar_file.close() and now I get the error when the garbage collector tries to remove the stream object. So yes, there is something wrong with the tarfile closing. I will amend my question to reflect this, thanks for the tip. – kingledion Aug 23 '16 at 20:05

1 Answer


Why do you think you can write a tarfile to a StringIO? That doesn't work like you think it does.

This approach doesn't error, but it's not actually how you create a tarfile in memory from in-memory objects.

from json import dumps
from io import BytesIO
import tarfile

data = [{'foo': 'bar'},
        {'cheese': None},
        ]

filename = 'fnord'
with BytesIO() as out_stream, tarfile.open(filename, 'w|gz', out_stream) as tar_file:
    for packet in data:
        out_stream.write(dumps(packet).encode())
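
If the goal really is a .tar.gz with the JSON inside it as an archive member, the bytes have to go through addfile() rather than being written straight to the stream that backs the tarfile. A minimal sketch of that, built fully in memory (the member name 'packets.json' is made up):

from json import dumps
from io import BytesIO
import tarfile

data = [{'foo': 'bar'},
        {'cheese': None},
        ]

payload = dumps(data).encode()
with BytesIO() as out_stream:
    # Let the TarFile close first so the gzip trailer is flushed into out_stream.
    with tarfile.open(mode='w:gz', fileobj=out_stream) as tar_file:
        info = tarfile.TarInfo(name='packets.json')   # made-up member name
        info.size = len(payload)
        tar_file.addfile(info, BytesIO(payload))      # the JSON is now a member of the archive
    archive_bytes = out_stream.getvalue()             # the finished .tar.gz, ready to send
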
Wayne Werner
  • What I was trying to do was to use the StringIO to fill the tarfile. I wanted to do this in a piped manner, so that I could be processing the data and feeding it through a pipe to this script, while this script would feed that data in json format into a tar file. Every now and then, I will bundle up the tar file and send it off. I am getting the impression from your answer that what I am actually doing is sending the tarfile into the output stream. I don't understand how I am doing that. – kingledion Aug 23 '16 at 20:56
  • When you say `piped manner`, do you actually mean `in memory`? Because those are two different things, and it *looks* like you're trying to do the latter. If that's the case what you *want* to be doing is creating a tarfile, like I do with the `BytesIO`, because hey, tarfiles are binary! What you do next depends on if you want a single file with a bunch of JSON data in it (which is not a valid JSON file) or if you want a bunch of JSON files in your tar. You can either a) open up one file, and when it gets big enough, tar it up and send it off or b) open up a file for each chunk of data – Wayne Werner Aug 23 '16 at 21:55
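
For completeness, a minimal sketch of option (b) using the streamed 'w|gz' mode from the question, assuming json_io_format(data) yields dicts as described there (the per-packet member names are made up):

from json import dumps
from io import BytesIO
import tarfile

with tarfile.open(filename, 'w|gz') as tar_file:        # filename as in the question
    for i, packet in enumerate(json_io_format(data)):   # json_io_format as in the question
        payload = dumps(packet).encode('utf-8')
        info = tarfile.TarInfo(name='packet_{:06d}.json'.format(i))  # made-up name pattern
        info.size = len(payload)
        tar_file.addfile(info, BytesIO(payload))         # one JSON member per packet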