I need to transfer a very large dataset (between 1 and 10 million records, possibly many more) from a domain-specific language (whose sole output mechanism is a C-style fprintf statement) to Python.
Currently, I'm using the DSL's fprintf to write records to a flat file. The flat file looks like this:
x['a',1,2]=1.23456789012345e-01
x['a',1,3]=1.23456789012345e-01
x['a',1,4]=1.23456789012345e-01
y1=1.23456789012345e-01
y2=1.23456789012345e-01
z['a',1,2]=1.23456789012345e-01
z['a',1,3]=1.23456789012345e-01
z['a',1,4]=1.23456789012345e-01
As you can see, the structure of each record is very simple (but representing each double-precision float as a 20-character string is grossly inefficient!):
<variable-length string> + "=" + <double-precision float>
I'm currently using Python to read each line and split it on the "=".
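A minimal sketch of that read-and-split loop (file name and container are just placeholders, not my exact code):

```python
# Read each line, split on "=", and convert the right-hand side to a float.
data = {}
with open("records.txt") as f:
    for line in f:
        key, _, value = line.rstrip("\n").partition("=")
        data[key] = float(value)
```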
Is there anything I can do to make the representation more compact, so as to make it faster for Python to read? Is some sort of binary encoding possible with fprintf?
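For instance, if each record could somehow be emitted as a fixed-width binary blob (a 2-byte key length, the key bytes, then the raw 8-byte double), I imagine the Python side could read it with struct along these lines (purely hypothetical, assuming a little-endian layout):

```python
import struct

# Hypothetical reader for a binary layout of:
#   2-byte unsigned key length, key bytes, 8-byte little-endian double.
data = {}
with open("records.bin", "rb") as f:
    while True:
        header = f.read(2)
        if not header:
            break
        (key_len,) = struct.unpack("<H", header)
        key = f.read(key_len).decode("ascii")
        (value,) = struct.unpack("<d", f.read(8))
        data[key] = value
```

But I don't know whether the DSL's fprintf can emit raw bytes like that, hence the question.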