
I need to transfer a very large dataset (between 1 and 10 million records, possibly many more) from a domain-specific language (DSL), whose sole output mechanism is a C-style fprintf statement, to Python.

Currently, I'm using the DSL's fprintf to write records to a flat file. The flat file looks like this:

x['a',1,2]=1.23456789012345e-01
x['a',1,3]=1.23456789012345e-01
x['a',1,4]=1.23456789012345e-01
y1=1.23456789012345e-01
y2=1.23456789012345e-01
z['a',1,2]=1.23456789012345e-01
z['a',1,3]=1.23456789012345e-01
z['a',1,4]=1.23456789012345e-01

As you can see, the structure of each record is very simple (but representing the double-precision float as a 20-character string is grossly inefficient!):

<variable-length string> + "=" + <double-precision float>

I'm currently using Python to read each line and split it on the "=".

Is there anything I can do to make the representation more compact, so that it is faster for Python to read? Is some sort of binary encoding possible with fprintf?
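
For context, the reader is essentially the sketch below (load_records is just an illustrative name and the details are simplified):

def load_records(path):
    # Each line is "<variable-length key>=<double rendered as a 20-char string>".
    records = {}
    with open(path) as f:
        for line in f:
            key, value = line.rstrip("\n").split("=")
            records[key] = float(value)
    return records

# e.g. load_records("data.dat")["x['a',1,2]"] -> 0.123456789012345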

Gilead

2 Answers


A compact binary format for serializing float values is defined in the Basic Encoding Rules (BER), where they are called "reals". BER implementations for Python are available, and one is also not too hard to write yourself; there are libraries for C as well. You could use this format (that's what it was designed for) or one of its variants (CER, DER). One such Python implementation is pyasn1.
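
A minimal sketch of what this might look like with pyasn1 (assuming its univ.Real type and BER encoder/decoder modules; check the library documentation for the exact API in your version):

from pyasn1.type import univ
from pyasn1.codec.ber import encoder, decoder

# Encode a double as an ASN.1 REAL in BER -- a short byte string.
encoded = encoder.encode(univ.Real(1.23456789012345e-01))

# Decode it back; decode() returns the value plus any trailing bytes.
value, rest = decoder.decode(encoded, asn1Spec=univ.Real())
print(float(value))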

Keith
  • It turns out the bottleneck wasn't Python's read operation. However, I'd never heard of BER before, and it sounds like something I might use in the future. Your answer also gave me hints as to what to wiki ("data serialization", "asn.1", etc.). Now I know what to look for the next time I need a serialization format. Thanks very much! – Gilead Jan 08 '13 at 16:16

Err.... How many times per minute are you reading this data from Python?

Because on my system I can read such a file with 20 million records (~400 MB) in well under a second.

Unless you are running this on limited hardware, I'd say you are worrying about nothing.

>>> from timeit import timeit
>>> timeit("all(b.read(20) for x in xrange(0, 20000000, 20))", "b=open('data.dat')", number=1)
0.2856929302215576
>>> c = open("data.dat").read()
>>> len(c)
380000172
jsbueno
  • Thanks for your answer. I profiled my code with 20 million randomly generated records. It took 4.8 seconds to read, which is still quite acceptable. I guess the real bottleneck was the string split operation on every record (which I thought would benefit from a shorter string obtained through encoding), but after changing it from `line.split("=")` to `line.split("=", 1)`, I got a reasonable speed-up. So it seems that I don't really need to encode after all. – Gilead Jan 08 '13 at 16:13
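
In case it helps anyone else, the change described in that comment amounts to roughly this (both forms give the same result on these records; the second simply stops scanning once the first "=" is found):

# before: looks for every "=" in the line
key, value = line.split("=")
# after: splits only on the first "=", which is all that's needed here
key, value = line.split("=", 1)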