
I have a script that reads data from a binary stream of "packets" containing "parameters". The parameters read from each packet are stored in a dictionary, and each dictionary is appended to an array representing the packet stream.

At the end, this array of dicts is written to an output CSV file.

Among the read data is a CUC7 datetime, stored as coarse/fine integer parts of a GPS time, which I also want to convert to a UTC ISO string.

from astropy.time import Time

def cuc2gps_time(coarse, fine):
    # CUC7 on-board time: coarse = whole GPS seconds, fine = 1/2**24-second ticks
    return Time(coarse + fine / (2**24), format='gps')


def gps2utc_time(gps):
    # Re-express the GPS-scale Time in the UTC scale, with ISOT as the output format
    return Time(gps, format='isot', scale='utc')

The issue is that these two time conversions account for 90% of my script's total runtime, while everything else my script does (reading the binary file, decoding 15 other parameters, writing to CSV) fits in the remaining 10%.

I somewhat improved the situation by doing the conversions in batches on NumPy arrays, instead of packet by packet. This roughly halves the total runtime.

import numpy as np

while end_not_reached:

    # Read 1 packet

    # (...)

    nb_packets += 1
    end_not_reached = ... # boolean

    # Process in batches for better performance
    if not nb_packets % 1000 or not end_not_reached:
        # Convert CUC7 time to GPS and UTC times
        all_coarse = np.array([packet['lobt_coarse'] for packet in packets])
        all_fine = np.array([packet['lobt_fine'] for packet in packets])
        all_gps = cuc2gps_time(all_coarse, all_fine)
        all_utc = gps2utc_time(all_gps)

        # Add times to each packet
        for packet, gps_time, utc_time in zip(packets, all_gps, all_utc):
            packet.update({'gps_time': gps_time, 'utc_time': utc_time})

But my script is still absurdly slow: reading 60,000 packets from a 1.2 GB file and writing the CSV takes 12 s, versus only 2.5 s if I remove the time conversion.

So:

  1. Is it expected that Astropy's time conversions are this slow? Am I using the library wrong? Is there a better one?
  2. Is there a way to improve my current implementation? I suspect the remaining "for" loop is very costly, but I could not find a good way to replace it.
Guiux
  • So I suppose this is a real-time stream of binary data, and it's not possible to read all the packets in at once? – Roy Smart Feb 03 '23 at 17:22
  • The last time I tried to optimize slow code using Astropy, I found that Astropy is written in pure Python, making it particularly slow (at least 1 order of magnitude), and also that it makes heavy internal use of very inefficient functions/data structures, making it even slower (at least 2 orders of magnitude). It tends to be good for convenience and accuracy, but definitely not for performance... – Jérôme Richard Feb 03 '23 at 18:28
  • 1
    Indeed the astropy Time class is written first for accuracy, ease of understanding, and extensibility. That said, sometimes performance matters. As mentioned, providing array inputs is the best strategy if possible. If not, it is possible to use the `erfa` C-library converters directly to convert from GPS => JD1,JD2 (TAI) => JD1,JD2 (UTC) => ISOT. It's relatively obvious how to do this from the astropy source, but I can help if you want to go in this direction. Putting fast converters into astropy has been on my mind. – Tom Aldcroft Feb 03 '23 at 19:22
  • @RoySmart, no, it's not real time, I want to read a big binary file. But still, I cannot read all the packets at once, because I don't know their size in advance. I need to read one packet to know where the next one starts. – Guiux Feb 04 '23 at 20:02
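A sketch of the direct `erfa` route Tom Aldcroft's comment describes (GPS seconds → two-part Julian Date in TAI → UTC → ISOT), using the vectorized pyerfa wrappers that astropy itself builds on. The constants, the helper name, and the string formatting are my own additions, so treat this as an untested illustration rather than a drop-in replacement:

```python
import numpy as np
import erfa  # pyerfa: the compiled ERFA routines astropy uses internally

GPS_EPOCH_JD = 2444244.5  # 1980-01-06T00:00:00 UTC, start of GPS time
TAI_MINUS_GPS = 19.0      # TAI leads GPS by a constant 19 s


def cuc2utc_isot(coarse, fine):
    """CUC7 coarse/fine arrays (GPS seconds) -> UTC ISOT strings, bypassing Time."""
    gps_sec = np.asarray(coarse, np.float64) + np.asarray(fine, np.float64) / 2**24
    # GPS -> TAI, expressed as a two-part Julian Date (big epoch part + small offset)
    tai2 = (gps_sec + TAI_MINUS_GPS) / 86400.0
    utc1, utc2 = erfa.taiutc(GPS_EPOCH_JD, tai2)            # TAI -> UTC (leap seconds)
    iy, im, iday, ihmsf = erfa.d2dtf('UTC', 3, utc1, utc2)  # -> calendar date + h/m/s/ms
    return np.array(['%04d-%02d-%02dT%02d:%02d:%02d.%03d' % (y, mo, d, *hmsf)
                     for y, mo, d, hmsf in zip(iy, im, iday, ihmsf)])
```

At the GPS epoch the result should match astropy's `Time(0, format='gps').utc.isot`, i.e. `cuc2utc_isot(np.array([0]), np.array([0]))` gives `1980-01-06T00:00:00.000`.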

1 Answer


I think the problem is that you're looping over your sequence of packets multiple times. In Python, I would recommend keeping one array per parameter, instead of a list of objects that each hold a bunch of scalar parameters.

If you can read all the packets in at once, I would recommend something like:

num_bytes = ...
num_bytes_per_packet = ...

num_packets = num_bytes // num_bytes_per_packet  # integer division

# one NumPy array per parameter (struct-of-arrays layout)
param_1 = np.empty(num_packets)
param_2 = np.empty(num_packets)
...
time_coarse = np.empty(num_packets)
time_fine = np.empty(num_packets)
...
param_N = np.empty(num_packets)

for i in range(num_packets):
    param_1[i], param_2[i], ..., time_coarse[i], time_fine[i], ..., param_N[i] = decode_packet(...)

# single vectorized conversion over all packets
time_gps = Time(time_coarse + time_fine / (2**24), format='gps')
time_utc = time_gps.utc.isot
Roy Smart
  • I don't think I can implement it like this, because: 1. I don't know in advance the number of packets, nor the length of each packet. The packet length is actually one of the parameters to read, which lets me identify where the next packet starts, hence my WHILE loop. 2. Not all packets are the same, and they can contain very different parameters. Does it still make sense to have a NumPy array for each parameter and set it to an empty value for most indexes i? – Guiux Feb 05 '23 at 13:46
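One way to reconcile the unknown packet count with the array-per-parameter idea is to keep the single-pass while loop but append parameters to plain Python lists, converting to NumPy arrays only once at the end. The length-prefixed packet layout below (u16 body length, then u32 coarse + u32 fine) is purely hypothetical, chosen just to make the sketch self-contained:

```python
import io
import struct
import numpy as np


def read_stream(stream):
    """Single pass over variable-length packets whose count is unknown:
    parameters accumulate in Python lists, converted to arrays at the end.
    (The packet layout here is hypothetical: u16 length, u32 coarse, u32 fine.)"""
    packets, coarse, fine = [], [], []
    while True:
        header = stream.read(2)
        if len(header) < 2:  # end of stream reached
            break
        (length,) = struct.unpack('>H', header)
        body = stream.read(length)
        c, f = struct.unpack_from('>II', body)
        packets.append({'lobt_coarse': c, 'lobt_fine': f})
        coarse.append(c)
        fine.append(f)
    # one list->array conversion, ready for a single vectorized Time() call
    return packets, np.asarray(coarse), np.asarray(fine)


# demonstrate on a fake two-packet stream
buf = io.BytesIO()
for c, f in [(100, 0), (200, 1 << 23)]:
    buf.write(struct.pack('>H', 8) + struct.pack('>II', c, f))
buf.seek(0)
packets, all_coarse, all_fine = read_stream(buf)
```

For heterogeneous packets, the same pattern still works with one list per parameter kind, converting each list to an array only for the parameters that actually need vectorized post-processing (like the two time fields here).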