
I have a project that needs to be converted from C++ to Python. The project was originally written by other people. The logic is described below:

"It loaded 20 plain text files in which each line represents a record of 7 columns. It then parses each line and loaded all data into objects of C++ vectors and maps. It then uses binary serialization to output C++ vectors and maps into corresponding 20 binary files".

I don't know why the original author used binary serialization, but I can guess three possible reasons:

  1. Binary files are smaller than the plain text files.
  2. Reduced initialization time, since the system starts by loading these files into memory and populating data structures.
  3. It MIGHT also speed things up at runtime, beyond the initialization stage, although that seems unlikely.

I have little experience with object serialization, and now that I am rewriting this in Python, I don't know whether I should use it. In Python, does object serialization noticeably speed up a program, or reduce the size of the files stored on disk? Are there any other benefits? I know the downside is that it makes the program more complicated.

When implementing the logic described above in Python, should I use object serialization as well?

marlon
  • How do you save objects to disk to use for later *without* serializing them? – rchome Dec 09 '21 at 21:42
  • Almost certainly, you should use `numpy` where they are using C++ vectors. In that case, reading a numpy array that was saved to disk with binary serialization (e.g. `numpy.save` and `numpy.load`) will be much faster than parsing a text file. – juanpa.arrivillaga Dec 09 '21 at 21:45
  • @rchome I just load from the original text file, parse each line, and populate the data structures. – marlon Dec 09 '21 at 21:48
  • @marlon well you see, that *is* serialization, just not very good serialization, since you are very likely re-inventing the wheel, just less efficiently. And now you have to maintain code that does that logic, when you probably should be using highly tested and tuned code to do it. For `dict` objects, probably `pickle`; for `numpy.ndarray` objects, `numpy.save` and `numpy.load`, although I believe `pickle` will essentially use those under the hood anyway (see the sketch after these comments). – juanpa.arrivillaga Dec 09 '21 at 22:13
  • It's not a standard dict or list or set; I have to parse each line to populate those structures. I guess the serialization can save this parsing time, though it might not be big. – marlon Dec 09 '21 at 22:35
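To make the suggestion in the comments concrete, here is a minimal sketch of the pickle approach (`numpy.save`/`numpy.load` would follow the same shape for array data). The file names, the tab separator, and the choice of a dict keyed on the first column are illustrative assumptions, not details from the original project:

```python
import pickle

def load_from_text(path):
    """Parse a tab-separated 7-column text file into a dict
    keyed on the first column (layout assumed for illustration)."""
    records = {}
    with open(path) as f:
        for line in f:
            cols = line.rstrip("\n").split("\t")
            records[cols[0]] = cols[1:]
    return records

# One-time conversion: parse the text once, then serialize the parsed structure.
records = load_from_text("data01.txt")
with open("data01.pkl", "wb") as f:
    pickle.dump(records, f)

# Every later run can skip the parsing step entirely.
with open("data01.pkl", "rb") as f:
    records = pickle.load(f)
```

On later runs, only the `pickle.load` step is needed; skipping the per-line parsing is the saving the comments are pointing at.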

1 Answer


Sounds like it's almost certainly the case that the point is to avoid this part: "It then parses each line and loads all the data into C++ vectors and maps".

If the binary machine data in the "C++ vectors and maps" can later be loaded directly, then all the expenses of parsing the text to recreate them from scratch can be skipped.

Can't answer whether you "should" do the same, though, without knowing details of the expenses involved. Not enough info here to guess.
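Since only a measurement on the real data can settle the "should I" question, a rough timing harness like the following (reusing the hypothetical `load_from_text` parser and file names from the sketch above) would show how much parsing time serialization actually saves in a specific case:

```python
import pickle
import time

# Time a cold parse of the text file.
start = time.perf_counter()
records = load_from_text("data01.txt")   # parser from the sketch above
print(f"parse text:  {time.perf_counter() - start:.3f}s")

# Write the parsed structure out once.
with open("data01.pkl", "wb") as f:
    pickle.dump(records, f)

# Time loading the pre-parsed binary file instead.
start = time.perf_counter()
with open("data01.pkl", "rb") as f:
    records = pickle.load(f)
print(f"load pickle: {time.perf_counter() - start:.3f}s")
```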

Tim Peters
  • So the original author's purpose in serializing was to speed up system initialization? If I can load from the plain text files in Python fairly quickly, I can go without serialization, since it's a one-time job. Does that make sense? – marlon Dec 09 '21 at 21:50
  • Presumably the data is used more than once. The expense of parsing the data from text is skipped every time it's loaded from a pre-parsed binary file instead. This is why, e.g., CPython itself creates a binary `.pyc` file from each `.py` file you import: generating binary byte code from a text `.py` file isn't actually slow, but is _much_ slower than reading the binary `.pyc` file instead (a sketch of this cache-and-regenerate pattern follows this thread). But, again, you haven't given us nearly enough info for anyone to quantify what the tradeoffs may be in your specific case. – Tim Peters Dec 09 '21 at 21:58
  • I am sure the data is loaded ONCE, at system initialization, into vectors and maps; of course the in-memory vectors and maps are then used many times. If the loading time is not big, there is probably not much benefit in using serialization? – marlon Dec 09 '21 at 22:06
  • If the data is in fact used only once until the end of the universe, then there's no point whatsoever to serializing it - complete waste of time. – Tim Peters Dec 09 '21 at 22:08
  • "used only once": what do you mean? For many machine learning models, they are stored as binary format, and they are loaded once into data structures and those data structures in memory are used throughout the process to the end. Does this mean 'used only once'? – marlon Dec 09 '21 at 22:23
  • "Once" is really self-explanatory - don't over-think it ;-) Again, you've given us almost no info to go on. If, after running your program, you can throw away the text and binary files forever, then they're used "only once". But if you _may_ have to read them up again for another run of the program, then the data is not being used only once "until the end of the universe". – Tim Peters Dec 09 '21 at 22:29
  • I don't think there is any magic beyond what I described above. Think about a typical configuration file, which a system usually reads once at start-up time. I think my data is used in that way. Compared to a small configuration file, my data is big, with a more complex structure; perhaps that's the purpose of the serialization? – marlon Dec 09 '21 at 22:38
  • If the data may be used on more than just one run of the program, then, yes, reading the data from a binary file will save the expense of parsing it again each & every time. But I'm saying nothing new here anymore, and you're giving no more useful information about your specific situation than you gave at the start. This is worse than pulling teeth ;-) – Tim Peters Dec 09 '21 at 22:42
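The `.pyc` analogy in the thread suggests a simple pattern for data that may be re-read on later runs: treat the text file as the source of truth, and regenerate the binary file only when it is missing or stale. A minimal sketch, reusing the hypothetical `load_from_text` parser and file names from the earlier sketches:

```python
import os
import pickle

def load_records(text_path, cache_path):
    """Load from the binary cache when it is newer than the text source;
    otherwise parse the text and refresh the cache (like CPython's .pyc files)."""
    if (os.path.exists(cache_path)
            and os.path.getmtime(cache_path) >= os.path.getmtime(text_path)):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    records = load_from_text(text_path)  # the hypothetical parser sketched earlier
    with open(cache_path, "wb") as f:
        pickle.dump(records, f)
    return records

records = load_records("data01.txt", "data01.pkl")
```

With this arrangement the first run pays the parsing cost once, every later run loads the binary cache, and editing the text file automatically triggers a re-parse.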
  • If the data may be used on more than just one run of the program, then, yes, reading the data from a binary file will save the expense of parsing it again each & every time. But I'm saying nothing new here anymore, and you're giving no more useful information about your specific situation than you gave at the start. This is worse than pulling teeth ;-) – Tim Peters Dec 09 '21 at 22:42