
I'm writing a program that requires high memory usage. I use Python 3.7.10. During the program I create about 3 GB of Python objects and modify them. Some of the objects I create contain references to other objects. Also, sometimes I need to deep-copy one object to create another.

My problem is that creating and modifying these objects takes a lot of time and causes performance issues. I wish I could do some of the creation and modification in parallel. However, there are some limitations:

  • the program is very CPU-bound and there is almost no IO/network usage, so the threading library will not help due to the GIL
  • the system I work with has no copy-on-write feature, so using the multiprocessing library spends a lot of time forking the process
  • the objects do not contain numbers and most of the work in the program is not mathematical, so I cannot benefit from numpy or ctypes

What would be a good alternative for managing this kind of memory that would let me parallelize my code better?

Idan
  • What types of objects? If they are mainly numbers, numpy or pandas may help (shared memory). If at least a lot of objects of the same type are used, arrays of ctypes types could be used instead of the usual Python objects. – Michael Butscher Sep 25 '22 at 09:51
  • @MichaelButscher added a note - the objects do not contain numbers and most of the work in the program is not mathematical – Idan Sep 25 '22 at 10:00
  • 1. Can you share a bit more about the nature of the code? For example, graph algorithms could still work with numpy. 2. If you create the multiprocessing pool early (before creating tons of objects), forking should still be cheap (see the sketch after these comments). 3. Did you consider other parallelization approaches like MPI? – Homer512 Sep 25 '22 at 10:36
  • Re, "...due to the GIL..." Maybe you need to consider writing the program in a different language. – Solomon Slow Sep 25 '22 at 14:54
  • Re, "deep copy... takes a lot of time." If your program spends most of its time just moving bytes around, then even in some other language, multiple threads may not help as much as you hope. When threads share variables, they must share through _main memory._ In most computer systems, there is only _one_ path in and out of main memory, and when multiple CPUs want to use it, the hardware makes them wait their turn. – Solomon Slow Sep 25 '22 at 14:57
  • Your question is just way too vague for us to help you. There are certainly techniques that come to mind, e.g. changing object representation, introducing memoization, algorithmic improvements, or tuning multiprocessing. But without some details, that's just impossible – Homer512 Sep 27 '22 at 15:24
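
As a concrete illustration of the "create the pool early" suggestion from the comments, here is a minimal sketch assuming the per-object work can be expressed as a top-level function; build_object and its dummy payload are hypothetical stand-ins for the real workload:

    import multiprocessing as mp

    def build_object(seed):
        # Placeholder for the CPU-bound object creation/modification work.
        return {"id": seed, "children": [str(seed)] * 100}

    if __name__ == "__main__":
        # Fork the workers while the parent process is still small, so the
        # cost of forking does not scale with the 3 GB object graph.
        with mp.Pool(processes=4) as pool:
            objects = pool.map(build_object, range(1000))

Note that results still have to be pickled back to the parent process, so this helps most when the per-object work is expensive relative to the object's size.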

3 Answers


Deepcopy is extremely slow in Python. A possible solution is to serialize the objects to disk and load them back. See this answer for viable options – perhaps ujson and cPickle. Furthermore, you can read and write the serialized files asynchronously using aiofiles.
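
For example, a pickle round-trip can serve as a drop-in substitute for copy.deepcopy on large graphs of plain Python objects; the fast_deepcopy helper name below is just for illustration:

    import pickle

    def fast_deepcopy(obj):
        # Serializing and immediately deserializing is often faster than
        # copy.deepcopy for big nested structures of picklable objects.
        return pickle.loads(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL))

This only works for objects that pickle cleanly (no open files, sockets, lambdas, etc.), so treat it as a sketch rather than a universal replacement.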

user1635327

Could you use your GPU RAM via CUDA? https://developer.nvidia.com/how-to-cuda-python If it doesn't need to be real-time, I'd use PySpark (see the streaming section: https://spark.apache.org/docs/latest/api/python/) and work with remote machines. Can you tell me a bit about the application? Perhaps you're looking for something like the PyTorch framework (https://pytorch.org/).
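
If you go the PySpark route, a minimal sketch could look like the following; process_object and the seed values are hypothetical placeholders for the actual workload:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("parallel-objects").getOrCreate()
    sc = spark.sparkContext

    def process_object(seed):
        # Placeholder for the CPU-bound object creation/modification work.
        return {"id": seed, "payload": [str(seed)] * 100}

    # Distribute the work across the executors and collect the results
    # back to the driver.
    results = sc.parallelize(range(1000), numSlices=8).map(process_object).collect()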

bieboebap

You may also want to try Transparent Huge Pages with a hugepage-aware allocator such as tcmalloc. That may speed up your application by 5-15% without changing a line of code.

See thp-usage for more information.

Maxim Egorushkin