
I have a program that handles about 500 000 files {Ai}, and for each file it fetches a definition {Di} used for the parsing.

For now, each file {Ai} is parsed by a dedicated Celery task, and each time the definition file {Di} is parsed again to generate an object (a JSON representation). This object is then used to parse the file {Ai}.

I would like to store the generated definition object {Di(object)} so that it is available to all the tasks.

So I am wondering which would be the better way to manage it:

  1. Memcached + python-memcached, or
  2. a long-running task that "stores" the object and exposes a set(add)/get interface.

For performance and memory usage, what would be the best choice?

Ali SAID OMAR

1 Answer


Using Memcached sounds like a much easier solution. A task is for processing and Memcached is for storage, so why use a task for storage?

Personally I'd recommend using Redis over memcached.
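For illustration, here is a minimal sketch of that caching pattern with Celery and python-memcached (the same idea works with Redis). The broker URL, server address, and the parse_definition/parse_file functions are placeholders standing in for your own code, not anything given in the question:

    import hashlib
    import json
    import memcache                      # python-memcached
    from celery import Celery

    app = Celery("parser", broker="redis://localhost:6379/0")   # broker URL is an assumption
    mc = memcache.Client(["127.0.0.1:11211"])

    def get_definition(definition_path):
        """Return the parsed definition, caching its JSON form in memcached."""
        # hash the path so the key stays short and free of spaces
        key = "definition:" + hashlib.md5(definition_path.encode("utf-8")).hexdigest()
        cached = mc.get(key)
        if cached is not None:
            return json.loads(cached)
        definition = parse_definition(definition_path)    # placeholder for your own parser
        mc.set(key, json.dumps(definition), time=3600)    # serialize once, reuse from every worker
        return definition

    @app.task
    def process_file(file_path, definition_path):
        definition = get_definition(definition_path)
        return parse_file(file_path, definition)          # placeholder for your own parsing logic

This way a given {Di} is parsed roughly once per cache expiry instead of once per file, at the cost of one json.dumps/json.loads round trip per task.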

An alternative would be to try ZODB - it stores Python objects natively. If your application really suffers from serialization overhead maybe this would help. But I'd strongly recommend testing this with your real workload against JSON/memcached.
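If you do want to experiment with ZODB, a minimal sketch could look like this (the file name and key are made up for the example):

    from ZODB.DB import DB
    from ZODB.FileStorage import FileStorage
    import transaction

    db = DB(FileStorage("definitions.fs"))
    conn = db.open()
    root = conn.root()

    # store a nested list-of-dicts definition object natively, no JSON step needed
    root["definition:D1"] = [{"field": "name", "children": [{"field": "id"}]}]
    transaction.commit()

    # later, read it back through another connection to the same database
    definition = root["definition:D1"]

    conn.close()
    db.close()

Note that a plain FileStorage can only be opened by one process at a time, so sharing it between many Celery workers would need something like a ZEO server in front of it, which is extra moving parts compared with memcached/Redis.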

scytale
  • Because I need to store nested objects and I'm worried about the cost of serialization/deserialization. – Ali SAID OMAR Aug 07 '15 at 12:40
  • will you not have to serialize to get the object to your long-running task? what kind of get/set interface do you think it should have? – scytale Aug 07 '15 at 12:53
  • and are these native python objects that you need to store? – scytale Aug 07 '15 at 12:54
  • The object is a list of dicts that can themselves contain dicts. Basically I'm just looking for an efficient way to share data between tasks. – Ali SAID OMAR Aug 07 '15 at 14:16
  • 1
    from a coding point of view json->memcached is the easiest solution. whether serialisation overhead will be a real problem depends entirely on your application. We do something similar to share data between tasks and we find that while json serialization takes significant time for very large data structures a) this is dwarfed by the time required for the calculations b) it is worth it because json is so easy to deal with. – scytale Aug 07 '15 at 16:26
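Following up on that last comment, a quick way to check whether the JSON round trip is actually significant for your own {Di} objects (the sample structure below just stands in for a real definition):

    import json
    import timeit

    definition = [{"field": "name", "children": [{"field": "id"}] * 50}] * 1000   # stand-in for a real {Di}
    serialized = json.dumps(definition)

    dump_time = timeit.timeit(lambda: json.dumps(definition), number=100) / 100
    load_time = timeit.timeit(lambda: json.loads(serialized), number=100) / 100

    print("dumps: %.5f s/call, loads: %.5f s/call" % (dump_time, load_time))
    # compare these numbers with the time one task spends parsing a file {Ai};
    # if the parsing dominates, the serialization overhead is not worth optimizing away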