
I basically want to serialize a large object and store it in Redis. The serialization and deserialization cost is close to 0.

This file should be accessible by multiple 'apps'. My problem is that it takes quite some time (0.6 s) to get the object, even though both Redis and the app currently run on localhost.

Is Redis the wrong tool for such a job?
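
For reference, a minimal sketch of the round trip described here, assuming the object is a pandas DataFrame serialized with pyarrow (as the comments below mention) and a local Redis instance; the key name and the DataFrame are illustrative only:

```python
import time

import pandas as pd
import pyarrow as pa
import redis

r = redis.Redis(host="localhost", port=6379)

# Stand-in for the real ~300 MB DataFrame.
df = pd.DataFrame({"x": range(1_000_000)})

# Serialize to Arrow IPC stream bytes and store the whole blob under one key.
table = pa.Table.from_pandas(df)
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, table.schema) as writer:
    writer.write_table(table)
r.set("big_df", sink.getvalue().to_pybytes())

# Read it back, timing the GET and the deserialization separately to see
# where the 0.6 s goes: moving bytes out of Redis vs. rebuilding the DataFrame.
t0 = time.perf_counter()
blob = r.get("big_df")
t1 = time.perf_counter()
restored = pa.ipc.open_stream(blob).read_all().to_pandas()
t2 = time.perf_counter()
print(f"GET: {t1 - t0:.3f}s  deserialize: {t2 - t1:.3f}s")
```

Timing the two steps separately shows whether the 0.6 s is spent pulling the blob out of Redis or rebuilding the DataFrame on the client side.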

zacko
  • This is difficult to answer with the given information. Are you saying that the serialization IS close to zero or SHOULD BE close to zero? Perhaps explain your flow a bit. Is this object being produced by a web upload? An incoming event? How many objects could need handling at the same time? Are there only a handful of objects total, or potentially infinite? Basically, what kind of scale are you dealing with? Long story short, Redis may have some limitations with those kinds of sizes. – JoeW Jun 17 '22 at 14:20
  • 1) Serialization IS zero. I am using pyarrow to serialize the data (to_pybytes) and I store the bytes in Redis after that. 2) The object is produced for analytics by an 'ETL'-like process and isn't infinite. Currently it's only 30-50 MB and should be 500 MB in 4 years. It's just this one object. Since Redis only stores values up to 512 MB, I could split the DataFrame into chunks later on. @JoeW – zacko Jun 17 '22 at 14:30
  • Sounds like something to put in an object store (e.g. S3). You can use a CDN if you want it cached, signing the data for private access. – Ben Manes Jun 17 '22 at 17:40
  • My problem is that it takes too long to read a file that's 300 MB from Redis. I can imagine that an external service like S3 will take even longer to load (0.6 s for Redis on localhost, with both the app and Redis on the same machine). @BenManes – zacko Jun 17 '22 at 17:45
  • If you use S3/CDN then it could download chunks in parallel, by splitting it into byte offsets. I'm not proficient in Python, but this [answer](https://stackoverflow.com/questions/58571343/downloading-a-large-file-in-parts-using-multiple-parallel-threads) might be a good lead. – Ben Manes Jun 17 '22 at 17:56
  • That's a different problem. He has a CPU bottleneck for loading data into memory; I apparently have a bandwidth bottleneck. – zacko Jun 17 '22 at 18:12
  • Have you compared the speed to just reading the file from disk as an initial test? Shared memory (shm) would be an option, where all your applications can access the same segment of memory directly, without any serialization or deserialization taking place: https://docs.python.org/3/library/multiprocessing.shared_memory.html (see the first sketch below the comments). It also depends on a few other things, such as how often the dataset is updated: you only need to do the 0.6 s read from the authoritative source _when it gets updated_, not every time you need to use/read it. – MatsLindh Jun 17 '22 at 19:31
  • @MatsLindh Yes, I am currently looking into the IPC format of Apache Arrow with memory mapping, or Plasma, which is shared memory for multiple applications, without having to pull it out of Redis at all (if the applications live on the same machine; see the second sketch below the comments). – zacko Jun 18 '22 at 19:14
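
Following up on the shared-memory suggestion in the comments, here is a minimal sketch of that approach, reusing the serialized Arrow bytes from the sketch in the question. The segment name and the `publish`/`load` helpers are made up for illustration: one process creates the segment per update, and every other app on the same machine attaches to it by name instead of going through Redis.

```python
from multiprocessing import shared_memory

import pyarrow as pa

SEGMENT_NAME = "big_df_shm"  # hypothetical name; anything unique on the host works

def publish(payload: bytes) -> shared_memory.SharedMemory:
    """Producer: copy the serialized Arrow bytes into a named shared-memory segment."""
    shm = shared_memory.SharedMemory(name=SEGMENT_NAME, create=True, size=len(payload))
    shm.buf[:len(payload)] = payload
    return shm  # keep this handle alive; close() and unlink() it when the data is retired

def load() -> pa.Table:
    """Consumer: attach to the segment by name and deserialize, no socket involved."""
    shm = shared_memory.SharedMemory(name=SEGMENT_NAME)
    try:
        # The OS may round the segment up past len(payload); the Arrow stream
        # reader stops at its end-of-stream marker, so trailing padding is ignored.
        return pa.ipc.open_stream(bytes(shm.buf)).read_all()
    finally:
        shm.close()
```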
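
And a minimal sketch of the memory-mapped Arrow IPC file mentioned in the last comment, following the pattern from the pyarrow IPC documentation: the ETL process rewrites the file once per update, and each app maps it read-only, so the OS page cache is shared and no 300 MB copy goes over a socket on every read. The path is an example.

```python
import pyarrow as pa

PATH = "/tmp/big_df.arrow"  # example path; anything readable by all the apps

def write_ipc_file(table: pa.Table) -> None:
    """ETL side: write the table to an Arrow IPC file, once per update."""
    with pa.OSFile(PATH, "wb") as sink:
        with pa.ipc.new_file(sink, table.schema) as writer:
            writer.write_table(table)

def read_ipc_file() -> pa.Table:
    """App side: memory-map the file and read it without copying the whole blob."""
    with pa.memory_map(PATH, "rb") as source:
        return pa.ipc.open_file(source).read_all()
```

Note that Plasma has been deprecated and removed from recent pyarrow releases, so the plain IPC file with memory mapping is the simpler of the two routes to try first.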

0 Answers