
Hi, I am fairly new to Python. I am trying to generate the powerset (all combinations) of a list of integers, using the recommended recipe:

from itertools import chain, combinations

def powerset(iterable):
    s = list(iterable)
    return chain.from_iterable(combinations(s, r) for r in range(len(s)+1))

My list of integers comes in a numpy array from a pandas dataframe. Each int32 integer costs 48 bytes as a Python object (not quite sure why so much). Thus, as the list of integers grows, it starts placing significant demands on RAM (e.g. with 24 integers, at some point the materialized list is about 800 MB in size).
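For what it's worth, the per-object overhead can be inspected directly; the sketch below assumes a 64-bit CPython build, where a small int object is 28 bytes (the 48 bytes I measured presumably includes additional per-element overhead):

import sys

x = 1_000_000            # a plain Python int
print(sys.getsizeof(x))  # 28 on a typical 64-bit CPython build

# A list of n such ints also stores one 8-byte pointer per element,
# so each element costs roughly 28 + 8 = 36 bytes before any powerset math.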

Is there a way around this? How would one manage the memory efficiently if, say, you wanted to generate the powerset of 50 integers or more?
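For concreteness, here is a minimal sketch of what I understand lazy consumption to look like -- iterating the generator directly rather than materializing it with list() (the per-subset work and the input size of 20 are placeholders):

from itertools import chain, combinations

def powerset(iterable):
    # Yield every subset as a tuple, one at a time.
    s = list(iterable)
    return chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))

# Consume the generator directly -- never wrap it in list().
# Peak memory is one subset at a time, not 2**n of them.
count = 0
for subset in powerset(range(20)):
    count += 1            # replace with the real per-subset work
print(count)              # 2**20 == 1048576

(If the input is a numpy array, arr.tolist() first converts the elements to plain Python ints.)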

Thank you for any answers / pointers in advance.

TvelA
  • Don't materialize a giant list of integers. – juanpa.arrivillaga Jan 30 '21 at 22:13
  • Note, Python objects are relatively heavyweight. But *even* assuming each one cost 1 byte (it doesn't), the powerset of a set of size 50 has size 2**50 ≈ 1.1e15, which is `(2**50)*1e-9` -> about 1.1e6 *gigabytes*. So even if you don't materialize the list and just lazily process it, something that large is going to take a long time. – juanpa.arrivillaga Jan 30 '21 at 22:18
  • So that's about a *petabyte*, and again, we know it doesn't cost 1 byte; it actually costs *8 bytes per pointer to the object, plus 48 bytes for the object*, so it really costs 56 bytes. Even at a petabyte you are in the realm of truly big data; you'd need some sort of cluster of machines with some distributed computing approach. – juanpa.arrivillaga Jan 30 '21 at 22:27
  • This isn't just a *memory* issue. I do not think you are really grasping the magnitude of what you want. Imagine the processing of each element took 1 nanosecond (it almost certainly takes orders of magnitude more); then processing it serially, without some sort of giant cluster/distributed computing approach, would take [*over 20 million years*](https://www.wolframalpha.com/input/?i=3**50+nanoseconds). – juanpa.arrivillaga Feb 01 '21 at 17:33
  • Thanks, I am fully cognizant of the memory issue. However, ultimately I need to work out a function that uses the powerset of the integers as the x-axis and the corresponding probabilities as the y-axis. I guess I can decompose the powerset into smaller chunks that are then assembled into the full powerset using cartesian products from a file (see the sketch I added after these comments). I wonder what that does to the speed of the routine, though. – TvelA Feb 01 '21 at 17:38
  • You are not getting it. – juanpa.arrivillaga Feb 01 '21 at 17:39
  • I wonder how the statistical function is then estimated without using brute force – TvelA Feb 01 '21 at 17:39
  • That's a good question. Probably what you should be looking into. There are other, more math/statistics-oriented Stack Exchange sites that may be more helpful in that regard. – juanpa.arrivillaga Feb 01 '21 at 17:40
  • Anyhow, thanks for being there to bounce ideas against. – TvelA Feb 01 '21 at 17:46
  • So, at a high level, are you trying (in some vague sense) to construct a function P that, given a subset S' of S, will return the probability that S' occurs? How do you establish the probabilities in the first place? – Tim Boddy Feb 01 '21 at 22:40
  • individual outcome probabilities are known / part of the input in my model – TvelA Feb 02 '21 at 16:26
  • Can you clarify a bit more what the inputs look like (including the probabilities) and what your function takes as input and must supply as output? – Tim Boddy Feb 02 '21 at 19:29
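Edit (following up on the chunking idea in the comments above): here is a minimal sketch of processing the powerset stream in bounded-size batches with itertools.islice. The batch size, input size, and per-batch work are placeholders; this bounds memory, but does nothing about the 2**n running time:

from itertools import chain, combinations, islice

def powerset(iterable):
    s = list(iterable)
    return chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))

def batches(gen, size):
    # Slice a generator into lists of at most `size` items.
    it = iter(gen)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

# Hypothetical per-batch work: write to disk, update running statistics, etc.
for chunk in batches(powerset(range(20)), 100_000):
    pass  # peak memory is ~one chunk of subsets, never the full powerset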

0 Answers