I am using a piece of itertools code (thanks SO!) that looks like:

from itertools import combinations, product

# Break down into selectable sub-groups by unit name
groups = {k: [b for b in blobs if b.Unit == k] for k in ['A', 'B', 'C', 'D', 'E', 'F']}
# Special treatment for unit F: expand to combination chunks of length 3
groups['F'] = combinations(groups['F'], 3)
# Create the list of all combinations
selected = list(product(*groups.values()))

The problem is that my blobs list above contains about 400 items, which means that the resulting list object, selected, would have trillions & trillions of possible combinations (something like 15x15x15x15x15x15x15x15x15). I'm not new to programming, but I am new to working with large data sets. What kind of hardware should I be looking for to handle itertools like this? Are there any reasonably affordable machines that can handle this type of thing? I've obviously taken my Python skills beyond my trusty iMac . . .

Unknown Coder
  • I think that for so much data you'll want to use something other than Python, such as a C program, which would be far more efficient. –  Apr 01 '14 at 00:54
  • This is not a good problem for pure Python. You may want to look at writing some extensions to do the hard work in a more efficient and speedy manner. – ebarr Apr 01 '14 at 00:56
  • What is it that you're planning to do with your trillions of items? If you want to iterate over them for individual processing, `itertools.product` is probably actually pretty efficient (don't wrap it in a `list`, just use the iterator). But if you're doing anything time-consuming, it's likely to take a very long time regardless of what kind of system and software you use. – Blckknght Apr 01 '14 at 00:59
  • You can implement `combinations` and `product` on your own as generators that yield one element at a time. Then you can consume them without any problem. – thefourtheye Apr 01 '14 at 00:59
  • Point taken. But let's assume I want to stay with Python to meet client requirements. What then? – Unknown Coder Apr 01 '14 at 00:59
  • So you want to generate every way to pick (a unit from A, a unit from B, ... three units from F)? How do you decide which selection is best? If you can eliminate some possible combinations early, it could hugely reduce the final number of selections to be evaluated. – Hugh Bothwell Apr 01 '14 at 01:06
  • What is the question? C might be faster than Python, but Assembler is faster than both. You have working code, and the current problem (trillions of items) can be solved by removing the call to `list` and consuming the generator normally, thus removing the memory overhead. If you do build a supercomputer to tackle this problem, are you going to build one for your client, too? – Two-Bit Alchemist Apr 01 '14 at 01:06
  • @Two-BitAlchemist you'll have to forgive me because I am new to the itertools library, I literally just discovered it on SO today. And you're right, I did some more testing and the call to list is the main choke point. So how can I refactor the code to stay within itertools? My next step is to filter based on aggregate values of each iteration. – Unknown Coder Apr 01 '14 at 01:14
  • Maybe you want to look at numpy? – Linuxios Apr 01 '14 at 01:20
  • The key insight into itertools is **lazy evaluation**. Instead of a list (a high-level overview) you are working with a sequence of objects one at a time, performing a calculation, and moving on. Since you're only ever working on one of them (and you know how to reliably get the next one), you'll never have to worry about storing trillions of items in your RAM. Your call to `list` tries to do exactly that. – Two-Bit Alchemist Apr 01 '14 at 01:21
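
A quick way to see the lazy-evaluation point the commenters are making (the sizes below are arbitrary, chosen only so the nominal number of tuples is huge):

from itertools import product

# Creating the product object is effectively free, even though it
# nominally represents 1000**4 = one trillion tuples.
lazy = product(range(1000), repeat=4)

# Tuples are only produced when you ask for them, one at a time.
print(next(lazy))   # (0, 0, 0, 0)
print(next(lazy))   # (0, 0, 0, 1)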

1 Answer


If you can forgo materializing the list (that is, drop the `list()` call and simply iterate over the results), the rest of the code will work as-is on any computer. That's the power of generators.
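
For example, here is a minimal, self-contained sketch of that idea. The `Blob` namedtuple, its `Value` field, and the scoring rule are stand-ins for whatever your real blobs carry; only the shape of the loop matters:

from collections import namedtuple
from itertools import combinations, product

# Stand-in data so the sketch runs on its own.
Blob = namedtuple('Blob', ['Unit', 'Value'])
blobs = [Blob(unit, i) for i, unit in enumerate('ABCDEF' * 3)]

groups = {k: [b for b in blobs if b.Unit == k] for k in 'ABCDEF'}
groups['F'] = combinations(groups['F'], 3)

# No list() call: product() yields one selection at a time, so memory
# use stays flat no matter how many selections exist in total.
best = None
for a, b, c, d, e, f_triple in product(*groups.values()):
    total = sum(x.Value for x in (a, b, c, d, e)) + sum(x.Value for x in f_triple)
    # Hypothetical aggregate filter: keep only the best selection seen
    # so far instead of storing them all.
    if best is None or total > best[0]:
        best = (total, (a, b, c, d, e, f_triple))

print(best)

Because each selection is scored and then discarded, the full cross-product is never held in memory at once.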

Benji York