Say I am continuously generating new data (e.g. integers) and want to collect them in a list.

import random

lst = []
for _ in range(50):
    num = random.randint(0, 10)
    lst.append(num)

When a new value is generated, I want it to be positioned in the list based on the count of occurrences of that value, so data with lower "current occurrence" should be placed before those with higher "current occurrence".

"Current occurrence" means "the number of copies of that value that have been collected so far, up to and including this iteration". Values with the same occurrence should keep the order in which they were generated.

For example, if at iteration 10 the current list is [1,2,3,4,2,3,4,3,4] and a new value 1 is generated, it should be inserted at index 7, resulting in [1,2,3,4,2,3,4,1,3,4]. Because it is the second occurrence of 1, it should be placed after all the first occurrences, and also after the existing second occurrences of 2, 3 and 4 (preserving generation order), but before any third occurrence.
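To make the rule concrete, here is a minimal brute-force sketch that computes the target index for a new value. It assumes the list is already arranged by occurrence as described above; `insertion_index` is a hypothetical helper name used only for illustration, and it scans the whole list, so it shows the rule rather than an efficient solution.

```python
from collections import Counter


def insertion_index(lst, value):
    """Return the index where `value` belongs in an already-arranged list.

    Illustrative helper (not from the original post): O(n) per call.
    """
    # Occurrence number the new value will have once inserted.
    k = lst.count(value) + 1
    seen = Counter()
    idx = 0
    for x in lst:
        seen[x] += 1
        # Every existing item whose occurrence number is <= k stays in front.
        if seen[x] <= k:
            idx += 1
    return idx


current = [1, 2, 3, 4, 2, 3, 4, 3, 4]
pos = insertion_index(current, 1)
current.insert(pos, 1)
print(pos, current)  # 7 [1, 2, 3, 4, 2, 3, 4, 1, 3, 4]
```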


This is my current code that can rearrange the list:

from collections import defaultdict

def rearrange(lst):
    d = defaultdict(list)
    count = defaultdict(int)
    for x in lst:
        count[x] += 1
        d[count[x]].append(x)
    res = []
    for k in sorted(d.keys()):
        res += d[k]
    return res

lst = rearrange(lst)

This produces the expected order, but only by rebuilding the whole list from scratch, which is not how I want it to work.

I wrote a separate algorithm that keeps generating new data until some convergence criterion is met, where the list has the potential to become extremely large.

Therefore I want to rearrange my generated values on the fly, i.e. to keep inserting data into the list "in place". Of course I could call my rearrange function in each iteration, but that would be very inefficient. What I want is to insert each new value at its correct position in the list, not to replace the list with a new one in each iteration.

Any suggestions?

Edit: the data structure doesn't necessarily need to be a list, but it has to be ordered, and doesn't require another data structure to hold information.

Rodrigo Rodrigues
Shaun Han
  • Not sure I understand... How about a `dict` with an integer key and an integer count value? – Fiddling Bits Oct 07 '22 at 18:26
  • @FiddlingBits That's exactly used in my code – Shaun Han Oct 07 '22 at 18:42
  • So every time you add an element to the list and its count changes, your algorithm would have to pop and insert every single matching element and move it. Is this what you are asking for? – Alexander Oct 07 '22 at 18:44
  • What are you trying to achieve? I think you may just need a [`Counter`](https://docs.python.org/3/library/collections.html) – Rodrigo Rodrigues Oct 07 '22 at 18:46
  • @RodrigoRodrigues I added an example. Hope it is clear now. – Shaun Han Oct 07 '22 at 18:56
  • If your goal is to have a list that grows vaguely efficiently 'on the fly', the regular list isn't gonna help you. Consider: when you 'insert' at an index, you are actually moving everything *after* that index. [Worst case, O(n).](https://wiki.python.org/moin/TimeComplexity) You probably need a (doubly) linked list instead. Even there, given your algorithm, it's tough to see how you update the list without making a complete pass over it every time unless there is a second data structure holding information - in which case, is reconstructing the list that much more inefficient? – Nathaniel Ford Oct 07 '22 at 19:04
  • The more I think about it, the more I'm convinced you need to either a) do a complete pass over the list so you know if, at a given index, there are any numbers after that point that weren't in the list (at the correct occurrence rate) prior to that point, so that you can know whether your new number can be inserted at that index legally or b) hold information in a separate structure that 'caches' that information in a better manner. And if you use a basic list, you're doing a full `O(n)` pass every time at the least. – Nathaniel Ford Oct 07 '22 at 19:11
  • @NathanielFord A different data structure (e.g. deque) is fine if it could work. – Shaun Han Oct 07 '22 at 19:22

1 Answer

The data structure I think that might work better for your purpose is a forest (in this case, a disjoint union of lists).

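In summary, you keep one internal list per occurrence number: the first list holds every first occurrence, the second list every second occurrence, and so on. When a new value arrives, you append it to the list that comes right after the one that received the previous occurrence of that value.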

In order to keep track of the counts of occurrences, you can use a built-in Counter.

Here is a sample implementation:

from collections import Counter

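def rearranged(iterable):
  # forest[k] collects every (k+1)-th occurrence, in generation order
  forest, counter = list(), Counter()
  for x in iterable:
    c = counter[x]  # how many times x has been seen before
    if c == len(forest):
      # first value to reach this occurrence number: open a new list
      forest.append([x])
    else:
      forest[c].append(x)
    counter[x] += 1
  # flatten the lists in order of occurrence number
  return [x for lst in forest for x in lst]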

rearranged([1,2,3,4,2,3,4,3,4,1])
# [1, 2, 3, 4, 2, 3, 4, 1, 3, 4]

For this to work better, your input iterable should be a generator, so the items can be consumed as they are generated instead of being stored in a list first.
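If the values need to be added one at a time while the generating loop is still running, the same idea can be wrapped in a small container. The following is only a sketch under that assumption: the class name `OccurrenceOrdered` and its `add` method are hypothetical names chosen for illustration, not part of any library. Each `add` is an amortized O(1) append, and the flat order is produced only when the container is iterated.

```python
from collections import Counter
from itertools import chain


class OccurrenceOrdered:
    """Hypothetical container that keeps values grouped by occurrence number."""

    def __init__(self):
        # _levels[k] holds every (k+1)-th occurrence, in arrival order.
        self._levels = []
        # How many times each value has been added so far.
        self._counts = Counter()

    def add(self, value):
        k = self._counts[value]
        if k == len(self._levels):
            # First value to reach this occurrence number: open a new level.
            self._levels.append([])
        self._levels[k].append(value)
        self._counts[value] += 1

    def __iter__(self):
        # Flatten lazily: first occurrences, then second occurrences, ...
        return chain.from_iterable(self._levels)

    def __len__(self):
        return sum(len(level) for level in self._levels)


acc = OccurrenceOrdered()
for v in [1, 2, 3, 4, 2, 3, 4, 3, 4]:
    acc.add(v)
acc.add(1)  # second occurrence of 1
print(list(acc))  # [1, 2, 3, 4, 2, 3, 4, 1, 3, 4]
```

The trade-off is that the container is not a flat list: positional indexing would need extra work, but nothing is ever shifted or rebuilt when a value is added.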

Rodrigo Rodrigues