1

I have a list like

words = ["test", "secondTest", "thirdTest"]

and I have a function to find for every position, the most occurring element through the list, e.g. the most occurring element in first position is 't', in second is 'e' and so on. The problem is that it takes very long time, especially for very long lists. Here's my function:

def mostFrequent(lst: list[str]) -> str:
    dic = {}
    count, itm = 0, ''
    for item in reversed(sorted(lst)):
        dic[item] = dic.get(item, 0) + 1
        if dic[item] >= count:
            count, itm = dic[item], item
    return (itm)


def most_frequent_chars() -> str:
    words = ["test", "secondTest", "thirdTest"]
    maxOccurs = ""
    listOfChars = []
    for i in range(len(max(words, key=len))):
        for item in words:
            try:
                listOfChars.append(item[i])
            except IndexError:
                pass

        maxOccurs += mostFrequent(listOfChars)
        listOfChars.clear()
    return maxOccurs

I'm using another function because before I was just using a one-liner max() but it'd take a lot longer than this, so I swapped with a dictionary method, and I'm looking for making this a lot more efficient, I cannot use any import.

MattMlgn
  • 33
  • 7

1 Answers1

0

One liner

[collections.Counter(k for k in x if k).most_common(1)[0][0] for x in itertools.zip_longest(*words)]

itertools.zip_longest is like zip, except that it works even with list of different sizes, returning None for list that are overs

So ((k for k in x if k) for x in itertools.zip_longest(*words)) iterates through iterators that themselves iterates through letters. I mean, the first iterator goes 't', 's', 't', the second 'e', 'e', 'h'.

It is clearer with lists (so [ instead of ():

[[k for k in x if k] for x in itertools.zip_longest(*words)]
#[['t', 's', 't'], ['e', 'e', 'h'], ['s', 'c', 'i'], ['t', 'o', 'r'], ['n', 'd'], ['d', 'T'], ['T', 'e'], ['e', 's'], ['s', 't'], ['t']]

collections.Counter(['t','s','t']) counts occurrences in the iterable (here a list, but it could be any iterable, including an iterator), and returns Counter({'t': 2, 's': 1})

collections.Counter(['t','s','t']).most_common(1) only returns the 1 most common. In a form of a list, because there could be a tie. Here [('t', 2)]

So, assuming that if there is a tie, we just want any of the winners (you haven't specified otherwise), collections.Counter(['t','s','t']).most_common(1)[0] is that winner. Here ('t',2).

And collections.Counter(['t','s','t']).most_common(1)[0][0] is just the letter. Here 't'.

So, applying that on all triplets given by zip_longest (itertools.zip_longest(*words)), once filtered None out (see before), that gives my one liner [collections.Counter(k for k in x if k).most_common(1)[0][0] for x in itertools.zip_longest(*words)] that returns ['t', 'e', 's', 't', 'n', 'd', 'T', 'e', 's', 't']

Note that this is an iterator. Or at list, it would be, replacing outer [...] by (...). All the way. Using subiterators. But no where using any lists or dicts (just the counter, over 3 letters at most. Or len(words) to be accurate).

So that means that you could have 3 generators of 10 billions letters words, and apply this one-liner on them, and it would return a generator of 10 billions letters, without allocating more than a few bytes memory.

(Of course that advantage is less interesting if your words are in memory anyway, like with strings. Or if your intent is to build a list of the result, as I did enclosing my iterator in a compound list. But yet, it is always better, when you have, essentially, a sequence of "what is the most frequent among those 3" computation to do, to avoid having big structure to allocate)

chrslg
  • 9,023
  • 5
  • 17
  • 31