4

I am trying to print out the key-value pairs in the same order as they are displayed in the OrderedCounter output.

from collections import Counter, OrderedDict

class OrderedCounter(Counter, OrderedDict):
    pass

c = OrderedCounter('supernatural')
print c

I get the following output:

OrderedCounter({'u': 2, 'r': 2, 'a': 2, 's': 1, 'p': 1, 'e': 1, 'n': 1, 't': 1, 'l': 1})

Is there a way to print only the first key-value pair?

I am basically trying to print the first repeated character in a given string.

Vishwak

4 Answers

5

The problem is that, because you don't override __repr__, the one from the first superclass is used, and that is Counter. Counter's representation lists the items sorted by count in descending order. Because you also subclass OrderedDict and sorted is stable, items with equal counts keep their insertion order, which makes it appear that "u" is the first element.

However, Counter doesn't provide an __iter__ method, so you'll use the __iter__ of OrderedDict, which simply keeps the insertion order:

>>> next(iter(c.items()))
('s', 1)
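To see the two orderings side by side, here is a minimal sketch. It is written for Python 3 (so print is a function), and the variable names insertion_order and count_order are only illustrative. It compares the iteration order inherited from OrderedDict with the count-sorted order that most_common() produces.

```python
from collections import Counter, OrderedDict


class OrderedCounter(Counter, OrderedDict):
    """Counter that remembers the order in which keys were first seen."""


c = OrderedCounter('supernatural')

# Iteration follows insertion order (first occurrence of each character).
insertion_order = list(c.items())

# most_common() sorts by count, descending; ties keep insertion order
# because the underlying sort is stable.
count_order = c.most_common()

print(insertion_order[0])  # ('s', 1)
print(count_order[0])      # ('u', 2)
```

On Python 3.7 and later a plain Counter also iterates in insertion order, because regular dicts preserve it.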

To get the first repeated character, simply use a generator expression:

>>> next((key, value) for key, value in c.items() if value > 1)
('u', 2)

(With Python 2 you probably want to use iteritems() instead of items().)
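One caveat: next() raises StopIteration when no character repeats. A minimal sketch of a safer variant follows, assuming Python 3.7+ (where a plain Counter keeps insertion order). The helper name first_repeated_by_count is hypothetical and not part of the original answer.

```python
from collections import Counter


def first_repeated_by_count(s):
    """Return (char, count) for the first character in s that occurs
    more than once, or None if every character is unique."""
    counts = Counter(s)  # assumption: Python 3.7+, insertion-ordered
    # The second argument to next() is returned instead of raising
    # StopIteration when the generator is empty.
    return next(((ch, n) for ch, n in counts.items() if n > 1), None)


print(first_repeated_by_count('supernatural'))  # ('u', 2)
print(first_repeated_by_count('abc'))           # None
```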

To print the first most common value you can use the Counter.most_common method:

>>> c.most_common(1)
[('u', 2)]
MSeifert
  • Thanks @MSeifert. This is exactly what I was looking for. I guess my OrderedCounter approach was wrong for the problem I was trying to solve. – Vishwak Feb 15 '17 at 11:33
2

You don't need Counter or OrderedDict for this task. Here is an optimized approach (for a string of length n, the complexity is O(n)):

In [35]: def first_repeated(s):
   ....:     seen = set()
   ....:     for i, j in enumerate(s):
   ....:         if j in seen:  # membership check in a set is O(1)
   ....:             # one earlier occurrence + this one + any remaining ones
   ....:             return j, s.count(j, i + 1) + 2
   ....:         seen.add(j)
   ....:

In [36]: first_repeated(s)
Out[36]: ('u', 2)
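Note that "first repeated character" can be read in two ways, and the set-based and Counter-based approaches do not always agree. The sketch below (Python 3; both functions return None when nothing repeats, which is an addition for illustration) shows a case where they differ: the set-based version returns the character whose second occurrence comes earliest, while the Counter-based version returns the earliest character that repeats anywhere.

```python
from collections import Counter


def first_repeated(s):
    """Character whose *second* occurrence appears earliest."""
    seen = set()
    for i, ch in enumerate(s):
        if ch in seen:
            # one earlier occurrence + this one + any remaining ones
            return ch, s.count(ch, i + 1) + 2
        seen.add(ch)
    return None


def counter_based(s):
    """Earliest character (by first occurrence) that occurs more than once.
    Assumes Python 3.7+, where Counter keeps insertion order."""
    counts = Counter(s)
    return next(((ch, n) for ch, n in counts.items() if n > 1), None)


print(first_repeated('supernatural'), counter_based('supernatural'))
print(first_repeated('abba'), counter_based('abba'))
```

For 'supernatural' both give ('u', 2), but for 'abba' the first gives ('b', 2) and the second gives ('a', 2), so pick the one that matches the meaning you need.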

Here is a benchmark against the Counter-based approach from the other answer; in this run, this method is about 3 times faster:

In [39]: def counter_based(s):
   ....:     c = Counter(s)
   ....:     return next(key for key in c if c[key] > 1)
   ....: 

In [40]: %timeit counter_based(s)
100000 loops, best of 3: 5.09 us per loop

In [41]: %timeit first_repeated(s)
1000000 loops, best of 3: 1.71 us per loop
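If you want to reproduce the comparison outside IPython, a rough sketch with the standard timeit module could look like this. The helper best_time is hypothetical, and the numbers you get will depend on your machine and Python version, so none are claimed here.

```python
import timeit
from collections import Counter


def first_repeated(s):
    seen = set()
    for i, ch in enumerate(s):
        if ch in seen:
            return ch, s.count(ch, i + 1) + 2
        seen.add(ch)
    return None


def counter_based(s):
    c = Counter(s)
    return next((key for key in c if c[key] > 1), None)


def best_time(func, s, number=10000, repeat=3):
    """Best per-call time in seconds over `repeat` runs of `number` calls."""
    runs = timeit.repeat(lambda: func(s), number=number, repeat=repeat)
    return min(runs) / number


s = 'supernatural'
for func in (counter_based, first_repeated):
    print(func.__name__, best_time(func, s))
```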

You can also do this task even faster using a suffix tree, especially if you want to perform it on a large amount of data. Here is my own optimized implementation of this algorithm on GitHub; the repository also has documentation and useful links if you are not familiar with this data structure and the algorithm: https://github.com/kasramvd/SuffixTree

As another linear-time answer using str.count within a generator expression, you can use the following approach suggested by @Stefan Pochmann:

next((c, s.count(c)) for c in s if s.count(c) > 1)
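For reference, here is a small sketch of how str.count behaves, including the optional start argument that first_repeated relies on. It is Python 3; the function name naive follows the one used in the comments, and the None default is an addition for illustration.

```python
s = 'supernatural'

# Count over the whole string.
total = s.count('r')

# Count only from index 5 onward, without building a slice first.
from_index = s.count('r', 5)
via_slice = s[5:].count('r')  # same result, but copies the tail of s

print(total, from_index, via_slice)  # 2 1 1


def naive(s):
    """One-liner from above wrapped in a function; returns None when
    no character repeats instead of raising StopIteration."""
    return next(((c, s.count(c)) for c in s if s.count(c) > 1), None)


print(naive(s))  # ('u', 2)
```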
Mazdak
  • Both approaches are `O(n)` it's just that `str.count` is much faster than `iterating` over the string manually or with `Counter`. – MSeifert Feb 15 '17 at 11:39
  • What time do you get for `next((c, s.count(c)) for c in s if s.count(c) > 1)`? – Stefan Pochmann Feb 15 '17 at 11:42
  • @MSeifert Yes, but it's not because of `counter`; it's because you are looping over the string 2 times, once when creating the counter object and once when finding the expected character. Also, this approach doesn't count from the start of the string; it counts from the index of the last find to the end. The indexing takes time as well, but it's much cheaper than counting. – Mazdak Feb 15 '17 at 11:42
  • @StefanPochmann I didn't time that, because it's part of the answer. If you want a fair timing, you need to include the Counter creation as well. – Mazdak Feb 15 '17 at 11:44
  • @Kasramvd It's not using a Counter. Here's the whole thing as a function: `def naive(s): return next((c, s.count(c)) for c in s if s.count(c) > 1)`. – Stefan Pochmann Feb 15 '17 at 11:46
  • @StefanPochmann Yeah, I assumed so because of `c`. This might be slightly faster than my answer for short strings but not for larger ones. Still, it depends on internal optimizations that Python performs. For example, if it caches `s.count()` within a generator expression so that it doesn't calculate it twice, it might be faster in general. – Mazdak Feb 15 '17 at 11:47
  • @Kasramvd 'c' as in 'character' :-). Yours might be faster without the `s[i + 1:]` optimization attempt, btw. At least for me and at least for that test string, that optimization attempt slows it down by a factor of about 1.25. – Stefan Pochmann Feb 15 '17 at 11:51
  • @StefanPochmann Indeed, using only `s.count` without `s[i + 1:]` might be slightly faster, but only in the best cases; in the worst cases `s[i + 1:]` is slightly better. Although in a low-level language like `C`, since `s[i + 1:]` runs in O(1), it would be the best way to go ;). – Mazdak Feb 15 '17 at 11:55
  • @Kasramvd I don't believe that that optimization is better in worst cases. You realize that without it, you also don't need the `enumerate` and the extra variable anymore, right? – Stefan Pochmann Feb 15 '17 at 12:03
  • The [`str.count`](https://docs.python.org/3/library/stdtypes.html#str.count) method even has a start and stop parameter. No need to slice! – MSeifert Feb 15 '17 at 12:05
  • @StefanPochmann Yeah that's a good point. But I'm not sure still! – Mazdak Feb 15 '17 at 12:06
  • @MSeifert Indeed, How could I miss that, thanks for reminding! – Mazdak Feb 15 '17 at 12:06
  • 1
    @Kasramvd Well with that optimization, you're paying extra for every single one of those first characters until the duplicate, and you're only saving a char-in-string count for just as many characters. I think the char-in-string count is faster, probably running [this](https://github.com/python/cpython/blob/master/Objects/stringlib/fastsearch.h#L152). Which is simple C (and likely even benefits from that string prefix still being in the cache). – Stefan Pochmann Feb 15 '17 at 12:16
  • @StefanPochmann Yes. The main difference is actually because of the abstraction overhead. Even if you have an algorithmically faster method (not significantly faster) written in Python, it's still slower than the algorithmically slower one that runs in C, exactly like these built-in methods. – Mazdak Feb 15 '17 at 12:59
-1

From what I understand, I think you are looking for something like this:

print c.most_common()[0]

This gives output ('u', 2)

Ajay Gupta
  • 3
    Pretty sure this wouldn't work - based on the asker's description of 'trying to print the first repeated character in a given string'. If the string was, for example, `'aaxxxxxxxxxxxxx'`, your method would return `x` when the asker desires `a`. – asongtoruin Feb 15 '17 at 11:25
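The counterexample from the comment can be checked with a short sketch (Python 3.7+ assumed, so that a plain Counter iterates in insertion order; the variable names are illustrative):

```python
from collections import Counter

s = 'aaxxxxxxxxxxxxx'
c = Counter(s)

# Highest count overall: this is 'x', not the first repeated character.
most_common_pair = c.most_common()[0]

# First character (in insertion order) that occurs more than once: 'a'.
first_repeated_pair = next(((ch, n) for ch, n in c.items() if n > 1), None)

print(most_common_pair[0])    # x
print(first_repeated_pair)    # ('a', 2)
```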
-1

If you need the counter somewhere down the line, it's possible to filter and sort it to get what you're looking for:

from collections import Counter

input_string = 'supernatural'
c = Counter(input_string)
print sorted((pair for pair in c.items() if pair[1]>1), key=lambda x: input_string.index(x[0]))[0]

We filter the counter to only return letters that appear more than once, sort it according to its position in the input string, and return the first pair we find. Hence, this prints ('u', 2)
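Since only the first pair is needed, sorting the whole filtered list is not strictly required. As a sketch of an alternative (Python 3; the names repeated and first are illustrative), min with the same key gives the same pair:

```python
from collections import Counter

input_string = 'supernatural'
c = Counter(input_string)

# Keep only characters that occur more than once.
repeated = [pair for pair in c.items() if pair[1] > 1]

# Pick the pair whose character appears earliest in the input string.
# default=None avoids a ValueError when nothing repeats.
first = min(repeated, key=lambda pair: input_string.index(pair[0]), default=None)

print(first)  # ('u', 2)
```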

asongtoruin