3

I have an array and I want to count the occurrence of each item in the array.

I have managed to use a map function to produce a list of tuples.

def mapper(a):
    return (a, 1)

r = list(map(lambda a: mapper(a), arr))

# output example:
# (11817685, 1), (2014036792, 1), (2014047115, 1), (11817685, 1)

I'm expecting the reduce function to help me group counts by the first number (the id) in each tuple. For example:

(11817685, 2), (2014036792, 1), (2014047115, 1)

I tried

cnt = reduce(lambda a, b: a + b, r)

and some other ways, but none of them do the trick.
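For reference, here's why that attempt doesn't group anything: `+` on tuples concatenates them, so the accumulator just flattens into one long tuple (a small sketch using the ids above):

```python
from functools import reduce

r = [(11817685, 1), (2014036792, 1), (2014047115, 1), (11817685, 1)]

# + on tuples concatenates, so reduce just flattens all the pairs:
cnt = reduce(lambda a, b: a + b, r)
print(cnt)  # (11817685, 1, 2014036792, 1, 2014047115, 1, 11817685, 1)
```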

NOTE: Thanks for all the advice on other ways to solve the problem, but I'm just learning Python and how to implement a map-reduce here. I have simplified my real business problem a lot to make it easy to understand, so please kindly show me a correct way of doing map-reduce.

Lee
  • 5
    `lambda a: mapper(a)`? Why not just pass `mapper`? Also: what is your expected output? – internet_user Dec 13 '17 at 02:43
  • Thanks for commenting. Yes, I can just pass in mapper directly, was experimenting something else. Have added my expected output. – Lee Dec 13 '17 at 02:47
  • Do you need `r` or is it just an intermediary? – internet_user Dec 13 '17 at 02:48
  • Just intermediary. – Lee Dec 13 '17 at 02:48
  • Use a dictionary. – juanpa.arrivillaga Dec 13 '17 at 02:49
  • That is not what [`reduce`](https://docs.python.org/3/library/functools.html#functools.reduce) does. Look into [collections.Counter](https://docs.python.org/3/library/collections.html#collections.Counter). – Galen Dec 13 '17 at 02:50
  • 2
    Neither `reduce` nor `map` really helps you here. This sort of task is why `collections.Counter` exists (and for more specialized cases where the inputs are already sorted, `itertools.groupby`). Map/Reduce strategies are for cases where you have many mappers in parallel feeding many reducers in parallel; blindly applying the same pattern to purely single-threaded code is wasteful (it's wasteful in Map/Reduce cases too, you just count on absurd levels of parallelism to make up for the overhead). – ShadowRanger Dec 13 '17 at 02:52
  • I think you're looking for a way to do `reduceByKey()` but I don't think that functionality exists using `reduce` alone. – pault Dec 13 '17 at 02:55
  • Take a look at [this post](https://stackoverflow.com/questions/29933189/reduce-by-key-in-python). – pault Dec 13 '17 at 02:59
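To make the suggestions above concrete, here is a small sketch (not from any one commenter) of both approaches: `collections.Counter` for the general case, and `itertools.groupby` when the input is already sorted:

```python
from collections import Counter
from itertools import groupby

arr = [11817685, 2014036792, 2014047115, 11817685]

# Counter does the counting in one call:
counts = list(Counter(arr).items())

# groupby needs sorted input; each group is a run of equal keys:
pairs = [(k, sum(1 for _ in g)) for k, g in groupby(sorted(arr))]
print(counts, pairs)
```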

4 Answers

5

You could use Counter:

from collections import Counter
arr = [11817685, 2014036792, 2014047115, 11817685]
counter = Counter(arr)
print(list(zip(counter.keys(), counter.values())))

EDIT:

As pointed out by @ShadowRanger, Counter has an items() method:

from collections import Counter
arr = [11817685, 2014036792, 2014047115, 11817685]
print(list(Counter(arr).items()))
scope
  • 1
    Why `zip` the `keys` and `values`? There's an `items` method that does that directly: `print counter.items()`, and a special-purpose method `most_common`, that shows you the results in order by frequency (with an optional limit on the number of results), e.g. `print counter.most_common()`. – ShadowRanger Dec 13 '17 at 02:55
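As that comment notes, `most_common` returns the pairs sorted by descending count (a quick sketch; in CPython 3.7+ ties keep insertion order):

```python
from collections import Counter

arr = [11817685, 2014036792, 2014047115, 11817685]
mc = Counter(arr).most_common()  # descending by count
print(mc)
```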
1

Instead of importing any module, you can do it with plain dict logic:

track = {}
for item in arr:
    if item not in track:
        track[item] = 1
    else:
        track[item] += 1

For these types of list problems there is a common pattern:

So suppose you have a list:

a=[(2006,1),(2007,4),(2008,9),(2006,5)]

And you want to convert this to a dict with the first element of each tuple as the key and the second element as the value, something like:

{2008: [9], 2006: [5], 2007: [4]}

But there is a catch: some keys appear more than once with different values, like (2006, 1) and (2006, 5). You want those values collected under a single key, so the expected output is:

{2008: [9], 2006: [1, 5], 2007: [4]}

For this type of problem, we first create a new dict and then follow this pattern:

if item[0] not in new_dict:
    new_dict[item[0]]=[item[1]]
else:
    new_dict[item[0]].append(item[1])

So we first check whether the key is already in the new dict; if it is, we append the duplicate key's value to its list:

Full code:

a=[(2006,1),(2007,4),(2008,9),(2006,5)]

new_dict={}

for item in a:
    if item[0] not in new_dict:
        new_dict[item[0]]=[item[1]]
    else:
        new_dict[item[0]].append(item[1])

print(new_dict)

output:

{2008: [9], 2006: [1, 5], 2007: [4]}
Aaditya Ura
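As a side note, the same check-then-append pattern can be written more compactly with `dict.setdefault` (a variation, not part of the answer above):

```python
a = [(2006, 1), (2007, 4), (2008, 9), (2006, 5)]

new_dict = {}
for key, value in a:
    # setdefault inserts an empty list the first time a key is seen
    new_dict.setdefault(key, []).append(value)

print(new_dict)  # {2006: [1, 5], 2007: [4], 2008: [9]}
```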
1

After writing my answer to a different question, I remembered this post and thought it would be helpful to write a similar answer here.

Here is a way to use reduce on your list to get the desired output.

from functools import reduce  # reduce lives in functools in Python 3

arr = [11817685, 2014036792, 2014047115, 11817685]

def mapper(a):
    return (a, 1)

def reducer(x, y):
    if isinstance(x, dict):
        ykey, yval = y
        if ykey not in x:
            x[ykey] = yval
        else:
            x[ykey] += yval
        return x
    else:
        xkey, xval = x
        ykey, yval = y
        a = {xkey: xval}
        if ykey in a:
            a[ykey] += yval
        else:
            a[ykey] = yval
        return a

mapred = reduce(reducer, map(mapper, arr))

print(list(mapred.items()))

Which prints:

[(11817685, 2), (2014036792, 1), (2014047115, 1)]

Please see the linked answer for a more detailed explanation.

pault
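A variation on the above (not pault's original code): passing an initial empty dict as reduce's third argument makes the accumulator a dict from the start, so the isinstance branch disappears:

```python
from functools import reduce

arr = [11817685, 2014036792, 2014047115, 11817685]

def reducer(acc, pair):
    key, val = pair
    acc[key] = acc.get(key, 0) + val  # accumulate the count for this key
    return acc

# The third argument seeds the accumulator, so reducer always sees a dict
counts = reduce(reducer, map(lambda a: (a, 1), arr), {})
print(counts)  # {11817685: 2, 2014036792: 1, 2014047115: 1}
```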
0

If all you need is cnt, then a dict would probably be better than a list of tuples here (if you need this format, just use dict.items).

The collections module has a useful data structure for this, a defaultdict.

from collections import defaultdict
cnt = defaultdict(int)  # the default value for a missing key is int(), i.e. 0
for key in arr:
    cnt[key] += 1  # a missing key is created with the default before += runs

# cnt_list = list(cnt.items())
internet_user