I am looking for a better way to count the array values than the one described below (GraphLab Create with Python).

import graphlab

labels = graphlab.SArray([-1, -1, 1, 1, 1])

plus_ones_count = list(labels).count(1)
# plus_ones_count outputs 3

minus_ones_count = list(labels).count(-1)
# minus_ones_count outputs 2

Thank you for any pointers or suggestions.

After additional experiments, filtering with a boolean mask, as in len(labels[labels == -1]), seems to do a better job for my requirement, where the range of desired values is small. For others' reference, here is the code I used to time the three approaches. If you know a better way of doing it, or any caveats, please let me know.

import graphlab
from random import randint
from collections import Counter

for data_set_size in [10, 100, 1000, 10000, 100000, 1000000]:
    labels = graphlab.SArray([randint(-1,1) for p in range(0, data_set_size)])
    print "Data set size: ", data_set_size

    %timeit -n 100 l = list(labels); l.count(-1), l.count(0), l.count(1)
    %timeit -n 100 len(labels[labels == -1]), len(labels[labels == 0]), len(labels[labels == 1])
    %timeit -n 100 label_count = Counter(labels); label_count.get(-1), label_count.get(0), label_count.get(1)
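
For readers outside IPython, where the %timeit magic is unavailable, the same kind of comparison can be sketched with the standard timeit module. This is only an illustration under stated assumptions: it times plain Python lists rather than a graphlab.SArray, so the numbers will not match the SArray results, and the helper names count_with_list_count and count_with_counter are hypothetical.

```python
import timeit
from collections import Counter
from random import randint


def count_with_list_count(values):
    # One full scan of the list per label (three scans in total).
    return values.count(-1), values.count(0), values.count(1)


def count_with_counter(values):
    # A single scan that tallies every distinct label.
    tally = Counter(values)
    return tally[-1], tally[0], tally[1]


# Hypothetical sample data: a plain list, not a graphlab.SArray.
sample = [randint(-1, 1) for _ in range(10000)]

for func in (count_with_list_count, count_with_counter):
    # Time 20 calls of each helper on the same data.
    seconds = timeit.timeit(lambda: func(sample), number=20)
    print("%s: %.4f s for 20 runs" % (func.__name__, seconds))
```

Both helpers return the same triple of counts, so only the timings differ.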
ImA ohW

3 Answers

You can use Counter from the collections module.

labels = [-1, -1, 1, 1, 1]
from collections import Counter
label_count = Counter(labels)
label_count.get(1)

3

label_count.most_common()

[(1, 3), (-1, 2)]

Ref link: https://docs.python.org/2/library/collections.html#collections.Counter
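
One detail worth noting when a label may be absent from the data: Counter.get returns None for a missing key, while indexing the Counter returns 0. A minimal sketch on a plain list (the variable names below are illustrative only):

```python
from collections import Counter

labels = [-1, -1, 1, 1, 1]
label_count = Counter(labels)

# The label 0 never occurs in the data.
absent_via_get = label_count.get(0)         # None: plain dict-style lookup
absent_via_index = label_count[0]           # 0: Counter treats missing keys as zero
absent_via_default = label_count.get(0, 0)  # 0: explicit default

present = label_count[1]                    # 3
```

Indexing is the safer choice when the counts feed into arithmetic, since None would raise a TypeError there.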

Vikash Singh
  • Thanks Vikash, but I am looking for a better approach with GraphLab's SArray data type. Somehow it is not obvious to me from their API documentation. – ImA ohW Jun 26 '17 at 08:15
  • @SaravananChidambaram I don't think SArray has the Counter feature. Can't you convert your SArray to list and do the Counter computation? – Vikash Singh Jun 26 '17 at 08:45
  • Vikash: I ran some additional experiments comparing three methods: [1] converting the SArray with list() and using count(); [2] boolean-mask access like len(labels[labels == -1]), which a few people seem to be using; [3] Counter, as you suggested. I limited the array values to {-1, 0, 1} and varied the array size from 10 to a million. You may be interested to know that method #2 outperformed the other two as the size increased. However, I haven't experimented with arbitrary integer values rather than the three values used here. – ImA ohW Jun 26 '17 at 10:48
  • Method 2 might perform better because you counted only one number, whereas method 3 counts every unique value in the list. But I agree with you. It's just that I don't optimise until I have to: 2 seconds versus 10 seconds for millions of items does not make a difference. Only when run times grow to minutes or days do I start optimising. Personal style. Premature optimisation is not my style, and neither is complicating code for the sake of optimisation. Just stating my views :) – Vikash Singh Jun 26 '17 at 10:53

You can use this simple hack I used.

plus_one_count = labels.where(labels == 1, 1, 0).sum()

#plus_one_count = graphlab.SArray.where(labels == 1, 1, 0).sum()

minus_ones_count = labels.where(labels == -1, 1, 0).sum()

It's just returning an SArray with 1 where the condition is True and zero otherwise, and then sums it up.
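
The same indicator-and-sum idea can be sketched in plain Python, without GraphLab, to make the mechanics explicit. The helper name indicator_count is hypothetical and the input is an ordinary list, not an SArray:

```python
labels = [-1, -1, 1, 1, 1]


def indicator_count(values, target):
    # Build a list holding 1 where the element equals target and 0 otherwise,
    # mirroring what where(condition, 1, 0) produces.
    indicators = [1 if v == target else 0 for v in values]
    # Summing the indicators gives the number of matches.
    return sum(indicators)


plus_one_count = indicator_count(labels, 1)    # 3
minus_one_count = indicator_count(labels, -1)  # 2
```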

See the GraphLab Create documentation for SArray.where for details.

Hope this solved your problem.

volf

Source

labels = graphlab.SArray([-1, -1, 1, 1, 1])
print (labels == -1).nnz()
print (labels == 1).nnz()

Output

2
3

Links

https://turi.com/products/create/docs/generated/graphlab.SArray.nnz.html