How to calculate the number of items per user, grouped by item

Question

How can I output a result like this:

user    I   R   H
=================
atl001  2   1   0
cms017  1   2   1
lhc003  0   1   2

from a list like this:

atl001 I
atl001 I
cms017 H
atl001 R
lhc003 H
cms017 R
cms017 I
lhc003 H
lhc003 R
cms017 R

i.e. I want to calculate the number of I, H and R per user. Just a note that I can't use groupby from itertools in this particular case. Thanks in advance for your help. Cheers!!

I'm too old to do homework. Those are partial results from running `condor_q -format "%s " Owner -format "%s\n" 'ifThenElse(JobStatus==1,"I",ifThenElse(JobStatus==2,"R",ifThenElse(JobStatus==5,"H",string(JobStatus))))'`, which shows the current job status at the site. I'm just trying to sum up the `running`, `hold` and `idle` jobs per user. Cheers!! – MacUsers Apr 17 '11 at 08:29

score 6 · Accepted Answer · edited Apr 17 '11 at 16:36

data='''atl001 I
atl001 I
cms017 H
atl001 R
lhc003 H
cms017 R
cms017 I
lhc003 H
lhc003 R
cms017 R'''

stats={}
for i in data.split('\n'):
    user, irh = i.split()
    u = stats.setdefault(user, {})
    u[irh] = u.setdefault(irh, 0) + 1

print 'user  I  R  H'
for user in sorted(stats):
    stat = stats[user]
    print user, stat.get('I', 0), stat.get('R', 0), stat.get('H', 0)
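
For readers who are not tied to Python 2.3, the same counting could be sketched with `collections.Counter`. This is an assumption-laden variant, not what the answer above does: it needs Python 3 (which the asker did not have), and the names `count_statuses` and `format_table` are made up for this illustration.

```python
from collections import Counter

DATA = """atl001 I
atl001 I
cms017 H
atl001 R
lhc003 H
cms017 R
cms017 I
lhc003 H
lhc003 R
cms017 R"""


def count_statuses(text):
    """Count (user, status) pairs; a Counter returns 0 for missing keys."""
    return Counter(tuple(line.split())
                   for line in text.splitlines() if line.strip())


def format_table(counts, statuses=("I", "R", "H")):
    """Return the table as a list of strings: a header, then one row per user."""
    users = sorted({user for user, _ in counts})
    rows = ["user    " + "   ".join(statuses)]
    for user in users:
        cells = "   ".join(str(counts[(user, s)]) for s in statuses)
        rows.append("%-8s%s" % (user, cells))  # fixed width: assumes short user names
    return rows


counts = count_statuses(DATA)
print("\n".join(format_table(counts)))
```

The fixed column width of 8 is only good enough for the sample data; longer user names would need the width computed from the input.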

answered Apr 17 '11 at 09:14 by dugres · edited Apr 17 '11 at 16:36 by jfs

@dugres: this is where I went wrong: `u = stats.get(user, {})`. I just can't run your script because of `sorted()` (as I'm using v2.3) but I can take it from here. Thanks for the help. cheers!! – MacUsers Apr 17 '11 at 10:54
@MacUsers What do you mean by going wrong at ``u = stats.get(user, {})``? The dictionary method get() exists in Python 2.3. By the way, how long do you plan to use a prehistoric version of Python? – eyquem Apr 17 '11 at 11:32
@eyquem: Using v2.3 is not `my plan` at all. Some parts of the grid middleware we use are tied to this particular version of python. It's a full production system and we are kind of not allowed to upgrade/change things like python, gcc, libstdc etc. Upgrading python probably won't break anything but I just can't risk the system by assuming nothing is going to happen. Also, there is no higher version of the rpm package available for RHEL5. We are upgrading to SL6 (RHEL6) in June. Cheers!! – MacUsers Apr 17 '11 at 12:23
@MacUsers: python has v2.4 version in RHEL5. `yum whatprovides /usr/bin/python` -> `Repo: base; python-2.4.3-43.el5`. Python 2.4 has `itertools.groupby()`. – jfs Apr 17 '11 at 13:59
@dugres In your code, **stats** is a dictionary whose elements are dictionaries all having the same keys 'I','R','H'. I don't like data structures containing such repetitive information that eats memory. Anyway, if we don't care about the memory, your code is fine and I upvote. Note that the two lines ``n = u.get(irh, 0)`` and ``u[irh] = n+1`` can be replaced with ``u[irh] = u.setdefault(irh,0) + 1`` so your code becomes even shorter. – eyquem Apr 17 '11 at 15:13
@eyquem: I've added solution based on a flat dictionary though I doubt that memory usage is relevant in this case http://stackoverflow.com/questions/5692414/how-to-calculate-number-of-items-in-per-user-groupby-item/5694299#5694299 – jfs Apr 17 '11 at 16:20
@dugres: `stats[user]=u` line is redundant. You're already changing `stats[user]` while doing `u[irh] = n+1` – jfs Apr 17 '11 at 16:32
@eyquem: Sorry, I meant to say SL4 *not* SL/RHEL5. I know you are laughing but that's what we have to use for the service nodes until June'11 and then SL5 until SL6 gets certified. – MacUsers Apr 17 '11 at 21:18
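
The `get` versus `setdefault` point raised in the comments above can be illustrated with a minimal sketch. The dictionary names are made up for the example, and it uses only dictionary methods and single-argument `print(...)`, so it should behave the same on old Python 2 and on Python 3.

```python
# dict.get returns a default without storing it; dict.setdefault stores it.
stats_get = {}
u = stats_get.get("atl001", {})   # a fresh dict that stats_get never sees
u["I"] = u.get("I", 0) + 1
# stats_get is still empty here unless we also do: stats_get["atl001"] = u

stats_setdefault = {}
u = stats_setdefault.setdefault("atl001", {})  # the new dict is stored under the key
u["I"] = u.get("I", 0) + 1
# No write-back needed: u and stats_setdefault["atl001"] are the same object

print(stats_get)
print(stats_setdefault)
```

With `get`, an explicit `stats[user] = u` is required; with `setdefault`, that line is redundant.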

score 2 · eyquem · Answer 2 · 2011-04-17T21:54:01.993

data = 112*'cms017 R\n'

data = data + '''atl001 I
cms017 R
atl001 I
cms017 H
atl001 R
lhcabc003 H
cms017 R
lhcabc003 H
lhcabc003 R
cms017 R
cms017 R
cms017 R'''
print data,'\n'

stats = {}
d = {'I':0,'R':1,'H':2}
L = 0
for line in data.splitlines():
    user,irh = line.split()
    stats.setdefault(user,[0,0,0])
    stats[user][d[irh]] += 1
    L = max(L, len(user))

LL = len(str(max(max(stats[user])
                 for user in stats )))

cale = ' %%%ds %%%ds %%%ds' % (LL,LL,LL)
ch = 'user'.ljust(L) + cale % ('I','R','H')

print '%s\n%s' % (ch, len(ch)*'=')
print '\n'.join(user.ljust(L) + cale % tuple(stats[user])
                for user in sorted(stats.keys()))

result

user        I   R   H
=====================
atl001      2   1   0
cms017      0 117   1
lhcabc003   0   1   2

Also:

data = 14*'cms017 R\n'

data = data + '''atl001 I
cms017 R
atl001 I
cms017 H
atl001 R
lhcabc003 H
cms017 R
lhcabc003 H
lhcabc003 R
cms017 R
cms017 R
cms017 R'''
print data,'\n'

Y = {}
L = 0
for line in data.splitlines():
    user,irh = line.split()
    L = max(L, len(user))
    if (user,irh) not in Y:
        Y.update({(user,'I'):0,(user,'R'):0,(user,'H'):0})
    Y[(user,irh)] += 1

LL = len(str(max(x for x in Y.itervalues())))

cale = '%%-%ds %%%ds %%%ds %%%ds' % (L,LL,LL,LL)
ch = cale % ('user','I','R','H')

print '%s\n%s' % (ch, len(ch)*'=')
li = sorted(Y.keys())
print '\n'.join(cale % (a[0],Y[b],Y[c],Y[a])
                for a,b,c in (li[x:x+3] for x in xrange(0,len(li),3)))

result

user       I  R  H
==================
atl001     2  1  0
cms017     0 19  1
lhcabc003  0  1  2

PS:

The user names are all left-justified to a width of L characters, the length of the longest name.

To avoid the complexity of Sebastian's code, the I, R and H columns in my code all share one width LL, which is the number of digits in the largest count found in any of these columns.
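
The format-building trick used in both snippets above, where `%%` survives a first formatting pass as a literal `%`, can be sketched in isolation. The widths here are hard-coded sample values rather than computed from data.

```python
# Sample widths (assumed for the example): L for the user column,
# LL for each of the three count columns.
L, LL = 9, 3

# "%%" is a literal percent sign, so this first pass only fills in the widths.
cale = "%%-%ds %%%ds %%%ds %%%ds" % (L, LL, LL, LL)
# cale is now "%-9s %3s %3s %3s"

# The second pass fills in the actual values.
header = cale % ("user", "I", "R", "H")
row = cale % ("cms017", 0, 117, 1)
print(header)
print(row)
```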

answered Apr 17 '11 at 13:26 · edited Apr 17 '11 at 21:54

@eyquem: Could you explain this line: `LL = len(str(max(max(stats[user]) for user in stats )))` to me please? As usual, I'm getting the very same "syntax error" on this line where the `for` is. I don't understand why I get the syntax error in exactly the same place every time you form a line like this; I must be doing something wrong. Any idea what I am missing? Cheers!! – MacUsers Apr 18 '11 at 23:27
@MacUsers ``stats[user] for user in stats`` is successively **[2,1,0]**, **[0,117,1]** and **[0,1,2]**, so **max(stats[user])** is successively 2, 117, 2. Hence ``max(max(stats[user]) for user in stats )`` is 117: it is the maximum number of occurrences observed for one user and one status (cms017 and R in my example). What interests me after that is the length of this number when written out, in order to justify all the numbers of occurrences for all users and all statuses to the same width (3 characters in my example). – eyquem Apr 18 '11 at 23:54
@MacUsers You use one of the Python 2.3 versions, and generator expressions were only introduced in Python 2.4: see http://www.python.org/dev/peps/pep-0289/ : **PEP 289 -- Generator Expressions** : "BDFL Pronouncements : This PEP is ACCEPTED for Py2.4". Iterations enclosed in parentheses are generator expressions... which you can't use. In your far land where modernity arrives late, you must replace generator expressions with generator functions or with list comprehensions. In the present case, write ``LL = len(str(max[max(stats[user]) for user in stats ]))`` – eyquem Apr 19 '11 at 00:17
@eyquem: Even `LL = len(str(max[max(stats[user]) for user in stats ]))` returns the same "syntax error". Just tried your code on v2.4 and it worked just fine. Cheers!! – MacUsers Apr 19 '11 at 00:37
@MacUsers Are you able to write list comprehensions? For example, define ``li = [12,45,13,2,8,16,178,12,45,45]`` and then ``newli = [x for x in li if x<20]``. Does it work or not? By the way, what is the precise version you use: 2.3, 2.3.1, 2.3.2...? – eyquem Apr 19 '11 at 00:51
@eyquem: Yes, that works: I get `[12, 13, 2, 8, 16, 12]`. It's v2.3.4 – MacUsers Apr 19 '11 at 02:06
@MacUsers Put a line ``zaza = [ 1 for user in stats ]`` in the code. Does it work? Then put a line ``zozo = [user for user in stats]``. Does it work? Then try ``zuzu = [stats[user] for user in stats]``. Does it work? etc... Try various things, and look for the smallest change to something that works that makes it stop working. You must search yourself; it's difficult to give advice at a distance. – eyquem Apr 19 '11 at 05:53
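
A note on the list-comprehension form quoted in the comments above: as written there, `max[...]` has no call parentheses, and that is a syntax error on any Python version. This may be why the second attempt also failed on 2.3.4. A minimal sketch of the intended replacement, using the counts from the first result table:

```python
stats = {"atl001": [2, 1, 0], "cms017": [0, 117, 1], "lhcabc003": [0, 1, 2]}

# Generator expression (Python 2.4+):
#     max(max(stats[user]) for user in stats)
# List comprehension: max is a function, so it is called with parentheses
# and the square brackets go inside them.
per_user_max = [max(stats[user]) for user in stats]
LL = len(str(max(per_user_max)))
print(LL)
```

Here `LL` is the number of digits in the largest count, which is then used as the column width.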

score 1 · Answer 3 · answered Apr 17 '11 at 08:46

Well, using groupby for this problem makes no sense anyway. For starters, your data isn't sorted (groupby doesn't sort the groups for you), and the lines are very simple.
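
The point about sorting can be seen in a small sketch. It assumes a Python version that has `itertools.groupby` (2.4 or later, so not the asker's 2.3), and the four input lines are a made-up subset of the data in the question.

```python
from itertools import groupby

lines = ["atl001 I", "cms017 H", "atl001 R", "atl001 I"]
pairs = [tuple(line.split()) for line in lines]

# Unsorted: groupby only merges *adjacent* equal keys, so atl001 shows up twice.
unsorted_groups = [(user, len(list(group)))
                   for user, group in groupby(pairs, key=lambda p: p[0])]

# Sorted first: each user now forms a single run.
sorted_groups = [(user, len(list(group)))
                 for user, group in groupby(sorted(pairs), key=lambda p: p[0])]

print(unsorted_groups)
print(sorted_groups)
```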

Just keep count as you process each line. I am assuming you don't know what flags you'll get:

from sets import Set as set # python2.3 compatibility
counts = {} # counts stored in user -> dict(flag=counter) nested dicts
flags = set()
for line in inputfile:
    user, flag = line.strip().split()
    usercounts = counts.setdefault(user, {})
    usercounts[flag] = usercounts.setdefault(flag, 0) + 1
    flags.add(flag)

Printing the info after that is a question of iterating over your counts structure. I am assuming usernames are always 6 characters long:

flags = list(flags)
flags.sort()
users = counts.keys()
users.sort()
print "user  %s" % ('  '.join(flags))
print "=" * (6 + 3 * len(flags))
for user in users:
    line = [user]
    for flag in flags:
        line.append(str(counts[user].get(flag, 0)))  # join() needs strings
    print '  '.join(line)

All code above is untested, but should roughly work.

usernames are not always 6 characters long, but I can work that out to print the output. I already have a function to print the header (and the result) in a particular format and I think your code will fit in there. I'll report here back when I'm done. Thanks for the help. Cheers!! — MacUsers, Apr 17 '11 at 10:56

jfs · Answer 4 · 2011-04-17T16:25:42.807

Here's a variant that uses nested dicts to count job statuses and computes max field widths before printing:

#!/usr/bin/env python
import fileinput
from sets import Set as set # python2.3

# parse job statuses
counter = {}
for line in fileinput.input():
    user, jobstatus = line.split()
    d = counter.setdefault(user, {})
    d[jobstatus] = d.setdefault(jobstatus, 0) + 1

# print job statuses
# . find field widths
status_names = set([name for st in counter.itervalues() for name in st])
maxstatuslens = [max([len(str(i)) for st in counter.itervalues()
                      for n, i in st.iteritems()
                      if name == n])
                 for name in status_names]
maxuserlen = max(map(len, counter))
row_format = (("%%-%ds " % maxuserlen) +
              " ".join(["%%%ds" % n for n in maxstatuslens]))
# . print header
header = row_format % (("user",) + tuple(status_names))
print header
print '='*len(header)
# . print rows
for user, statuses in counter.iteritems():
    print row_format % (
        (user,) + tuple([statuses.get(name, 0) for name in status_names]))

Example

$ python print-statuses.py <input.txt
user   I H R
============
lhc003 0 2 1
cms017 1 1 2
atl001 2 0 1

Here's a variant that uses a flat dictionary with a tuple (user, status_name) as a key:

#!/usr/bin/env python
import fileinput
from sets import Set as set # python 2.3

# parse job statuses
counter = {}
maxstatuslens = {}
maxuserlen = 0
for line in fileinput.input():
    key = user, status_name = tuple(line.split())
    i = counter[key] = counter.setdefault(key, 0) + 1
    maxstatuslens[status_name] = max(maxstatuslens.setdefault(status_name, 0),
                                     len(str(i)))
    maxuserlen = max(maxuserlen, len(user))

# print job statuses
row_format = (("%%-%ds " % maxuserlen) +
              " ".join(["%%%ds" % n for n in maxstatuslens.itervalues()]))
# . print header
header = row_format % (("user",) + tuple(maxstatuslens))
print header
print '='*len(header)
# . print rows
for user in set([k[0] for k in counter]):
    print row_format % ((user,) +
        tuple([counter.get((user, status), 0) for status in maxstatuslens]))

The usage and output are the same.

@eyquem: nested: `{'atl001': {'I': 2, 'R': 1}}`, flat: `{('atl001', 'I'): 2, ('atl001', 'R'): 1}`. — jfs, Apr 17 '11 at 17:12
@J.F. Sebastian Ok, thank you. That's what I used in my second solution: Y is a flat dictionary — eyquem, Apr 17 '11 at 17:29
@J.F. Sebastian,@eyquem: You guys are amazing! I've learned so many things from you; it's really appreciated. I'm going through your codes to see which one works best for me. Cheers!! – MacUsers Apr 18 '11 at 06:41
@J.F. Sebastian: Is there any advantage(s) using flat dict instead of nested? Cheers!! — MacUsers, Apr 18 '11 at 19:37
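
The thread does not answer this last question. As a rough sketch of the practical difference between the two shapes (sample values are made up, and this is an illustration, not a recommendation from either answerer):

```python
nested = {"atl001": {"I": 2, "R": 1}, "lhc003": {"H": 2, "R": 1}}

# Flatten: one entry per (user, status) pair.
flat = {}
for user, statuses in nested.items():
    for status, n in statuses.items():
        flat[(user, status)] = n

# Nested: all of one user's counts are a single lookup away.
atl_nested = nested["atl001"]

# Flat: the same per-user view means filtering the keys.
atl_flat = {}
for (user, status), n in flat.items():
    if user == "atl001":
        atl_flat[status] = n

# Flat: a single count with a default is one call.
held = flat.get(("atl001", "H"), 0)
```

Both hold the same information; which is more convenient depends on whether the code mostly asks for one user's counts or for one (user, status) pair at a time.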

score 0 · Answer 5 · answered Apr 17 '11 at 08:30

As a hint:

Use a nested dictionary structure for counting the occurrences:

user -> character -> occurrences of the character for user

Writing the parser code, incrementing the counters and printing the result is up to you... a good exercise.

Not everyone (especially me) is a python expert like you. I can do that in bash but I wanted it in python to integrate that part in my existing code. Thanks for the hint though. cheers!! – MacUsers Apr 17 '11 at 08:51
Then learn it...or do it in bash. Stackoverflow is not about write-my-code :) – Apr 17 '11 at 08:55
With due respect, if replying is a problem for you, then don't reply; whatever you know, just keep it to yourself. I never asked you to write my code, I just asked for some suggestions, and sample code always helps more than your "theory from the book". I don't really need to hear from you what I should use - bash or python or something else. If you can't help, then don't; just don't make it worse. My apologies to everyone if I asked a question which I shouldn't have asked in the first place. Thanks! – MacUsers Apr 17 '11 at 10:56
