
I have the following code in a Jupyter notebook:

import h5py
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_hdf('accounting-2018-10-deid.h5', 'table')
df.columns
Out[4]:
Index(['group', 'owner', 'job_number', 'submission_time', 'start_time',
   'end_time', 'failed', 'exit_status', 'granted_pe', 'slots',
   'task_number', 'maxvmem', 'h_data', 'h_rt', 'highp', 'exclusive',
   'h_vmem', 'gpu', 'pe', 'slot', 'wait_time', 'wtime', 'campus'],
  dtype='object')

The meanings of the columns:

owner: the owner of a job
group: the group an owner belongs to; a group can have one or more owners

The task is: for each group, list the number of users and list all of those users (i.e. the users that share the same “group” field). For example: group 1 (4 users): user2, user32, user41, user56.

I tried to use groupby() but didn't get the right answer. Please help me.
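For reference, here is a small hypothetical DataFrame (made-up owners and groups, not the real accounting data) and the kind of output I am hoping for:

import pandas as pd

# Hypothetical sample data standing in for the real accounting file
sample = pd.DataFrame({
    "owner": ["user2", "user32", "user41", "user56", "user7"],
    "group": ["group1", "group1", "group1", "group1", "group2"],
})

# Desired output, roughly:
# group1 (4 users): user2, user32, user41, user56
# group2 (1 user): user7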

    Please read [How to create a Minimal, Complete, and Verifiable example](https://stackoverflow.com/help/mcve) – Sheldore Jan 18 '19 at 01:15

1 Answer


Does this work for you?

import pandas as pd

df = pd.DataFrame({"owner": ["Allen", "Bob", "Cindy", "David", "Emily", "Frank"],
                   "group": ["A", "C", "B", "C", "B", "B"]})

groups = df.groupby("group")
# iterating over a GroupBy yields (group name, sub-DataFrame) tuples
for group in groups:
    print('There are {} owners in group {}'.format(group[1].shape[0], group[0]))
    print('They are {}.'.format(group[1].owner.to_string(index=False).replace('\n', ', ')))
    print()
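If the same owner can appear on many rows (one row per job) and each user should only be counted once per group, a variant with unique() should work too (a sketch on made-up data, not tested against your real file):

import pandas as pd

df = pd.DataFrame({"owner": ["Allen", "Bob", "Bob", "Cindy", "David", "Emily"],
                   "group": ["A", "C", "C", "B", "C", "B"]})

# unique() collapses repeated rows for the same owner within a group
for name, owners in df.groupby("group")["owner"]:
    users = owners.unique()
    print('group {} ({} users): {}'.format(name, len(users), ', '.join(users)))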
  • Hi keineahnung2345, I did as you advised, but got a MemoryError: `MemoryError Traceback (most recent call last) ... 1 groups = df.groupby("group") ----> 2 for group in groups: ... MemoryError:` – Tal Nur Jan 18 '19 at 16:29
  • @TalNur I guess it's because your dataset is too large. Could you try this method with a smaller dataset and see if it works? You can also try adding `low_memory=False` or `usecols=['group', 'owner']` to `pd.read_hdf()`, as https://stackoverflow.com/questions/17557074/memory-error-when-using-pandas-read-csv/47230263#47230263 and https://stackoverflow.com/questions/26063231/read-specific-columns-with-pandas-or-other-python-module suggest, and see if that helps. – keineahnung2345 Jan 19 '19 at 00:16
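For what it's worth, pandas' read_hdf exposes a columns argument for selecting columns from table-format stores (rather than read_csv's usecols), so restricting the read to just the two needed columns might look like this (a sketch, assuming the store was written in 'table' format):

import pandas as pd

# Read only the two columns the grouping needs; column selection requires
# the HDF5 store to have been written in 'table' format.
df = pd.read_hdf('accounting-2018-10-deid.h5', 'table', columns=['group', 'owner'])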