
I'm using Spark version 2.0.1 and Python 2.7, and I'm running the following code:

import numpy as np
from pyspark.sql.functions import monotonically_increasing_id

# This will return a new DF with all the columns + id
data1 = data.withColumn("id", monotonically_increasing_id())  # Create an integer index
data1.show()

def create_indexes(df,
                   fields=['country', 'state_id', 'airport', 'airport_id']):
    """ Create indexes for the different element ids
        for CMRs. This allows us to select CMRs that match
        a given element and element value very quickly.
    """
    if fields is None:
        print("No fields specified, returning")
        return
    for field in fields:
        if field not in df.columns:
            print('field: ' + field + ' is not in the data...')
            return
    indexes = {}
    for field in fields:
        print(field)
        res = df.groupby(field)
        index = {label: np.array(vals['id'], np.int32) for label, vals in res}
        indexes[field] = index
    return indexes

# Create indexes. Some of them take a lot of time!
#Changed dom_client_id by gbl_buy_grp_id as it was changed in Line Number 
indexes = create_indexes(data1, fields=['country', 'state_id', 'airport', 'airport_id'])
print type(indexes)

I'm getting the following error message while running this code:

TypeError: 'GroupedData' object is not iterable

Can you please help me to solve this issue?

Python Learner

1 Answer


You have to perform an aggregation on the GroupedData and collect the results before you can iterate over them, e.g. count items per group: `res = df.groupby(field).count().collect()`
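
For instance, a minimal sketch, assuming the `data1` DataFrame and the `country` column from the question:

# A minimal sketch: the aggregation turns the GroupedData back into a DataFrame,
# and collect() returns a list of Row objects that can be iterated over.
res = data1.groupby('country').count().collect()
for row in res:
    print row['country'], row['count']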

Bernhard
  • Thank you Bernhard for your comment. But actually I'm creating an index and returning it. Please refer to the code `index = {label: np.array(vals['id'], np.int32) for label, vals in res}; indexes[field] = index`. I'm not sure how to use collect() here – Python Learner Oct 17 '17 at 13:44
  • Any solution please? – Python Learner Oct 17 '17 at 16:12
  • Well, I don't know what you want to achieve. groupby will group your data based on the field attribute you specify. With the grouped data, you have to perform an aggregation, e.g. get the count, sum, average... of values in that group. To tell Spark to actually do the work and return results you have to perform a collect operation. This returns a list of row objects over which you can iterate. Check the examples in the documentation, they demonstrate it quite well. – Bernhard Oct 17 '17 at 18:33
  • Thank you Bernhard. After groupby, I need the values of each group, not any further aggregation. Is there any way to avoid further aggregation and just take the values of each group? – Python Learner Oct 17 '17 at 18:44
  • Hi Bernhard, after groupby(), I need to get the distinct elements. I tried `res = df.groupby(field).distinct().collect()` and got the error message `AttributeError: 'GroupedData' object has no attribute 'distinct'`. Is there any other function similar to distinct() for GroupedData objects? – Python Learner Oct 19 '17 at 19:08
  • Did you check the answer I referenced? Instead of collect_list you can use collect_set (http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=dataframe#pyspark.sql.functions.collect_set) to get distinct values of a column. – Bernhard Oct 19 '17 at 19:15
  • Thank you Bernhard. Yes, I saw your answer. I ran into some difficulties as I'm using more than 15 attributes in my model. I also tried collect_set() but got the error message `AttributeError: 'GroupedData' object has no attribute 'collect_set'` – Python Learner Oct 19 '17 at 20:14
  • OK, I found a similar post: https://stackoverflow.com/questions/37580782/pyspark-collect-set-or-collect-list-with-groupby. I will go through it – Python Learner Oct 19 '17 at 20:18
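
Following the collect_set suggestion in the comments above, here is a sketch of how the index-building function from the question could be rewritten. It assumes the same `data1` DataFrame and column names; note that collect_set comes from pyspark.sql.functions and is passed to agg(), it is not a method of GroupedData (hence the AttributeError mentioned in the comments).

import numpy as np
from pyspark.sql.functions import collect_set

def create_indexes(df, fields):
    """ Build {field: {value: array of ids}} by aggregating the distinct
        ids per group with collect_set, then collecting the Rows. """
    indexes = {}
    for field in fields:
        # agg() turns the GroupedData back into a DataFrame that can be collected
        rows = df.groupby(field).agg(collect_set('id').alias('ids')).collect()
        indexes[field] = {row[field]: np.array(row['ids'], np.int32) for row in rows}
    return indexes

indexes = create_indexes(data1, fields=['country', 'state_id', 'airport', 'airport_id'])
print type(indexes)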