
This is a basic question so apologies in advance.

I am using Pandas and I am grouping data with the following line:

page_serp_df.groupby([page_serp_df.meta_keywords_1_length]).count()['keyword']

This is referencing the following:

  • The data frame: [page_serp_df]
  • Grouping by the column: meta_keywords_1_length
  • Counting with the filter: keyword column

What I don't understand is why the selection has to be ['keyword'], i.e. a string in quotes. For example, this doesn't work, which I find very counterintuitive:

page_serp_df.groupby([page_serp_df.meta_keywords_1_length]).count()[page_serp_df.keyword]

Thanks in advance!

Vaziri-Mahmoud
Sandy Lee
  • That's not the right way to filter a groupby. Think of it this way: you group your entire result set (the dataframe), select the column or condition to group by, then select your target column and apply a specific type of aggregation. Is keyword a column, or a value in a column you need to filter by? – Umar.H Sep 23 '20 at 11:31
  • Thanks for getting back. 'Keyword' is a column. – Sandy Lee Sep 23 '20 at 11:34
  • try `page_serp_df.groupby([page_serp_df['meta_keywords_1_length']])['keyword'].count()` – Umar.H Sep 23 '20 at 11:43
  • So I guess that is where I am struggling conceptually: the difference between the dot-notation (i.e. page_serp_df.meta_keywords_1_length) and the square bracket (i.e. [page_serp_df['meta_keywords_1_length']]). What is the difference and which is best practice? – Sandy Lee Sep 23 '20 at 11:51
  • check [this](https://stackoverflow.com/questions/41130255/for-pandas-dataframe-whats-the-difference-between-using-squared-brackets-or-do) – jezrael Sep 23 '20 at 11:57

1 Answer


I think there is a misunderstanding about what the .count() method returns.

Try to follow this example:

Create a sample data frame

import pandas as pd

df = pd.DataFrame({
    'A': [0, 1, 0, 1, 1],
    'B': [100, 200, 300, 400, 500],
    'C': [1, 2, 3, 4, 5]
})

This is what the count() method will return after groupby

# similarly to your example I am grouping by A and counting 
df.groupby([df.A]).count()


The count() method returns a dataframe itself: for each distinct value of the grouped column, it holds the number of non-null values in every other column. After that, you can select a specific column from the result of count() like this:

df.groupby([df.A]).count()['C']
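To make the shape of the result concrete, here is a minimal sketch using the same sample data. It groups with the column name 'A' rather than [df.A], which is an equivalent spelling for this data, and the variable names (counts, c_after, c_before) are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [0, 1, 0, 1, 1],
    'B': [100, 200, 300, 400, 500],
    'C': [1, 2, 3, 4, 5],
})

# count() on the grouped object returns a DataFrame:
# one row per value of A, one column per remaining column (B and C).
counts = df.groupby('A').count()
print(counts)

# Selecting 'C' from that DataFrame gives a Series of counts per group.
c_after = counts['C']

# Selecting the column before aggregating gives the same Series
# and avoids counting columns you do not need.
c_before = df.groupby('A')['C'].count()
print(c_before)
```

Selecting the column before calling count(), as suggested in the comments, is usually the tidier form, since only the column of interest is aggregated.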

But the second case in your example, which in my example corresponds to df.groupby([df.A]).count()[df.C], will throw an error!


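The reason is that you would be indexing a dataframe (in this case df.groupby([df.A]).count()) with a pandas Series. pandas then treats the values of that Series as column labels to look up, and none of them exist. To select a column you need to pass its label, i.e. a string that appears in the columns of the result.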
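The failure can be reproduced in a few lines. This is a sketch on the same sample data; the try/except and the `raised` flag are only there so the snippet runs to completion instead of stopping at the error.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [0, 1, 0, 1, 1],
    'B': [100, 200, 300, 400, 500],
    'C': [1, 2, 3, 4, 5],
})

counts = df.groupby('A').count()  # columns are 'B' and 'C'

raised = False
try:
    # df['C'] holds the values 1..5; pandas looks for columns
    # with those labels in counts and finds none of them.
    counts[df['C']]
except KeyError as exc:
    raised = True
    print('KeyError:', exc)
```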

You can check yourself that df.C and 'C' are two very different variable types.

print(type(df.C))
print(type('C'))
# <class 'pandas.core.series.Series'>
# <class 'str'>
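On the dot-versus-bracket question raised in the comments: both df.C and df['C'] return the same Series, so neither is a column label. A small sketch, using a hypothetical frame df2 with a column deliberately named 'count' to show where dot access stops working:

```python
import pandas as pd

# Hypothetical frame; the 'count' column name is chosen on purpose
# because it clashes with the DataFrame.count method.
df2 = pd.DataFrame({'C': [1, 2], 'count': [10, 20]})

# Dot access and bracket access return the same Series for 'C'.
same = df2.C.equals(df2['C'])

# For 'count', dot access finds the method, not the column.
dot_is_method = callable(df2.count)
bracket_is_series = isinstance(df2['count'], pd.Series)

print(same, dot_is_method, bracket_is_series)
```

Bracket access with a string works for any column name, which is why it is the safer habit.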

If for some reason your code still works with the equivalent of df.C, it is probably a coincidence, for example the values in df.C happen to be strings that match column names, or something unintentional like that.
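A sketch of how such a coincidence could arise, on the same sample data: the Series `labels` below is hypothetical and built by hand so that its only value is the string 'C'.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [0, 1, 0, 1, 1],
    'B': [100, 200, 300, 400, 500],
    'C': [1, 2, 3, 4, 5],
})

counts = df.groupby('A').count()

# A Series whose values happen to be valid column labels.
labels = pd.Series(['C'])

# Indexing with it succeeds, because pandas reads the values
# as a list of column labels. The result is a DataFrame, not a Series.
selected = counts[labels]
print(selected)
```

Note that the result is a one-column dataframe rather than the Series that counts['C'] returns, so even when this works it is not quite the same thing.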

JacoSolari