
My data is organized in a data frame with the following structure:

| ID       | Post                | Platform    |
| -------- | ------------------- | ----------- |
| 1        | Something #hashtag1 | Twitter     |
| 2        | Something #hashtag2 | Insta       |
| 3        | Something #hashtag1 | Twitter     |

I have been able to extract and count the hashtags using the following (using this post):

    df.Post.str.extractall(r'(\#\w+)')[0].value_counts().rename_axis('hashtags').reset_index(name='count')
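For reference, here is a minimal reproduction of that step on the sample table above (a sketch; column names and values are taken from the question):

    import pandas as pd

    df = pd.DataFrame({
        'ID': [1, 2, 3],
        'Post': ['Something #hashtag1', 'Something #hashtag2', 'Something #hashtag1'],
        'Platform': ['Twitter', 'Insta', 'Twitter'],
    })

    # one row per hashtag occurrence, then tally
    counts = (df.Post.str.extractall(r'(#\w+)')[0]
                .value_counts()
                .rename_axis('hashtags')
                .reset_index(name='count'))
    print(counts)
    #     hashtags  count
    # 0  #hashtag1      2
    # 1  #hashtag2      1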

I am now trying to count hashtag occurrences per platform. I am trying the following:

    df.groupby(['Post', 'Platform'])['Post'].str.extractall(r'(\#\w+)')[0].value_counts().rename_axis('hashtags').reset_index(name='count')

But I am getting the following error:

    AttributeError: 'SeriesGroupBy' object has no attribute 'str'

1 Answer


We can solve this easily in two steps, assuming each post has just a single hashtag:

    Step 1: Create a new column with the hashtag
    # extractall returns a (row, match) MultiIndex; reset_index() restores a
    # plain RangeIndex, so the assignment aligns positionally with df's rows
    df['hashtag'] = df.Post.str.extractall(r'(\#\w+)')[0].reset_index()[0]

    Step 2: Group by platform and get the counts
    df.groupby('Platform').hashtag.count()
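If the goal is a count per hashtag within each platform, rather than a single total per platform, grouping by both columns is a small variation on Step 2 (a sketch, not part of the original answer):

    # one count per (platform, hashtag) pair
    df.groupby(['Platform', 'hashtag']).size().reset_index(name='count')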

Generic solution: works for any number of hashtags per post. We can solve this in a few steps.

    import pandas as pd

    # extract all hashtags; the result has a (row, match) MultiIndex
    df1 = df.Post.str.extractall(r'(\#\w+)')[0].reset_index()

    # set the index to the index of the original table the hashtag
    # came from ('level_0' holds the original row label)
    df1.set_index('level_0', inplace=True)

    df1.rename(columns={0: 'hashtag'}, inplace=True)

    # join each hashtag back onto its originating row
    df2 = pd.merge(df, df1, right_index=True, left_index=True)

    df2.groupby('Platform').hashtag.count()
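For what it's worth, on pandas 0.25+ the same long table can be built in one step with str.findall plus explode, which avoids the index bookkeeping (a sketch, not the answer's method; df3 is a hypothetical name):

    # list of hashtags per post, then one row per (post, hashtag) pair
    df3 = df.assign(hashtag=df.Post.str.findall(r'#\w+')).explode('hashtag')
    df3.groupby('Platform').hashtag.count()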
  • This is a great starting point. The number of rows in `df2` increases as it creates multiple rows for posts that had more than 1 hashtag. It would be great to create a wide dataframe instead of a long one. – DotPi Jan 10 '22 at 14:00
  • Agreed, but don't you think a wide transformation would increase the complexity multifold? – Ashwiniku918 Jan 10 '22 at 14:10
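Following up on the wide-format point in the comments: pd.crosstab on the merged frame df2 from the answer yields one row per platform and one column per hashtag, which is the wide shape being discussed (a sketch):

    # wide view: platforms as rows, hashtags as columns, counts as values
    pd.crosstab(df2.Platform, df2.hashtag)
    #
    # hashtag   #hashtag1  #hashtag2
    # Platform
    # Insta             0          1
    # Twitter           2          0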