
I am trying to plot a daily follower count for various Twitter handles. The result should look something like the chart below, but filterable by more than one Twitter handle:

[Image: Follower Count]

Usually, I would do this by simply appending each new dataset pulled from Twitter to the original table, along with the date of the pull. However, that table would grow to a million rows in just a few days, and it wouldn't let me clearly see when a user has dropped off.

As an alternative, after pulling my data from Twitter, I structured my pandas dataframe like this:

Follower_ID          Handles    Start_Date  End_Date
100                  x          30/05/2017  NaN
101                  x          21/04/2017  29/05/2017
201                  y          14/06/2017  NaN
100                  y          16/06/2017  28/06/2017

Where:

  • Handles: the accounts I am pulling the followers for
  • Follower_ID: the user following a handle

So, for example, if I were Follower_ID 100, I could follow both handle x and handle y.

I am wondering what would be the best way to prepare the data (pivot, clean through a function, groupby) so that then it can be plotted accordingly. Any ideas?

Matt M
  • 2
    I may be missing something, but could you elaborate on the meanings of `Follower_ID` and `Handles` in your example DataFrame? Each handle has two different follower IDs, and follower ID 100 has two different handles. – Peter Leimbigler Jun 30 '17 at 14:35
  • 2
    @PeterLeimbigler yes, let me update the question, sorry, I can see how this could be confusing! – Matt M Jun 30 '17 at 14:53

1 Answer


I ended up using iterrows in a naïve approach, so there could be a more efficient way that takes advantage of pandas reshaping, etc. But my idea was to make a function that takes in your dataframe and the handle you want to plot, and then returns another dataframe with that handle's daily follower counts. To do this, the function

  • filters the df to the desired handle only,
  • takes each date range (for example, 21/04/2017 to 29/05/2017),
  • turns that into a pandas date_range, and
  • puts all the dates in a single list.

At that point, collections.Counter on the single list is a simple way to tally up the results by day.

One note is that the null End_Dates should be coalesced to whatever end date you want on your graph. I call that the max_date when I wrangle the data. So altogether:

from io import StringIO
from collections import Counter
import pandas as pd

def get_counts(df, handle):
    """Inputs: your dataframe and the handle
    you want to plot.

    Returns a dataframe of daily follower counts.
    """

    # filters the df to the desired handle only
    df_handle = df[df['Handles'] == handle]

    all_dates = []

    for _, row in df_handle.iterrows():
        # Take each date range (for example, 21/04/2017 to 29/05/2017),
        # turn that into a pandas `date_range`, and
        # put all the dates in a single list
        all_dates.extend(pd.date_range(row['Start_Date'],
                                       row['End_Date']) \
                           .tolist())

    counts = pd.DataFrame.from_dict(Counter(all_dates), orient='index') \
                         .rename(columns={0: handle}) \
                         .sort_index()

    return counts

That's the function. Now reading and wrangling your data ...

data = StringIO("""Follower_ID          Handles    Start_Date  End_Date
100                  x          30/05/2017  NaN
101                  x          21/04/2017  29/05/2017
201                  y          14/06/2017  NaN
100                  y          16/06/2017  28/06/2017""")

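# any run of whitespace separates the columns
df = pd.read_csv(data, sep=r'\s+')

# pandas timestamps (so that we can use pd.date_range);
# the dates are day-first (dd/mm/yyyy), so state the format explicitly
df['Start_Date'] = pd.to_datetime(df['Start_Date'], format='%d/%m/%Y')
df['End_Date'] = pd.to_datetime(df['End_Date'], format='%d/%m/%Y')

# fill in missing end dates (NaN is parsed to NaT above)
max_date = pd.Timestamp('2017-06-30')
df['End_Date'] = df['End_Date'].fillna(max_date)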

print(get_counts(df, 'y'))

The last line prints this for handle y:

            y
2017-06-14  1
2017-06-15  1
2017-06-16  2
2017-06-17  2
2017-06-18  2
2017-06-19  2
2017-06-20  2
2017-06-21  2
2017-06-22  2
2017-06-23  2
2017-06-24  2
2017-06-25  2
2017-06-26  2
2017-06-27  2
2017-06-28  2
2017-06-29  1
2017-06-30  1

You can plot this dataframe with your preferred package.
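
As for the more efficient way hinted at in the first paragraph, here is one possible sketch that avoids iterrows. It is not what the function above does: it records a +1 on each start date and a -1 on the day after each end date, then takes a cumulative sum per handle. The name `daily_counts` and its `max_date` parameter are made up for this example, and it assumes both date columns are already parsed and that End_Date is inclusive.

```python
import pandas as pd

def daily_counts(df, max_date):
    """Daily follower counts, one column per handle (hypothetical helper).

    Assumes Start_Date/End_Date are datetime columns and End_Date is inclusive.
    """
    # open-ended rows run until the last day of the graph
    ends = df['End_Date'].fillna(max_date)

    # +1 on the day a follower starts, -1 on the day after they stop
    starts = pd.DataFrame({'Handles': df['Handles'],
                           'date': df['Start_Date'],
                           'change': 1})
    stops = pd.DataFrame({'Handles': df['Handles'],
                          'date': ends + pd.Timedelta(days=1),
                          'change': -1})
    events = pd.concat([starts, stops], ignore_index=True)

    # net change per day, one column per handle
    net = (events.groupby(['date', 'Handles'])['change']
                 .sum()
                 .unstack('Handles', fill_value=0))

    # add the days without any event, then accumulate the changes
    all_days = pd.date_range(net.index.min(), max_date)
    return net.reindex(all_days, fill_value=0).cumsum()

# the sample data from the question, with day-first dates
df = pd.DataFrame({
    'Follower_ID': [100, 101, 201, 100],
    'Handles': ['x', 'x', 'y', 'y'],
    'Start_Date': pd.to_datetime(
        ['30/05/2017', '21/04/2017', '14/06/2017', '16/06/2017'],
        format='%d/%m/%Y'),
    'End_Date': pd.to_datetime(
        [None, '29/05/2017', None, '28/06/2017'],
        format='%d/%m/%Y'),
})

counts = daily_counts(df, pd.Timestamp('2017-06-30'))
print(counts.tail())
```

Unlike `get_counts`, this returns every handle at once and fills days before a handle's first follower with 0, which may or may not be what you want on the graph.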

Niels Joaquin
  • Hi Niels, amazing. Thank you for your help, I really couldn't wrap my head around it. Is there a way to make each handle its own column? – Matt M Jul 01 '17 at 09:41
  • 1
    @MattM I can get you started on that. I just made an edit so that the column name is the handle. If you do `pd.concat([get_counts(df, 'x'), get_counts(df, 'y')], axis=1)`, that merges `x` and `y` together in the same dataframe for the simple, two-handle case. I leave it up to you to loop over the handles for the _n_ case! – Niels Joaquin Jul 01 '17 at 15:20
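
For the n-handle case left open in the last comment, one way to sketch the loop is a list comprehension over the unique handles, passed to `pd.concat`. The `get_counts` below is a compact copy of the answer's function so that the snippet runs on its own. The variable names `handles` and `combined`, and the choice to treat days without data as 0 followers, are assumptions of this example.

```python
from collections import Counter
import pandas as pd

def get_counts(df, handle):
    # compact copy of the answer's function: one date per follower per day,
    # tallied with Counter
    df_handle = df[df['Handles'] == handle]
    all_dates = []
    for _, row in df_handle.iterrows():
        all_dates.extend(pd.date_range(row['Start_Date'], row['End_Date']))
    return (pd.DataFrame.from_dict(Counter(all_dates), orient='index')
              .rename(columns={0: handle})
              .sort_index())

# the sample data from the question, with day-first dates
df = pd.DataFrame({
    'Follower_ID': [100, 101, 201, 100],
    'Handles': ['x', 'x', 'y', 'y'],
    'Start_Date': pd.to_datetime(
        ['30/05/2017', '21/04/2017', '14/06/2017', '16/06/2017'],
        format='%d/%m/%Y'),
    'End_Date': pd.to_datetime(
        [None, '29/05/2017', None, '28/06/2017'],
        format='%d/%m/%Y'),
})
df['End_Date'] = df['End_Date'].fillna(pd.Timestamp('2017-06-30'))

# one column per handle, aligned on the date index
handles = df['Handles'].unique()
combined = pd.concat([get_counts(df, h) for h in handles], axis=1)

# days missing for a handle become NaN after the concat; treat them as 0
combined = combined.sort_index().fillna(0).astype(int)
print(combined.tail())
```

If matplotlib is installed, `combined.plot()` should then draw one line per handle.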