Plot Count of Pandas Dataframe with Start_Date and End_Date

Question

I am trying to plot a daily follower count for various Twitter handles. The result should look something like what you see below, but filterable by more than one Twitter handle:

Usually, I would do this by simply appending a new dataset pulled from Twitter to the original table, with the date of the log being pulled. However, this would make me end up with a million lines in just a few days. And it wouldn't allow me to clearly see when a user has dropped off.

As an alternative, after pulling my data from Twitter, I structured my pandas dataframe like this:

Follower_ID          Handles    Start_Date  End_Date
100                  x          30/05/2017  NaN
101                  x          21/04/2017  29/05/2017
201                  y          14/06/2017  NaN
100                  y          16/06/2017  28/06/2017

Where:

Handles: the accounts I am pulling the followers for
Follower_ID: the user following a handle

So, for example, if I were Follower_ID 100, I could follow both handle x and handle y

I am wondering what would be the best way to prepare the data (pivot, clean through a function, groupby) so that then it can be plotted accordingly. Any ideas?

I may be missing something, but could you elaborate on the meanings of `Follower_ID` and `Handles` in your example DataFrame? Each handle has two different follower IDs, and follower ID 100 has two different handles. — Peter Leimbigler, Jun 30 '17 at 14:35
@PeterLeimbigler yes, let me update the question, sorry, I can see how this could be confusing! — Matt M, Jun 30 '17 at 14:53

Niels Joaquin · Accepted Answer · 2017-07-01T15:10:54.750

I ended up using iterrows in a naïve approach, so there could be a more efficient way that takes advantage of pandas reshaping, etc. But my idea was to make a function that takes in your dataframe and the handle you want to plot, and then returns another dataframe with that handle's daily follower counts. To do this, the function

filters the df to the desired handle only,
takes each date range (for example, 21/04/2017 to 29/05/2017),
turns that into a pandas date_range, and
puts all the dates in a single list.

At that point, collections.Counter on the single list is a simple way to tally up the results by day.
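As a small illustration of that tallying step, here is a minimal sketch with two made-up, overlapping date ranges (the values are illustrative only and do not come from the question's data). Each range is expanded into individual days, and Counter then counts how many ranges cover each day:

```python
from collections import Counter

import pandas as pd

# Two overlapping follower date ranges (illustrative values only)
ranges = [("2017-06-14", "2017-06-17"), ("2017-06-16", "2017-06-18")]

all_dates = []
for start, end in ranges:
    # pd.date_range includes both endpoints by default
    all_dates.extend(pd.date_range(start, end).tolist())

# Each key is a day; each value is the number of ranges covering that day
tally = Counter(all_dates)
for day in sorted(tally):
    print(day.date(), tally[day])
```

Days covered by both ranges (16 and 17 June here) get a count of 2, and the rest get 1.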

One note is that the null End_Dates should be coalesced to whatever end date you want on your graph. I call that the max_date when I wrangle the data. So altogether:

from io import StringIO
from collections import Counter
import pandas as pd

def get_counts(df, handle):
    """Inputs: your dataframe and the handle
    you want to plot.

    Returns a dataframe of daily follower counts.
    """

    # filters the df to the desired handle only
    df_handle = df[df['Handles'] == handle]

    all_dates = []

    for _, row in df_handle.iterrows():
        # Take each date range (for example, 21/04/2017 to 29/05/2017),
        # turn that into a pandas `date_range`, and
        # put all the dates in a single list
        all_dates.extend(pd.date_range(row['Start_Date'],
                                       row['End_Date']) \
                           .tolist())

    counts = pd.DataFrame.from_dict(Counter(all_dates), orient='index') \
                         .rename(columns={0: handle}) \
                         .sort_index()

    return counts

That's the function. Now reading and wrangling your data ...

data = StringIO("""Follower_ID          Handles    Start_Date  End_Date
100                  x          30/05/2017  NaN
101                  x          21/04/2017  29/05/2017
201                  y          14/06/2017  NaN
100                  y          16/06/2017  28/06/2017""")

df = pd.read_csv(data, sep=r'\s+')

# pandas timestamps (so that we can use pd.date_range);
# the dates are day-first, so give the format explicitly
df['Start_Date'] = pd.to_datetime(df['Start_Date'], format='%d/%m/%Y')
df['End_Date'] = pd.to_datetime(df['End_Date'], format='%d/%m/%Y')

# fill in missing end dates (after parsing, so the column holds only timestamps)
max_date = pd.Timestamp('2017-06-30')
df['End_Date'] = df['End_Date'].fillna(max_date)

print(get_counts(df, 'y'))

The last line prints this for handle y:

            y
2017-06-14  1
2017-06-15  1
2017-06-16  2
2017-06-17  2
2017-06-18  2
2017-06-19  2
2017-06-20  2
2017-06-21  2
2017-06-22  2
2017-06-23  2
2017-06-24  2
2017-06-25  2
2017-06-26  2
2017-06-27  2
2017-06-28  2
2017-06-29  1
2017-06-30  1

You can plot this dataframe with your preferred package.
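For example, a minimal sketch with matplotlib through pandas' own plot method might look like the following. The counts frame is rebuilt by hand here as a stand-in for the output of get_counts(df, 'y'), and the axis labels, the step drawing style, and the output file name are all my own choices, not anything from the question:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove to display a window instead
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for get_counts(df, 'y'): 1 follower, then 2, then 1 again
idx = pd.date_range("2017-06-14", "2017-06-30")
counts = pd.DataFrame({"y": [1, 1] + [2] * 13 + [1, 1]}, index=idx)

# A step line suits counts that only change on whole days
ax = counts.plot(drawstyle="steps-post")
ax.set_xlabel("Date")
ax.set_ylabel("Followers")
ax.set_title("Daily follower count")

plt.tight_layout()
plt.savefig("follower_counts.png")  # hypothetical output file name
```

If the frame has one column per handle, the same call draws one line per handle, which covers the "more than one handle" case.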

Hi Niels, amazing. Thank you for your help, I really couldn't wrap my head around it. Is there a way to make each handle its own column? — Matt M, Jul 01 '17 at 09:41
@MattM I can get you started on that. I just made an edit so that the column name is the handle. If you do `pd.concat([get_counts(df, 'x'), get_counts(df, 'y')], axis=1)`, that merges `x` and `y` together in the same dataframe for the simple, two-handle case. I leave it up to you to loop over the handles for the _n_ case! — Niels Joaquin, Jul 01 '17 at 15:20
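One possible way to do the n-handle case from that comment is to loop over the unique handles and concatenate the results. This is only a sketch: the dataframe is rebuilt inline with dates already parsed and end dates already filled, and filling the gaps with 0 assumes that a day with no recorded range means no followers for that handle.

```python
from collections import Counter

import pandas as pd


def get_counts(df, handle):
    """Daily follower counts for one handle (same logic as in the answer)."""
    df_handle = df[df["Handles"] == handle]
    all_dates = []
    for _, row in df_handle.iterrows():
        all_dates.extend(
            pd.date_range(row["Start_Date"], row["End_Date"]).tolist()
        )
    return (
        pd.DataFrame.from_dict(Counter(all_dates), orient="index")
        .rename(columns={0: handle})
        .sort_index()
    )


# The question's data, with dates parsed and missing end dates set to 2017-06-30
df = pd.DataFrame({
    "Follower_ID": [100, 101, 201, 100],
    "Handles": ["x", "x", "y", "y"],
    "Start_Date": pd.to_datetime(
        ["2017-05-30", "2017-04-21", "2017-06-14", "2017-06-16"]),
    "End_Date": pd.to_datetime(
        ["2017-06-30", "2017-05-29", "2017-06-30", "2017-06-28"]),
})

# One column per handle; rows are the union of all dates seen
all_counts = (
    pd.concat(
        [get_counts(df, handle) for handle in df["Handles"].unique()],
        axis=1,
    )
    .sort_index()
    .fillna(0)      # assumption: no recorded range on a day means zero followers
    .astype(int)
)

print(all_counts.tail())
```

The resulting frame has a date index and one integer column per handle, so it can be plotted or filtered by column directly.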
