1

I think this one is simple but I am missing how to get the data out I want.

I have two fields in a dataframe called ClosedDate and Is_Article_Linked. I want to group this by ClosedDate's Month and then tally only only when Is_Article_Linked is true but still keep the total records.

So for example my code looks like this

data = [
  [ ('ClosedDate', '2021-01-18T12:46:38.000+0000'), ('Is_Article_Linked', True) ],
  [ ('ClosedDate', '2021-01-29T12:46:38.000+0000'), ('Is_Article_Linked', True) ],
  [ ('ClosedDate', '2021-01-11T12:46:38.000+0000'), ('Is_Article_Linked', False) ],
  [ ('ClosedDate', '2021-01-03T12:46:38.000+0000'), ('Is_Article_Linked', False) ],
  [ ('ClosedDate', '2021-01-12T12:46:38.000+0000'), ('Is_Article_Linked', True) ]
]

df_ld = pd.DataFrame(data, columns=['ClosedDate', 'Is_Article_Linked'])
# Formatting the date field
df_ld['ClosedDate'] = pd.to_datetime(df_ld['ClosedDate'],  format="%y-%m-%d", exact=False)
# Grouping by Month
relevant_links = df_ld.resample('M', on='ClosedDate').count()
# Renaming the index
relevant_links.index = relevant_links.index.rename('Date')
# Outputting to  a dictionary
link_accuracy_by_month = relevant_links.reset_index().to_dict('records')
print(link_accuracy_by_month)

What this outputs is [{'Date': Timestamp('2021-01-31 00:00:00'), 'ClosedDate': 5, 'Is_Article_Linked': 5}]

What i am expecting to see is [{'Date': Timestamp('2021-01-31 00:00:00'), 'ClosedDate': 5, 'Is_Article_Linked': 3}]

So ClosedDate is my total records and Is_Article_Linked is only tallying the True's. Currently it is tallying all of them because i assume it is seeing a value in there.

Is there anyway to get the results I am looking for? I believe my answer is in the resample but i just can't figure out how to get it to count just the true's.

Thanks in advance!

Chris Wesseling
  • 6,226
  • 2
  • 36
  • 72
jAC
  • 3,155
  • 3
  • 18
  • 29

1 Answers1

1

Assuming your provided data is not correctly formatted so the dataframe actually looks like:

  ClosedDate  Is_Article_Linked
0 2021-01-18               True
1 2021-01-29               True
2 2021-01-11              False
3 2021-01-03              False
4 2021-01-12               True

You can do this:

df_ld.resample("M", on = "ClosedDate")\
    .agg({"ClosedDate": "count", "Is_Article_Linked":"sum"})\
    .rename_axis("Date")\
    .reset_index()\
    .to_dict("records")

Output

[{'Date': Timestamp('2021-01-31 00:00:00'),
  'ClosedDate': 5,
  'Is_Article_Linked': 3}]
Pablo C
  • 4,661
  • 2
  • 8
  • 24
  • 1
    If this is the shape of the data, `df_ld = pd.DataFrame((dict(r) for r in data))` gives OP the DataFrame that looks like your answer. – Chris Wesseling Apr 21 '21 at 21:53
  • @Pablo C this is exactly what i needed and i learned a few things. I did not realize i could append all the options into one command so bonus points for that as well :) – jAC Apr 22 '21 at 00:19
  • 1
    @jAC nice! Glad to be helpful :) – Pablo C Apr 22 '21 at 00:39
  • @ChrisWesseling your solution also works in case anyone is insterested. – jAC Apr 22 '21 at 00:59