I want to understand how the COVID-19 pandemic is affecting supply-chain industries such as meat processing plants. I retrieved the NYT county-level COVID data and statistical data from a food agency, and I want to see how COVID cases are surging in counties where major food processing plants are located. I have the data prepared and ready for rendering a time series chart, but I am having trouble getting the right plotting data: the resulting plot does not match the expected output. Here is what I have tried so far:

my attempt:

The final aggregated COVID time series data that I am interested in is in this gist. Here is my current attempt:

import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import seaborn as sns
from datetime import timedelta, datetime

df = pd.read_csv("https://gist.githubusercontent.com/jerry-shad/7eb2dd4ac75034fcb50ff5549f2e5e21/raw/477c07446a8715f043c9b1ba703a03b2f913bdbf/covid_tsdf.csv")
df.drop(['Unnamed: 0', 'fips', 'non-fed-slaughter', 'fed-slaughter', 'total-slaughter', 'mcd-asl'], axis=1, inplace=True)
for ct in df['county_state'].unique():
    dd = df.groupby([ct, 'date', 'est'])['num-emp'].sum().unstack().reset_index()
    p = sns.lineplot('date', 'values', data=dd, hue='packer', markers=markers, style='cats', ax=axes[j, 0])
    p.set_xlim(data.date.min() - timedelta(days=60), data.date.max() + timedelta(days=60))
    plt.legend(bbox_to_anchor=(1.04, 0.5), loc="center left", borderaxespad=0)

But it looks like I made the wrong aggregation above, and this attempt does not work. My intention is: if a company has multiple establishments (`est`), I need to sum their `num-emp` (number of employees), and then compute the ratio `new_deaths` / `num-emp` over time. Basically, I want to track, at least approximately, whether a company's staff are affected by COVID. I am not sure what the correct way of doing this with matplotlib in Python is. Can anyone suggest a correction to make this right?
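
To make the intended aggregation concrete, here is a minimal sketch on made-up rows (not the gist data). It assumes that `new_deaths` is a county-level value repeated on every establishment row and that `num-emp` does not change over time; the helper name `deaths_per_employee` and the output columns `num_emp` and `deaths_per_emp` are hypothetical.

```python
import pandas as pd

# Made-up rows for illustration only; column names follow the question's CSV.
toy = pd.DataFrame({
    "county_state": ["A_X"] * 6,
    "date": ["2020-08-01", "2020-08-01", "2020-08-01",
             "2020-08-02", "2020-08-02", "2020-08-02"],
    "packer": ["P1", "P1", "P2", "P1", "P1", "P2"],
    "est": ["e1", "e2", "e3", "e1", "e2", "e3"],
    "num-emp": [100, 300, 200, 100, 300, 200],
    # Assumption: the county-level value is repeated on every row of that county/date.
    "new_deaths": [2, 2, 2, 4, 4, 4],
})
toy["date"] = pd.to_datetime(toy["date"])


def deaths_per_employee(df):
    """Sum num-emp over each company's establishments, then divide new_deaths by it."""
    out = (
        df.groupby(["county_state", "packer", "date"], as_index=False)
          .agg(num_emp=("num-emp", "sum"),          # total employees per company
               new_deaths=("new_deaths", "first"))  # county value, not summed
    )
    out["deaths_per_emp"] = out["new_deaths"] / out["num_emp"]
    return out


ratio = deaths_per_employee(toy)
print(ratio)
```

The result is in long format (one row per county, company and date), which is the shape that `sns.lineplot(x='date', y='deaths_per_emp', hue='packer', data=ratio)` expects.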

second attempt

I got some inspiration from a recent COVID-19-related post, so this is another way of trying to do what I want in matplotlib. I aggregated the data this way, together with a custom plotting helper function:

df = pd.read_csv("https://gist.githubusercontent.com/jerry-shad/7eb2dd4ac75034fcb50ff5549f2e5e21/raw/477c07446a8715f043c9b1ba703a03b2f913bdbf/covid_tsdf.csv")
ds_states = df.groupby('county_state').sum().rename({'county_state': 'location'})
ds_states['mortality'] = ds_states['deaths'] / ds_states['popestimate2019'] * 1_000_000
ds_states['daily_mortality'] = ds_states['new_deaths'] / ds_states['popestimate2019'] * 1_000_000
ds_states['daily_mortality7'] = ds_states['daily_mortality'].rolling({'time': 7}).mean()
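
Note that `rolling({'time': 7})` is xarray-style syntax, and that summing by `county_state` collapses the date axis before any rolling window can be applied. A minimal pandas sketch of a per-county 7-day average, on made-up numbers and with a hypothetical helper name `add_daily_mortality7`, might look like this:

```python
import pandas as pd

# Made-up numbers for illustration; column names mirror the question's CSV.
toy = pd.DataFrame({
    "county_state": ["A_X"] * 8 + ["B_Y"] * 8,
    "date": list(pd.date_range("2020-08-01", periods=8)) * 2,
    "new_deaths": [1, 2, 3, 4, 5, 6, 7, 8] + [0] * 8,
    "popestimate2019": [1_000_000] * 8 + [500_000] * 8,
})


def add_daily_mortality7(df, window=7):
    """Add daily deaths per million and its rolling mean, computed within each county."""
    out = df.sort_values(["county_state", "date"]).copy()
    out["daily_mortality"] = out["new_deaths"] / out["popestimate2019"] * 1_000_000
    # transform keeps the original row index, so the result aligns with `out`
    out["daily_mortality7"] = (
        out.groupby("county_state")["daily_mortality"]
           .transform(lambda s: s.rolling(window).mean())
    )
    return out


mort = add_daily_mortality7(toy)
print(mort.tail())
```

The first six rows of each county are NaN because a 7-day window is not yet full there.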

Then this is the plotting helper function that I came up with:

def subplots(*args, tick_right=True, **kwargs):
    f, ax = plt.subplots(*args, **kwargs)

    if tick_right:
        ax.yaxis.tick_right()
        ax.yaxis.set_label_position("right")
    ax.yaxis.grid(color="lightgrey", linewidth=0.5)
    ax.xaxis.grid(color="lightgrey", linewidth=0.5)
    ax.xaxis.set_tick_params(labelsize=14)
    return f, ax

_, ax1 = subplots(subplot_kw={'xlim': XLIM})
ax1.set(title=f'US covid tracking in meat processing plants by county - Linear scale')
ax2 = ax1.twinx()

But I got stuck again on how to make this right. My essential goal is to see how much meat processing companies are affected by COVID, because if a company's workers get infected, its performance will drop. I want to do EDA that shows this sort of information visually. Can anyone suggest possible ways of doing this with matplotlib? I am open to any feasible EDA approach that makes this question more realistic or meaningful.

desired output

I am thinking of producing EDA output something like the one below:

[image: example of the desired chart]

What I want to see is how, at the county level, every company's performance varies because of COVID. Can anyone point me to a way to achieve this kind of EDA output? Thanks.

update

Since the kind of EDA I want to make is not quite solid in my mind, I am open to hearing about any EDA that fits the context of the problem raised above. Thanks in advance!

kim

1 Answer

We have graphed the cumulative number of cases and the 7-day moving average of new cases for one county only. The process involved adding the moving average column to the data frame extracted for that county and drawing a two-axis graph.

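ct = 'Maricopa_Arizona'
# Keep only the rows for the chosen county and aggregate per date
dd = df[df['county_state'] == ct].groupby(['county_state', 'date', 'est'])[['cases','new_cases']].sum().unstack().reset_index()
# Flatten the column index (this assumes a single `est` for the county)
dd.columns= ['county_state','date', 'cases', 'new_cases']
dd['date'] = pd.to_datetime(dd['date'])
# 7-day moving average of new cases
dd['rol7'] = dd['new_cases'].rolling(7).mean()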

dd.tail()
county_state    date    cases   new_cases   exp7    rol7
216 Maricopa_Arizona    2020-08-29  133389.0    403.0   306.746942  243.428571
217 Maricopa_Arizona    2020-08-30  133641.0    252.0   293.060207  264.857143
218 Maricopa_Arizona    2020-08-31  133728.0    87.0    241.545155  252.285714
219 Maricopa_Arizona    2020-09-01  134004.0    276.0   250.158866  244.857143
220 Maricopa_Arizona    2020-09-02  134346.0    342.0   273.119150  273.142857

fig = plt.figure(figsize=(8,6),dpi=144)
ax = fig.add_subplot(111)

colors = sns.color_palette()
ax2 = ax.twinx()

ax = sns.lineplot(x='date', y='rol7', data=dd, color=colors[1], ax=ax)
ax2 = sns.lineplot(x='date', y='cases', data=dd, color=colors[0], ax=ax2)

ax.set_xlim(dd.date.min(), dd.date.max())
fig.legend(['rolling7','cases'],loc="upper left", bbox_to_anchor=(0.01, 0.95), bbox_transform=ax.transAxes)
ax.grid(axis='both', lw=0.5)

locator = mdates.AutoDateLocator()
ax.xaxis.set_major_locator(locator)

fig.autofmt_xdate(rotation=45)
ax.set(title=f'US covid tracking in meat processing plants by county - Linear scale')
plt.show()

[image: resulting two-axis plot of the 7-day rolling average and cumulative cases]
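
The same two-axis pattern can be repeated for several counties in a loop, relating new cases to `num-emp` as the question asks. The following is only a sketch on made-up numbers: the helper name `plot_counties`, the one-panel-per-county layout and the "new cases per employee" series are assumptions, not something taken from the gist data.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up numbers for illustration; column names mirror the question's CSV.
toy = pd.DataFrame({
    "county_state": ["A_X"] * 10 + ["B_Y"] * 10,
    "date": list(pd.date_range("2020-08-01", periods=10)) * 2,
    "new_cases": list(range(10, 110, 10)) + list(range(5, 55, 5)),
    "num-emp": [400] * 10 + [200] * 10,  # assumed already summed per county
})


def plot_counties(df, window=7):
    """Draw one two-axis panel per county: rolling new cases and the same per employee."""
    counties = df["county_state"].unique()
    fig, axes = plt.subplots(len(counties), 1, sharex=True,
                             figsize=(8, 3 * len(counties)), squeeze=False)
    for ax, ct in zip(axes[:, 0], counties):
        dd = df[df["county_state"] == ct].sort_values("date")
        rol = dd["new_cases"].rolling(window).mean()
        ax.plot(dd["date"], rol, color="tab:orange")
        ax.set_ylabel(f"new cases ({window}-day mean)")
        ax.set_title(ct)
        # Second y-axis: the rolling mean scaled by the number of employees
        ax2 = ax.twinx()
        ax2.plot(dd["date"], rol / dd["num-emp"], color="tab:blue", linestyle="--")
        ax2.set_ylabel("new cases per employee")
    fig.autofmt_xdate(rotation=45)
    return fig, axes


fig, axes = plot_counties(toy)
fig.savefig("counties.png")
```

Keeping one panel per county avoids overlapping lines; with many counties, it may be easier to plot only a selected subset.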

r-beginners
  • Thanks. Is there any way to see a graph where new case numbers are related to `num-emp` at the meat processing plants in each county? I mean, can we make a graph that describes the relationship between surges in new cases and `num-emp` or population over time? Any thoughts on that? Thanks! – kim Sep 10 '20 at 14:49
  • I think we can replace the two series in my answer's graph with `num-emp` and new infections, and draw the counties we need in a loop. At that point, it would be better to cluster them by company characteristics, etc., and split them into multiple graphs to keep the lines from overlapping and becoming hard to read. – r-beginners Sep 11 '20 at 02:27
  • It's difficult to communicate in the comments, so I suggest you post a new question as a second step. I think you'll get more answers faster that way. – r-beginners Sep 11 '20 at 03:15