My data
I am working with the following data from the National Centers for Environmental Information (NCEI), obtained simply by using pandas' read_html():
import pandas as pd

df = pd.read_html('https://www.ncei.noaa.gov/access/monitoring/climate-at-a-glance/statewide/rankings/1/tavg/202302')[0]
df.head()
Period Value 1901-2000Mean Anomaly Rank(1895-2023) Warmest/CoolestSince Record
0 February 2023 1-Month 56.6°F(13.7°C) 48.0°F(8.9°C) 8.6°F(4.8°C) 126th Coolest Coolest since:2022 1895
1 February 2023 1-Month 56.6°F(13.7°C) 48.0°F(8.9°C) 8.6°F(4.8°C) 4th Warmest Warmest since:2018 2018
2 Jan–Feb 2023 2-Month 54.0°F(12.2°C) 46.5°F(8.1°C) 7.5°F(4.1°C) 127th Coolest Coolest since:2022 1978
3 Jan–Feb 2023 2-Month 54.0°F(12.2°C) 46.5°F(8.1°C) 7.5°F(4.1°C) 3rd Warmest Warmest since:2017 1950
My problem/desired output
The Rank(1895-2023) and Warmest/CoolestSince columns have two "levels" to them, meaning each row is duplicated to account for each level. My desired output would be the following:
Period Value 1901-2000Mean Anomaly CoolestRank(1895-2023) CoolestSince CoolestRecord WarmestRank(1895-2023) WarmestSince WarmestRecord
0 February 2023 1-Month 56.6°F(13.7°C) 48.0°F(8.9°C) 8.6°F(4.8°C) 126th Coolest Coolest since:2022 1895 4th Warmest Warmest since:2018 2018
1 Jan–Feb 2023 2-Month 54.0°F(12.2°C) 46.5°F(8.1°C) 7.5°F(4.1°C) 127th Coolest Coolest since:2022 1978 3rd Warmest Warmest since:2017 1950
So basically, I'd like to combine all unique Period rows and add new columns to take care of the two columns with multiple levels.
What I have tried
My initial thought was something not very practical: what if I took the Rank(1895-2023) and Warmest/CoolestSince values from each odd row (the second of each duplicated Period) and used those to populate new columns, then dropped each odd row? Something like this could be done, but probably isn't very efficient, pythonic, etc.
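For what it's worth, a rough sketch of that even/odd pairing idea, on a small hand-built stand-in frame (the real one comes from read_html; the columns and values below are copied from the preview above):

```python
import pandas as pd

# Stand-in for the NCEI table: two rows per Period (coolest level first, then warmest).
df = pd.DataFrame({
    'Period': ['February 2023 1-Month', 'February 2023 1-Month',
               'Jan-Feb 2023 2-Month', 'Jan-Feb 2023 2-Month'],
    'Rank(1895-2023)': ['126th Coolest', '4th Warmest',
                        '127th Coolest', '3rd Warmest'],
    'Warmest/CoolestSince': ['Coolest since:2022', 'Warmest since:2018',
                             'Coolest since:2022', 'Warmest since:2017'],
    'Record': [1895, 2018, 1978, 1950],
})

# Even rows hold the "Coolest" level, odd rows the "Warmest" level.
coolest = df.iloc[::2].reset_index(drop=True)
warmest = df.iloc[1::2].reset_index(drop=True)

# Keep the coolest rows, rename their level-specific columns...
out = coolest.rename(columns={
    'Rank(1895-2023)': 'CoolestRank(1895-2023)',
    'Warmest/CoolestSince': 'CoolestSince',
    'Record': 'CoolestRecord',
})
# ...and bolt the warmest-row values on as new columns.
out[['WarmestRank(1895-2023)', 'WarmestSince', 'WarmestRecord']] = (
    warmest[['Rank(1895-2023)', 'Warmest/CoolestSince', 'Record']].to_numpy()
)
```

It does produce one row per Period, but it leans on the strict even/odd row ordering, which feels fragile.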
Then I thought, since we have groups of unique Period values, maybe I could do some sort of groupby magic? Since the warmth-related columns are always the lower level, I can do something like:
(df
.groupby('Period')
.agg(
Value=('Value','first'),
Anomaly=('Anomaly','first'),
CoolestSince=('Warmest/CoolestSince','first'),
CoolestRecord=('Record','first'),
WarmestSince=('Warmest/CoolestSince','last'),
WarmestRecord=('Record','last'),
)
)
to achieve my desired output. However, this type of named aggregation assigns output columns via keyword arguments, so it doesn't work for column names that aren't valid Python identifiers. I get an error when I try to add the other columns:
(df
.groupby('Period')
.agg(
Value=('Value','first'),
1901-2000Mean=('1901-2000Mean','first'),
Anomaly=('Anomaly','first'),
CoolestRank(1895-2023)=('Rank(1895-2023)','first'),
CoolestSince=('Warmest/CoolestSince','first'),
CoolestRecord=('Record','first'),
WarmestRank(1895-2023)=('Rank(1895-2023)','last'),
WarmestSince=('Warmest/CoolestSince','last'),
WarmestRecord=('Record','last'),
)
)
SyntaxError: positional argument follows keyword argument
Is there a way to assign groupby output columns via strings? I tried formatting it into a dictionary, but I couldn't seem to get it to work. Is there an alternative solution that is more efficient? I am curious, as what I am on to seems like a pretty solid approach.
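I suspect the dictionary idea was close: the identifier restriction comes from Python's call syntax, not from pandas, so **-unpacking a dict of string keys should sidestep it. A sketch on a hand-built stand-in frame (trimmed to a few columns; the real frame comes from read_html):

```python
import pandas as pd

# Stand-in frame with the same shape as the NCEI table preview.
df = pd.DataFrame({
    'Period': ['February 2023 1-Month', 'February 2023 1-Month',
               'Jan-Feb 2023 2-Month', 'Jan-Feb 2023 2-Month'],
    '1901-2000Mean': ['48.0°F(8.9°C)', '48.0°F(8.9°C)',
                      '46.5°F(8.1°C)', '46.5°F(8.1°C)'],
    'Rank(1895-2023)': ['126th Coolest', '4th Warmest',
                        '127th Coolest', '3rd Warmest'],
    'Warmest/CoolestSince': ['Coolest since:2022', 'Warmest since:2018',
                             'Coolest since:2022', 'Warmest since:2017'],
    'Record': [1895, 2018, 1978, 1950],
})

# Keyword names in a call must be identifiers, but **-unpacking a dict
# lets the output column names be arbitrary strings.
out = df.groupby('Period', sort=False).agg(**{
    '1901-2000Mean': ('1901-2000Mean', 'first'),
    'CoolestRank(1895-2023)': ('Rank(1895-2023)', 'first'),
    'CoolestSince': ('Warmest/CoolestSince', 'first'),
    'CoolestRecord': ('Record', 'first'),
    'WarmestRank(1895-2023)': ('Rank(1895-2023)', 'last'),
    'WarmestSince': ('Warmest/CoolestSince', 'last'),
    'WarmestRecord': ('Record', 'last'),
})
```

Note sort=False, so the groups stay in first-seen order rather than being sorted alphabetically by Period.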
NOTE: Although not listed in the data preview of this question, occasionally there is a third level to a column for ties between years. These can be ignored, as I am just dropping those rows with df.loc[~df.Value.str.contains('Ties')].
Thanks!