
My data

I am working with the following data from the National Centers for Environmental Information (NCEI) - obtained simply by using pandas' read_html().

df = pd.read_html('https://www.ncei.noaa.gov/access/monitoring/climate-at-a-glance/statewide/rankings/1/tavg/202302')[0]

df.head()


                   Period            Value  1901-2000Mean   Anomaly Rank(1895-2023) Warmest/CoolestSince    Record
0   February 2023 1-Month   56.6°F(13.7°C)  48.0°F(8.9°C)   8.6°F(4.8°C)    126th Coolest   Coolest since:2022  1895
1   February 2023 1-Month   56.6°F(13.7°C)  48.0°F(8.9°C)   8.6°F(4.8°C)    4th Warmest Warmest since:2018  2018
2   Jan–Feb 2023 2-Month    54.0°F(12.2°C)  46.5°F(8.1°C)   7.5°F(4.1°C)    127th Coolest   Coolest since:2022  1978
3   Jan–Feb 2023 2-Month    54.0°F(12.2°C)  46.5°F(8.1°C)   7.5°F(4.1°C)    3rd Warmest Warmest since:2017  1950

My problem/desired output

The Rank(1895-2023) and Warmest/CoolestSince columns have two "levels" to them, meaning each row is duplicated to account for each level. My desired output would be the following:

                   Period            Value  1901-2000Mean   Anomaly      CoolestRank(1895-2023) CoolestSince    CoolestRecord WarmestRank(1895-2023)WarmestSince WarmestRecord
0   February 2023 1-Month   56.6°F(13.7°C)  48.0°F(8.9°C)   8.6°F(4.8°C)        126th Coolest   Coolest since:2022  1895    4th Warmest       Warmest since:2018    2018
1   Jan–Feb 2023 2-Month    54.0°F(12.2°C)  46.5°F(8.1°C)   7.5°F(4.1°C)        127th Coolest   Coolest since:2022  1978    3rd Warmest       Warmest since:2017    1950

So basically, I'd like to combine all unique Period rows and add new columns to take care of the two columns with multiple levels.

What I have tried

My initial thought was something not very practical: what if I took the two Rank(1895-2023) and Warmest/CoolestSince values from each odd row (the second of each duplicated Period), used those to populate two new columns, and then dropped each odd row? Something like this could be done, but probably isn't very efficient or pythonic.
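For what it's worth, that idea can be sketched like this (using a toy frame standing in for the scraped NCEI table, since the real one comes from read_html()):

```python
import pandas as pd

# Toy stand-in for the scraped table: two rows per Period,
# coolest stats on even rows, warmest stats on odd rows.
df = pd.DataFrame({
    'Period': ['February 2023 1-Month', 'February 2023 1-Month',
               'Jan-Feb 2023 2-Month', 'Jan-Feb 2023 2-Month'],
    'Rank(1895-2023)': ['126th Coolest', '4th Warmest',
                        '127th Coolest', '3rd Warmest'],
    'Warmest/CoolestSince': ['Coolest since:2022', 'Warmest since:2018',
                             'Coolest since:2022', 'Warmest since:2017'],
})

# Split even (coolest) and odd (warmest) rows, then bolt the warmest
# values onto the coolest rows as new columns.
coolest = df.iloc[::2].reset_index(drop=True)
warmest = df.iloc[1::2].reset_index(drop=True)
out = coolest.rename(columns={'Rank(1895-2023)': 'CoolestRank(1895-2023)',
                              'Warmest/CoolestSince': 'CoolestSince'})
out['WarmestRank(1895-2023)'] = warmest['Rank(1895-2023)']
out['WarmestSince'] = warmest['Warmest/CoolestSince']
```

It does work, but it silently relies on the coolest/warmest rows always alternating in a fixed order, which is why a groupby felt safer.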

Then I thought: since we have groups of unique Period values, maybe I could do some sort of groupby magic? Since the warmth-related rows always come second within each group, I can do something like:

(df
 .groupby('Period')
 .agg(
     Value=('Value','first'),
     Anomaly=('Anomaly','first'),
     CoolestSince=('Warmest/CoolestSince','first'),
     CoolestRecord=('Record','first'),
     WarmestSince=('Warmest/CoolestSince','last'),
     WarmestRecord=('Record','last'),
     )
 )

to achieve my desired output. However, named aggregation like this doesn't work for output columns whose names aren't valid Python identifiers. I get an error when I try to add the other columns:

(df
 .groupby('Period')
 .agg(
     Value=('Value','first'),
     1901-2000Mean=('1901-2000Mean','first'),
     Anomaly=('Anomaly','first'),
     CoolestRank(1895-2023)=('Rank(1895-2023)','first'),
     CoolestSince=('Warmest/CoolestSince','first'),
     CoolestRecord=('Record','first'),
     WarmestRank(1895-2023)=('Rank(1895-2023)','last'),
     WarmestSince=('Warmest/CoolestSince','last'),
     WarmestRecord=('Record','last'),
     )
 )


SyntaxError: positional argument follows keyword argument

Is there a way to assign groupby columns via strings? I tried formatting it into a dictionary but couldn't get it to work. Or is there an alternative solution that is more efficient? I'm curious, as what I'm onto seems like a pretty solid approach.
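For the record, one thing I found while experimenting: named aggregation is just keyword arguments, so the name/tuple pairs can be built in a plain dict and unpacked with **, which sidesteps the identifier restriction entirely (sketched on a toy frame in place of the scraped one):

```python
import pandas as pd

df = pd.DataFrame({
    'Period': ['A', 'A', 'B', 'B'],
    '1901-2000Mean': ['48.0', '48.0', '46.5', '46.5'],
    'Rank(1895-2023)': ['126th Coolest', '4th Warmest',
                        '127th Coolest', '3rd Warmest'],
})

# Keyword names passed via ** do not have to be valid Python
# identifiers, so names like '1901-2000Mean' are fine here.
spec = {
    '1901-2000Mean': ('1901-2000Mean', 'first'),
    'CoolestRank(1895-2023)': ('Rank(1895-2023)', 'first'),
    'WarmestRank(1895-2023)': ('Rank(1895-2023)', 'last'),
}
out = df.groupby('Period').agg(**spec)
```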

NOTE: Although not listed in the data preview of this question, occasionally there is a third level to a column for ties between years. These can be ignored, as I am just dropping these rows with df.loc[~df.Value.str.contains('Ties')].

Thanks!

bismo

1 Answer


Try this approach: pass a dict to df.agg(), using lists of (name, func) tuples for the columns that need two outputs, then drop the extra column level with df.droplevel(), and finally df.reset_index():

df = (df.groupby('Period')
        .agg({
            'Value': 'first',
            '1901-2000Mean': 'first',
            'Anomaly': 'first',
            'Rank(1895-2023)': [('CoolestRank(1895-2023)', 'first'), ('WarmestRank(1895-2023)', 'last')],
            'Warmest/CoolestSince': [('CoolestSince', 'first'), ('WarmestSince', 'last')],
            'Record': [('CoolestRecord', "first"), ('WarmestRecord', 'last')]
        }).droplevel(1, axis=1)
          .reset_index()
       )

# droplevel(1, axis=1) kept the original column names (now duplicated),
# so rename them by position
df.columns = ['Period', 'Value', '1901-2000Mean', 'Anomaly', 
              'CoolestRank(1895-2023)', 'WarmestRank(1895-2023)', 'CoolestSince', 
              'WarmestSince', 'CoolestRecord', 'WarmestRecord']

print(df)

                  Period           Value  1901-2000Mean       Anomaly CoolestRank(1895-2023) WarmestRank(1895-2023)        CoolestSince        WarmestSince  CoolestRecord  WarmestRecord
0  February 2023 1-Month  56.6°F(13.7°C)  48.0°F(8.9°C)  8.6°F(4.8°C)          126th Coolest            4th Warmest  Coolest since:2022  Warmest since:2018           1895           2018
1   Jan–Feb 2023 2-Month  54.0°F(12.2°C)  46.5°F(8.1°C)  7.5°F(4.1°C)          127th Coolest            3rd Warmest  Coolest since:2022  Warmest since:2017           1978           1950

If you want to re-order columns, add this snippet to the code:

df = df[['Period', 'Value', '1901-2000Mean', 'Anomaly', 
         'CoolestRank(1895-2023)', 'CoolestSince', 'CoolestRecord', 
         'WarmestRank(1895-2023)', 'WarmestSince', 'WarmestRecord']]

                  Period           Value  1901-2000Mean       Anomaly CoolestRank(1895-2023)        CoolestSince  CoolestRecord WarmestRank(1895-2023)        WarmestSince  WarmestRecord
0  February 2023 1-Month  56.6°F(13.7°C)  48.0°F(8.9°C)  8.6°F(4.8°C)          126th Coolest  Coolest since:2022           1895            4th Warmest  Warmest since:2018           2018
1   Jan–Feb 2023 2-Month  54.0°F(12.2°C)  46.5°F(8.1°C)  7.5°F(4.1°C)          127th Coolest  Coolest since:2022           1978            3rd Warmest  Warmest since:2017           1950
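A possible variant that skips the manual rename: if every entry in the dict uses (new_name, func) tuples, the custom names land on level 1 of the resulting MultiIndex, so dropping level 0 instead keeps them (again sketched on a toy frame in place of the scraped one):

```python
import pandas as pd

df = pd.DataFrame({
    'Period': ['A', 'A', 'B', 'B'],
    'Value': ['56.6', '56.6', '54.0', '54.0'],
    'Rank(1895-2023)': ['126th Coolest', '4th Warmest',
                        '127th Coolest', '3rd Warmest'],
})

out = (df.groupby('Period')
         .agg({
             'Value': [('Value', 'first')],
             'Rank(1895-2023)': [('CoolestRank(1895-2023)', 'first'),
                                 ('WarmestRank(1895-2023)', 'last')],
         })
         # level 0 holds the original names, level 1 the custom ones
         .droplevel(0, axis=1)
         .reset_index())
```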
Jamiu S.