1

I have a csv file that I'm reading, cleaning, and analysing with pandas. I select the relevant data and then create a list of means for each column (which I then use as the new data for a new dataframe). Everything seems to work - however, when I double-checked the data against the values of the means/averages "manually" computed in Excel, the pandas and Excel values were different.

The csv file I'm using can be found here: https://drive.google.com/open?id=1TPczQoh1oS-RaRpepd4evxM919699Dss. The original file is from https://www.metoffice.gov.uk/pub/data/weather/uk/climate/stationdata/aberporthdata.txt ; the first link is just the cleaned and prepped version.

months = [3,4,5]
l = []
for j, station in enumerate(stations):
    df = pd.read_csv('/Users/Ji/Documents/' + station + 'data_clean.csv')
    df = df.drop('empty', axis=1).replace('---', np.nan)
    df = df.loc[df['mm'].isin(months)]
    df['station'] = station

    df = df.astype({'mm': np.int32,'tmax': np.float32,'tmin': np.float32,'af': np.float32,'rain': np.float32,'sun': np.float32, 'station': np.str})
    df = df.drop(['mm','yyyy'], axis=1)

    row = [0]*6
    for i, col in enumerate(list(df)):
        if col == 'station':
            row[5] = station
            continue
        row[i] = df[col].mean(skipna=True)

    l.insert(j, row)

df_means = pd.DataFrame(data=l, columns=list(df))

The means I got for this specific file in pandas were:

        tmax      tmin         af        rain        sun          station  
0   7.582970  3.190000   4.924325   84.921890  61.074783        aberporth

And the averages I got in Excel were:

tmax            tmin            af              rain            sun
12.38645949     7.193654267     1.576294278     75.78479784     129.2139254

I'd appreciate any ideas or explanations as to why this is and how to fix it!

jihaneaz
  • 25
  • 3
  • 4
    Post a [mcve]. Links to data are not helpful. – cs95 Jun 11 '18 at 14:54
  • @coldspeed I'm not sure what you mean - I explained it as well and succinctly as I could and copied the results I got. I only included a link to my data just in case someone wished to test it, since the problem involves both pandas and Excel. – jihaneaz Jun 11 '18 at 15:02
  • Maybe as a double-check, save `df_means` to an excel file and check the averages. – A.Kot Jun 11 '18 at 18:05
  • Different processing rules of the missing values I venture – Severin Pappadeux Jun 11 '18 at 18:20
  • This is an unintentional behavior of pandas. See following: https://stackoverflow.com/questions/55955242/the-result-of-dataframe-mean-is-incorrect – Marko Tankosic Oct 02 '19 at 05:58

1 Answers1

0

This outputs the same values I got from Excel.

import pandas as pd
import numpy as np

df = pd.read_csv('C:\orig.csv')
df = df.drop('empty', axis=1)
df.replace('---', np.nan, inplace=True)

for col in df.columns:
    if df[col].dtype == 'object':
        df[col] = pd.to_numeric(df[col], downcast='float')

mean_vals = df.mean()

mean_vals
Out[44]: 
yyyy    1979.209903
mm         6.481163
tmax      12.411418
tmin       7.188440
af         1.578019
rain      75.767384
sun      129.306442
dtype: float64

# Output from Excel
1979.21  6.48   12.41   7.19    1.58    75.77   129.31
KRKirov
  • 3,854
  • 2
  • 16
  • 20