
I have a .csv file with daily data, as follows:

some 19 more header rows
Werte
01.01.1971 07:00:00   ;     0.0
02.01.1971 07:00:00   ;     1.2
...and so on

which I import with:

RainD=pd.read_csv('filename.csv',skiprows=20,sep=';',dayfirst=True,parse_dates=True)

As a result, I get

In [416]: RainD
Out[416]: 
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 14976 entries, 1971-01-01 07:00:00 to 2012-01-01 07:00:00
Data columns:
Werte:    14976  non-null values
dtypes: object(1)

So it's a DataFrame, but maybe a TimeSeries would be the right way? But how do I import it as such? The pandas documentation lists a dtype option in read_csv, but gives no info on what I can/should specify.

But on the other hand, the DatetimeIndex: line suggests to me that pandas is quite aware that it is dealing with dates here, yet it still makes it a DataFrame. And with that, something like RainD['1971'] just results in a KeyError: u'no item named 1971'.

I have the feeling that I am just missing something really obvious, since time series analysis seems to be THE thing pandas was made for.

Another first idea of mine was that pandas might get confused by the fact that the dates are written in the correct (i.e. dd.mm.yyyy ;) ) way, but RainD.head() shows me that it could digest that just fine.

Regards JC

JC_CL
  • The reason your index selection fails is that you are trying to access an index label or column with the string `'1971'`, which will not work. If you want to filter the df to find index values where the year is `1971`, then the following would work: `df[df.index.year == 1971]` – EdChum Jan 19 '15 at 13:29
  • You may be confusing the indexing semantics with a [time series indexing](http://pandas.pydata.org/pandas-docs/stable/timeseries.html#datetimeindex-partial-string-indexing) which is entirely different – EdChum Jan 19 '15 at 13:31
  • Yes, I am still confusing a lot of things ;) … But for now, the `df[df.index.year == 1971]` did reduce my confusion quite a lot! Thanks! But maybe one additional thing, before I consider this answered: What then, in this case, is the difference between a Dataframe and a Timeseries? Or asked another way: is this the correct way to do it, or rather a crude hack, that'll soon cause me to run into other issues? – JC_CL Jan 19 '15 at 13:45
  • So did my comment answer your question? – EdChum Jan 19 '15 at 13:46
  • You get a DataFrame because `read_csv` *always* returns a DataFrame. If you want it as a Series, you can select the one column with `RainD['Werte']` (and by the way, a TimeSeries is not something special, it is just a (not used anymore) alias for a Series with a DatetimeIndex). – joris Jan 19 '15 at 14:04
  • Normally `RainD['1971']` *should* work (that is called partial string indexing, see http://pandas.pydata.org/pandas-docs/stable/timeseries.html#datetimeindex-partial-string-indexing). What version of pandas are you using? Does ``RainD.loc['1971']`` work? – joris Jan 19 '15 at 14:09
  • `RainD.loc['1971']` results in `AttributeError: 'DataFrame' object has no attribute 'loc'`. Seems like Fedora still has not updated pandas. Just checked, I am on 0.10. So much for Fedora being cutting edge and all that… – JC_CL Jan 19 '15 at 14:26
  • Ah, that clarifies. You should really try to update your pandas version. – joris Jan 19 '15 at 15:23

1 Answer


EdChum's df[df.index.year == 1971] solved my issue.

I might have some other issues (i.e. an outdated version of pandas), but for now I can continue working.
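For anyone landing here later, a minimal sketch of the options mentioned in the comments. The sample data, the column name `date`, and the variable names are made up for illustration (one header row instead of the 20 in my real file), and the `.loc` variant assumes a pandas version recent enough to have `.loc`, which my 0.10 install is not.

```python
import io
import pandas as pd

# Hypothetical sample mimicking the file layout (assumption: one header row).
sample = io.StringIO(
    "Werte\n"
    "01.01.1971 07:00:00   ;     0.0\n"
    "02.01.1971 07:00:00   ;     1.2\n"
    "01.01.1972 07:00:00   ;     3.4\n"
)

# Read the timestamps as plain strings first, then parse them explicitly,
# so the padding around the ';' separator cannot confuse the date parser.
rain = pd.read_csv(sample, sep=";", skiprows=1, header=None,
                   names=["date", "Werte"], skipinitialspace=True)
dates = pd.to_datetime(rain.pop("date").str.strip(),
                       format="%d.%m.%Y %H:%M:%S")
rain.index = pd.DatetimeIndex(dates)

# Boolean mask on the index year (EdChum's suggestion).
rain_1971 = rain[rain.index.year == 1971]

# Selecting the single column gives a Series with a DatetimeIndex (joris' comment).
werte = rain["Werte"]

# Partial string indexing; needs a pandas version that has .loc.
rain_1971_loc = rain.loc["1971"]
```

On an up-to-date pandas the mask and the `.loc["1971"]` selection should return the same rows, so which one to use is mostly a matter of taste.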

JC_CL