
I have a data set which contains 12 years of weather data. For the first 10 years, the data was recorded per day. For the last two years, it is being recorded per week. I want to use this data in Python Pandas for analysis but I am a little lost on how to normalize it for use.

My thoughts

  1. Convert the first 10 years of data into weekly data as well, using averages. Might work, but so much data is lost in translation.
  2. Weekly data cannot be converted to per-day data.
  3. Ignore daily data - that is a huge loss
  4. Ignore weekly data - I lose more recent data.
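Option 1 can be sketched with pandas' `resample`, assuming the daily records sit in a `DataFrame` with a `DatetimeIndex` (the column name `temp`, the date range, and the random values below are made up purely for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical 10 years of daily records for illustration
idx = pd.date_range("2006-01-01", "2015-12-31", freq="D")
daily = pd.DataFrame(
    {"temp": np.random.default_rng(0).normal(15, 8, len(idx))},
    index=idx,
)

# Downsample the daily records to weekly averages, so they line up
# with the weekly records of the last two years
weekly = daily.resample("W").mean()
```

The two frames can then be concatenated into one weekly series covering all 12 years.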

Any ideas on this?

Guru
  • It depends: what are you trying to achieve? – Andy Hayden Oct 20 '17 at 01:29
  • I am going to use this data to create a prediction model. – Guru Oct 20 '17 at 01:33
  • This is really not a programming question. You should ask this on https://stats.stackexchange.com/ – DJK Oct 20 '17 at 01:58
  • If the current data is being recorded weekly, then the prediction model has to be weekly in order to do any kind of verification. To increase your sample size you need to convert the first 10 years to weekly data. What kind of data is it? I am also an atmospheric scientist. – BenT Oct 20 '17 at 02:23

1 Answer


First, you need to define what output you need; then deduce how to treat the input to get that output.

Regarding the daily data for the first 10 years, one possible option is to keep only one day per week. Sub-sampling does not always mean losing information, and does not always change the final result. It depends on the nature of the collected data: the speed at which it varies, the measurement error, and the noise.

Speed of variations: Refer to the Shannon sampling theorem to decide whether information is lost by sampling once a week instead of once a day. Given that for the last two years someone decided to sample only once a week, they may have observed that the data does not vary much from day to day, so that one sample per week carries enough information. That is a hint in favour of a final data set with one sample per week for the full 12 years. The exception is if they reduced the sampling rate for cost reasons, making a compromise between accuracy and the cost of sampling. Try to find in the literature at what speed your data is expected to vary.

Measurement error: If the measurement error contains a small epsilon that is randomly positive or negative, then taking the average of 7 days to make one weekly value is better, because averaging increases the chance that these variations cancel out. Otherwise, it is enough to sub-sample, taking only 1 day per week and throwing away the other days. I would try both methods, averaging and sub-sampling, and see whether the outputs differ significantly.
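The comparison of the two methods can be sketched like this, again with a made-up daily series (a seasonal signal plus random noise) standing in for the real data:

```python
import numpy as np
import pandas as pd

# Hypothetical daily series for illustration: a yearly cycle plus noise
idx = pd.date_range("2006-01-01", "2015-12-31", freq="D")
rng = np.random.default_rng(1)
signal = 20 + 5 * np.sin(np.arange(len(idx)) * 2 * np.pi / 365.25)
daily = pd.Series(signal + rng.normal(0, 2, len(idx)), index=idx)

# Method 1: averaging -- each weekly value is the mean of the 7 daily
# values, which tends to cancel symmetric random measurement error
averaged = daily.resample("W").mean()

# Method 2: sub-sampling -- keep only the first recorded day of each week
subsampled = daily.resample("W").first()

# If the two results are close relative to the signal, either method works
diff = (averaged - subsampled).abs().mean()
```

If `diff` is small compared to the variations you care about, sub-sampling is good enough; if it is dominated by noise, prefer the averaged version.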

B.D.