0

I have this dataframe, "Data", containing one full year of data collected about every half-hour, but for some days only a few hours of data were collected.

Dates are in the format 31.01.2010 00:30 (date and time in one cell). The variables are temperature, humidity, PM10, windspeed, etc.

First question: How can I calculate the daily mean, median, max and min of these variables, so that I can use the daily values instead of the hourly/half-hourly data in further analysis (such as survival analysis with GAM)?

Obviously, the calculated daily average/median should be assigned to its corresponding date.

Second question: the DATES column contains both date and time together, separated by one space in the same cell. In R, its type is 'Factor' and I cannot do any calculations, because an error saying that "dates" is missing appears.

My guess is that I need to convert it first from Factor into date/time so it can be recognized and then to calculate means/medians. But how do I do this?

Can you please indicate what would be the arguments/functions to use?

I think that I have solved the conversion of date from 'Factor' to POSIXlt: I used the function strptime (Data$DATES, format="%d.%m.%Y %H:%M") and now $DATES are recognized as POSIXlt, format "2010-01-01 00:00:00" ....

But I still need to find the function that calculates daily means or averages or medians or whatever.

IRTFM
Max DeV.
  • I think that I have solved the conversion of date from 'Factor' to POSIXlt: I used the function strptime(Data$DATES, format="%d.%m.%Y %H:%M") and now $DATES are recognized as POSIXlt, format "2010-01-01 00:00:00" .... – Max DeV. Dec 17 '15 at 16:17
  • Removed 'survival analysis" and 'gam' tags since neither of them applied to the question. – IRTFM Dec 19 '15 at 23:23

2 Answers

0

First, convert your time series into an xts object. Then compute the statistics you want using xts functions such as apply.daily(); see the xts vignette.

I feel that the following snippet should work:

# Load library xts
require(xts)

# Create example dataframe 
datetime <- c('31.01.2010 00:30', '31.01.2010 00:31', '31.01.2010 10:32', '01.02.2010 10:00', '01.02.2010 11:03', '01.03.2011 08:09', '01.03.2011 21:00', '01.03.2011 22:00')
value <- c(1.5, 2, 2.5, 7, 3.5, 9, 4.5, 7.5)
df <- data.frame(datetime, value)

# Create xts object (as.Date drops the time of day, so all readings from one day share the same index value)
df.xts <- as.xts(df[,2], order.by=as.Date(df[,1], format='%d.%m.%Y %H:%M'))

# Daily mean
d.mean <- apply.daily(df.xts, mean)

# Daily median
d.median <- apply.daily(df.xts, median)

# Daily min
d.min <- apply.daily(df.xts, min)

# Daily max
d.max <- apply.daily(df.xts, max)

(alternatively, see RFiddle)

tagoma
  • edouard, I checked the xts vignette, and found nothing about median, means, or other common statistics; are you sure this package can do that? Anyways, the option below, proposed by xgord works, though. – Max DeV. Dec 21 '15 at 18:23
  • hello, i edited my answer just now. please see above. (hopefully i got what you were after) – tagoma Dec 21 '15 at 22:16
-1

There are several parts to the problem. Before calculating the median statistics, you need to massage the dataframe so that it has the appropriate types.

For these explanations I'm going to assume you have a dataframe named dt.


Part 1: Converting the datatypes of the dataframe

date factor to datetime StackOverflow

datetime POSIXct conversion StackOverflow

First you need to convert the Date column from the factor type to a date-time type. Convert it to POSIXct rather than leaving it as the POSIXlt that strptime returns, because ddply can fail on POSIXlt columns (see the comments).

dt$Date <- as.POSIXct(as.character(dt$Date),
                      format = "%d.%m.%Y %H:%M",
                      tz = "UTC") # the fixed time zone is an assumption; POSIXct allows use with ddply

Then, since I'm assuming you want the median statistics by day-month-year, not including the time, we'll need to extract that info. You'll want to put it in a new column to preserve the times.

dt$date_alt <- as.Date(format(dt$Date, "%Y-%m-%d")) # calendar day only; the times stay in dt$Date
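A quick way to confirm that the conversion worked is to look at the column classes with str(). The sketch below uses a made-up two-row data frame and assumes a fixed UTC time zone; it builds the day column with as.Date(), which is the variant the asker reports working in the comments.

```r
# Hypothetical two-row data frame mimicking the layout in the question
dt <- data.frame(Date = factor(c("31.01.2010 00:30", "01.02.2010 10:00")),
                 Temperature = c(1.5, 7))

# factor -> character -> POSIXct (the fixed time zone is an assumption)
dt$Date <- as.POSIXct(as.character(dt$Date),
                      format = "%d.%m.%Y %H:%M", tz = "UTC")

# Keep only the calendar day as the grouping key
dt$date_alt <- as.Date(format(dt$Date, "%Y-%m-%d"))

# Date should be reported as POSIXct and date_alt as Date
str(dt)
```

If str() still reports a Factor or POSIXlt column, the grouping step is likely to fail or give unexpected results.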


Part 2: Calculating summary statistics grouped by a particular field

Now that we have the dataframe looking the way we want it, you can calculate the average statistics grouped by the day-month-year, which in our case is the date_alt column.

The plyr package provides a really nice function for doing this: ddply

library(plyr) # need this library for the plyr call

summ <- ddply(dt, .(date_alt), summarize, 
              med_temp = median(Temperature, na.rm = TRUE), # na.rm = TRUE skips missing readings
              mean_temp = mean(Temperature, na.rm = TRUE), # you can also calc mean if you want
              med_humidity = median(humidity, na.rm = TRUE),
              med_windspeed = median(windspeed, na.rm = TRUE)
              # etc for the rest of your vars
          )
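One detail worth checking in a call like this is how missing values are handled. The following minimal sketch, with a made-up vector named temp, contrasts wrapping a variable in !is.na() with passing na.rm = TRUE to the summary function.

```r
temp <- c(10, 20, NA, 40)   # made-up readings with one missing value

!is.na(temp)                # logical vector: TRUE TRUE FALSE TRUE
median(!is.na(temp))        # median of the logical flags, not of the readings
median(temp, na.rm = TRUE)  # median of the three non-missing readings
```

Because !is.na() returns TRUE/FALSE flags, any statistic computed on it lies between 0 and 1, which would explain a summary table filled with 1s like the one reported in the comments.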


Breaking down the ddply call:

ddply cookbook explanation

ddply is essentially a function which acts over a dataframe. Here's a breakdown of the arguments to the function call:

  1. dt -- the name of the dataframe you want to iterate over
  2. .(date_alt) -- the names of the columns you want to group by. Conceptually, this splits the dataframe up into a bunch of subdataframes whose rows consist of rows from the original dataframe which share the same values in the columns listed in the parentheses.
  3. summarize -- this tells the ddply call that you want to calculate aggregate statistics on the subdataframes
  4. med_temp = median(Temperature) and all similar lines -- each defines a column in the result dataframe. This says that you want the new dataframe to have a column called med_temp that contains the median(Temperature) result for each sub-dataframe. Keep in mind that instead of median you can use whatever aggregate function you want.
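To make the split / summarize idea above concrete, here is a rough base R sketch of the same steps without plyr. It is not how ddply is implemented, only an illustration of the concept; the data frame and its values are made up.

```r
# Made-up readings for two days
dt <- data.frame(
  date_alt    = as.Date(c("2010-01-31", "2010-01-31", "2010-02-01", "2010-02-01")),
  Temperature = c(1.5, 2.5, 7, 3)
)

# Split: one sub-dataframe per day
pieces <- split(dt, dt$date_alt)

# Apply: reduce each sub-dataframe to a single summary row
rows <- lapply(pieces, function(piece) {
  data.frame(date_alt  = piece$date_alt[1],
             med_temp  = median(piece$Temperature, na.rm = TRUE),
             mean_temp = mean(piece$Temperature, na.rm = TRUE))
})

# Combine: stack the summary rows into one dataframe
summ <- do.call(rbind, rows)
summ
```

The result has one row per day, which is the shape needed for the daily-level analysis described in the question.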
xgord
  • Thanks xgord for the valuable information. I have created the alternative column for dates as per your instruction. I will now work on the ddply function, and will post later, as I need a bit of time to figure out the correct syntax. – Max DeV. Dec 17 '15 at 16:41
  • xgord, I have tried your code; seems there is a problem. I got as output after >summ DATE_ALT med_pm10 mean_pm10 med_temp mean_temp 1 1 1 1 1 Any ideas of what could be wrong? Thanks. – Max DeV. Dec 17 '15 at 17:14
  • I forgot exactly what it said, but something like there is one row only... I can't reproduce the error. – Max DeV. Dec 17 '15 at 17:20
  • Sometimes there can be an issue if all of the datetime columns in your dataframe aren't of type `POSIXct`. Make sure when you run `str(dt)` that the posix columns are `POSIXct` and not any other type (ex. `POSIXlt`). Use the `dt$colname <- as.POSIXct(dt$colname)` command with whichever columns aren't the right type. – xgord Dec 17 '15 at 17:22
  • I got it Error: length(rows) == 1 is not TRUE – Max DeV. Dec 17 '15 at 17:22
  • Yes, the dates are recognized as POSIXlt, not as POSIXct as you recommend. I will try again and post later. Thanks. – Max DeV. Dec 17 '15 at 17:24
  • another thing to check: make sure inside the `ddply` call you're using `summarize` and ***not*** `summary` -- there's a difference between those 2 functions – xgord Dec 17 '15 at 17:29
  • Well, I changed both the DATE and the DATE_ALT as POSIXct, and tried again. The same error came: Error: length(rows) == 1 is not TRUE. Does it mean that the DATE_ALT column contains only one row? just my wild guessing, lol. I have no idea what it may signify – Max DeV. Dec 17 '15 at 17:33
  • see previous comment about `summarize` vs `summary`. also what does your `ddply` call look like? – xgord Dec 17 '15 at 17:35
  • Yes I saw it. I am using the summarize function. As for the ddply, it seems to be working. One detail, It warned when I loaded the plyr package, saying that I should have loaded first cause ddply was already loaded. Does it make sense? Maybe I should start again from fresh R session? – Max DeV. Dec 17 '15 at 17:38
  • Where / during which commands is the error being thrown? Also can you put your ddply call here? – xgord Dec 17 '15 at 17:46
  • xgord, sorry for the delay. I figured out where the problem was: One problem was the code: dt$date_alt <- strptime(x = as.character(dt$Date), format = "%d.%m.%Y") This was wrong. It had to be as.Date. I corrected it with this code: dt$date_alt <- as.Date(dt$Date, format="%d.%m.%Y"). Then, the other problem was in the ddply code you proposed; for example, instead of writing the code: med_temp=median(!is.na(Temperature)), I wrote this code as: median=median(Temp), and the same for the others. So after >summ I get the matrix with dates listed by days and the columns of means and medians. – Max DeV. Dec 21 '15 at 18:16