0

I have a dataframe in R where each row corresponds to a household. One column describes a date in 2010 when that household planted crops. The remainder of the dataset contains over 1000 columns describing the temperature on every day between 2007-2010 for those households.

This is the basic form:

        Date 2007-01-01 2007-01-02 2007-01-03
1 2010-05-01         70         72         61
2 2010-02-10         63         59         73
3 2010-03-06         60         59         81

I need to create columns for each household that describe the monthly mean temperatures of the two months following their planting date in each of the three years prior to 2010.

For instance: if a household planted on 2010-05-01, I would need the following columns:

  • mean temp of 2007-05-01 through 2007-06-01
  • mean temp of 2007-06-02 through 2007-07-01
  • mean temp of 2008-05-01 through 2008-06-01

...

  • mean temp of 2009-06-02 through 2009-07-01

I skipped two columns, but you get the idea. Specific code would be most helpful, but in general, I am just looking for a way to pull data from specific columns based upon a date that is described by another column.

bricevk
  • 197
  • 8
  • Is there a column that uniquely identifies household or safe to assume the `Date` column does it? How many households (rows) are there? – Chuck P May 07 '20 at 16:26
  • @ChuckP There is a column that uniquely identifies households, it is "hhid". My bad for not including it in my little example data frame above – bricevk May 07 '20 at 17:15

2 Answers2

1

Hi @bricevk you could use the apply function. It allows you to use a function over a data either column-wise or row-wise.

https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/apply

Say your data is in a object df. It applies the mean function over the columns of df . Giving you the column-wise mean. The 2 indicates the columns. This wpuld the daily average, assuming each column, is a day.

Averages <- apply(df,2,mean)

If I didn't answer this the way you would like perhaps I have not really understood your dataset. Could you try explain it more clearly?

user4933
  • 1,485
  • 3
  • 24
  • 42
0

I suggest you to use tidyverse. However, in order to be compatible with this universe, you firstly have to make your data standard, ie tidy. In your example, the things would be easier if you transformed your data in order to have your observations ordrered by rows, and columns being variables. If I correctly understood your data, you have households planting trees (the row names are dates of plantation ?), and then controls with temperature. I'd do something like :

    -----------------------------------------------------------------------------
    | Household ID | planting date | Date of control | Temperature controlled |
    -----------------------------------------------------------------------------

firstly, have your planting date stored as another thing than a rowname, by example :

     library(dplyr)
     df <- tibble::rownames_to_column(data, "PlantingDate")

You also have to get your household id var you haven't specified to us. Then you can manage to have the tidy data with tidyr, using

     library(tidyr)
     df <- gather(df,"DateOfControl","Temperature",-c(PlantingDate,ID))

When you'll have that, you'll be able to use the package lubridate, something like

     library(lubridate)
     df %>%
     group_by(ID,PlantingDate,year(ControlDate),month(ControlDate)) %>%
     summarise(MeanT=mean(Temperature))

could work

Arnaud Feldmann
  • 761
  • 5
  • 17