-1

My data frame has a 10 columns and 100,000 rows, each row is an observation and the columns are data pertaining to each observation. One of the columns has the date of an observation in the julian day(ie feb 4= day 34). I want to reduce my data set so I'd have the first 10% observations PER year PER species. Ie, for species 1 in the year 1901 I want the average day of appearance based on the first 10% of observations.

Example of what I have: note id= species but as a number. ie blue=1

date=c(3,84,98,100,34,76,86...)
species=c(blue,purple,grey,purple,green,pink,pink,white...)
id=c(1,2,3,2,4,5,5,6...)
year=c(1901,2000,1901,1996,1901,2000,1986...)  
habitat=c(forest,plain,mountain...)

ect What i want: date=c(3,84,76,86...) species=c(purple,pink,pink, white...) id=c(2,5,5,6...) year=c(1901,2000,2000,1986...)
habitat=c(forest,plain,mountain...) new=c(3,84,79,86...)

John
  • 59
  • 1
  • 9

2 Answers2

3

Assuming the data set dd defined below

set.seed(123)
n <- 100000
dd <- data.frame(year = sample(1901:2000, n, replace = TRUE), 
                 date = sample(0:364, n, replace = TRUE),
                 species = sample(1:5, n, replace = TRUE))

1) base Aggregate dd with the indicated function. No packages are used:

avg10 <- function(date) {
  ok <- seq_along(date) <= length(date) / 10
  if (any(ok)) mean(date[ok]) else NA
}
aggregate(date ~ species + year, dd, avg10)

2) data.table Here is a data.table solution:

data.table(dd)[, 
  {ok <- .I <= .10 * .N; if (any(ok)) mean(date[ok]) else NA}, by = "species,year"]

Note: If you don't want NA's then use this instead of either of the if statements above to get the first point in that case:

  if (any(ok)) mean(date[ok]) else date[1]
G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341
  • This seems to be close...except some values returned back as "NA" what does that mean and I lost all the other columns with corresponding information for each observation. – John Mar 30 '15 at 18:50
  • If there are too few rows then there will not be enough to get 10% so it returns NA. This works with the data you provided. – G. Grothendieck Mar 30 '15 at 18:54
  • Would it not be best to first shrink the data set down so only the first 10% of observations per species per year is present and then take the mean of those dates for each year per species? – John Mar 30 '15 at 18:56
  • Then you would have to split it twice. – G. Grothendieck Mar 30 '15 at 19:45
  • Have added Note in case you don't want NAs and want the first observation instead. – G. Grothendieck Mar 30 '15 at 19:48
1

Just as for your last question, dplyr may work well for you:

Some data:

library(dplyr)
set.seed(42)
n <- 500
dat <- data.frame(date = sample(365, size=n, replace=TRUE),
                  species = sample(5, size=n, replace=TRUE),
                  year = 1980 + sample(20, size=n, replace=TRUE))

How it looks without filtering:

dat %>% group_by(year, species) %>% arrange(year, date)
## Source: local data frame [500 x 3]
## Groups: year, species
##    date species year
## 1    50       1 1981
## 2   138       1 1981
## 3   174       1 1981
## 4   179       1 1981
## 5   200       1 1981
## 6   332       1 1981
## 7    31       2 1981
## 8    52       2 1981
## 9   196       2 1981
## 10  226       2 1981
## ..  ...     ...  ...

How it looks with the first 10% by date within each year:

dat %>%
    group_by(year, species) %>%
    filter(ntile(date, 10) == 1) %>%
    arrange(year, date)
## Source: local data frame [100 x 3]
## Groups: year, species
##    date species year
## 1    50       1 1981
## 2    31       2 1981
## 3    63       3 1981
## 4   112       4 1981
## 5     1       5 1981
## 6    40       1 1982
## 7   103       2 1982
## 8    40       3 1982
## 9    86       4 1982
## 10   48       5 1982
## ..  ...     ...  ...

I think the ntile trick is doing what you want: it breaks the data into roughly equal-sized bins, so it should be giving you the lowest 10% of your dates.

EDIT

Sorry, I missed the mean in there:

dat %>% group_by(year, species) %>%
    filter(ntile(date, 10) == 1) %>%
    summarise(date = mean(date)) %>%
    arrange(year, date)
## Source: local data frame [99 x 3]
## Groups: year
##    year species date
## 1  1981       5    1
## 2  1981       2   31
## 3  1981       1   50
## 4  1981       3   63
## 5  1981       4  112
## 6  1982       1   40
## 7  1982       3   40
## 8  1982       5   48
## 9  1982       4   86
## 10 1982       2  103
## ..  ...     ...  ...
Community
  • 1
  • 1
r2evans
  • 141,215
  • 6
  • 77
  • 149
  • Is it possible to have a value returned for each observation which would be the mean day of the first 10% per species per year, which would generate a new column entirely? and just eliminate data that doesn't fall within the first 10% of the observations? – John Mar 30 '15 at 18:35
  • This is wonderful! lat question: how would I eliminate the observations from the data set that fall outside of the initial 10%? ie I want to eliminate rows that were not used in the calculation of the mean date. Thank you so much!! – John Mar 30 '15 at 19:31
  • That's what the `filter()` part is doing: allowing only the first 10% to continue. You need to capture the output into a new variable with something like `datFiltered <- dat %>% group_by(year, species) %>% ...`. – r2evans Mar 30 '15 at 19:33
  • Okay, so the new data frame would have less rows( just because there's the same amount in your before and after) and all my additional columns will follow along accordingly( i mentioned i have 10 columns in total)? – John Mar 30 '15 at 19:39
  • 500 rows in my before, 100 in my forgot-the-mean, and 99 in my edited-with-the-mean. Have you tried it with your data yet? – r2evans Mar 30 '15 at 19:40
  • I just tried it and your version only returns the three mentioned columns but I lost all the additional information for each observation – John Mar 30 '15 at 20:02
  • You never mentioned other columns. Perhaps you can redefine what your desired output would look like, and provide a more representative working example data.frame? (Plus, if you are summarizing multiple rows of one column with a mean, what do you propose we do with all of the other fields?) – r2evans Mar 30 '15 at 20:06
  • I apologize for any confusion. My data frame has 10 columns and 100,000 rows, where each row is an observation of a species. I am interested in the initial appearance of each species for each year but only care about the first 10%. Therefore, I want to rid of the observations(rows) that are not part of the first 10% of the observations for a given species for a given year but also want to maintain all relevant information for that species (columns) if they are part of the 10% – John Mar 30 '15 at 20:14
  • How do you propose to reduce the data with `mean` yet keep all the other data intact? Do you want all of `date` to be replaced with the appropriate `mean(date)`, keeping the number of rows the same? Or is there another summary step you need to do with the other data? Are they qualitative or quantitative? Best to adjust your question with a few rows of more-complete-looking data and your suggested output, since I'm confused about how you propose to summarize things. – r2evans Mar 30 '15 at 20:20
  • Okay so i tried to fix the question but I want to !) reduce the number of rows so that only observations the make up the first 10% for a given year, for a species are present. Then I would like to create a new column that has the mean date of that 10% per year per species. So say I have 20 observations in 1901 for the blue species, I'd want my new subset to have only two observations (20*.1=2) of the earliest species and an average of whatever those days were. – John Mar 30 '15 at 20:45
  • Okay, but what do you want to do with the remaining columns? It won't makes sense to try `mean(id)` or even `mean(habitat)`, so the remaining columns do not make sense to retain (unless you also filter PER year PER species PER habitat ...). – r2evans Mar 30 '15 at 20:52
  • no, i wouldn't want means of the habitat or so forth, i want all of the original info that matches that observations. ex: if row one falls in the first 10% of observations i want to keep it as is except have a new value in a new column with gives the means date of that species appearance for that year. – John Mar 30 '15 at 21:16
  • if a row does not fall in the first 10% of observations, say it was seen in November 25th, it want to eliminate that entire row all together. essentially, i want an identical version of my input but with less rows and a new column that has the mean date of appearance for that species for that year. I'm so so sorry if i'm not coming across clearly...basically all pink species in 1901 will have the same mean appearance date by some may have been seen in a forest while other in a park so the information in the columns will still remain unique to them. – John Mar 30 '15 at 21:16
  • Soooooo ... once you reduce your data set to the first 10% (date-wise by year and species), you want *no further reduction in rows*. Correct? Perhaps try changing `summarise(date = mean(date))` to `mutate(date2 = mean(date))` and see if that's closer to what you want. – r2evans Mar 30 '15 at 21:22
  • So it's looking better, but i'm still missing the other columns...could i add them in by including them in the " arrange(year, date)" ie" arrange (year, date,habitat,id). Also i know for example my 1901 doesn't have enough rows to do 10% , so I want it to return NA like the guys above me did. – John Mar 30 '15 at 22:05