0

Given a simplified example time series of a population by year:

Year <- c(2001, 2002, 2003, 2004, 2005, 2006)
Pop <- c(1, 4, 7, 9, 20, 21)
DF <- data.frame(Year, Pop)

What is the best method to test whether the change between years is significant, i.e. which years are significantly different from each other?

Vinterwoo
  • 2
    This is more of a statistics question, and not a programming question. – joran Feb 10 '13 at 04:05
  • Sorry for the mispost. Is it possible to port this question to the stats page? Or is it bad etiquette to cross-post? – Vinterwoo Feb 10 '13 at 04:56
  • You can flag for migration. That is a moderator-controlled process. For some screwball reason SO limits the number of moderation flags one can use, even if the moderator agrees that the flag is valid, so I've tried to limit my use of them. – IRTFM Feb 10 '13 at 06:54

2 Answers

6

As @joran mentioned, this is really a statistics question rather than a programming question. You could try asking on http://stats.stackexchange.com to obtain more statistical expertise.

In brief, however, two approaches come to mind immediately:

  1. Fit a regression line to population vs. year. A statistically significant slope would indicate an overall trend in population over the years. In R, use lm(), like this: lmPop <- lm(Pop ~ Year, data = DF).
  2. Divide the time period into blocks (e.g. the first three years and the last three years), and assume that the population figures for the years in each block are all estimates of the mean population during that block. That gives you a mean and a standard deviation of the population for each block, which lets you do a t-test, like this: t.test(Pop[1:3], Pop[4:6]).

Both of these approaches have potential difficulties, and the validity of each depends on the nature of the data you're examining. For the sample data, however, the first approach suggests a trend over time that is significant at the 5% level (p = 0.00214 for the slope coefficient), while the second approach gives a result where the null hypothesis of no difference in means cannot be rejected at the 5% level (p = 0.06332).
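For reference, here is a minimal sketch of how the two p-values can be pulled out of the fitted objects, using the sample data from the question. The names slope_p and block_p are only illustrative and are not part of the original answer. Note that t.test() defaults to Welch's unequal-variance test, which is what is assumed here.

```r
# Sample data from the question
Year <- c(2001, 2002, 2003, 2004, 2005, 2006)
Pop  <- c(1, 4, 7, 9, 20, 21)
DF   <- data.frame(Year, Pop)

# Approach 1: linear trend; p-value of the slope coefficient
lmPop   <- lm(Pop ~ Year, data = DF)
slope_p <- summary(lmPop)$coefficients["Year", "Pr(>|t|)"]

# Approach 2: first three years vs. last three years
# (Welch two-sample t-test, the default for t.test())
tt      <- t.test(DF$Pop[1:3], DF$Pop[4:6])
block_p <- tt$p.value

# Print both p-values side by side
print(c(slope = slope_p, blocks = block_p))
```

With only three points per block, the t-test has very little power, so the gap between the two p-values is not surprising.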

Simon
  • @Spacedman's answer explains nicely why the two analysis approaches suggested above give different answers: They're answering different questions based on different models of the data. One approach models the data as a trend over time and tests for a significant trend, while the other models the data as two groups of points, separated in time, and tests for a difference in the means of the two groups of points. – Simon Feb 10 '13 at 19:29
4

They're all significantly different from each other. 1 is significantly different from 4, 4 is significantly different from 7 and so on.

Wait, that's not what you meant? Well, that's all the information you've given us. As a statistician, I can't work with anything more.

So now you tell us something else: "Are any of the values significantly different from a straight line, where the deviations of the Pop values from the line are independent, Normally distributed values with mean 0 and the same variance?" or something.

Simply put, a bare bunch of numbers cannot be the subject of a statistical analysis. When working with a statistician, you need to agree on a model for the data; only then can the statistical methods answer questions about significance and uncertainty.
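To make that concrete, here is one possible sketch of what a question looks like once a model is stated. It assumes the straight-line model with independent Normal errors of equal variance mentioned above, and asks whether any single year deviates unusually from the fitted line. The choice of externally studentized residuals and a Bonferroni correction is an assumption of this sketch, not something the question specifies.

```r
# Sample data from the question
Year <- c(2001, 2002, 2003, 2004, 2005, 2006)
Pop  <- c(1, 4, 7, 9, 20, 21)
DF   <- data.frame(Year, Pop)

# Assumed model: Pop = a + b * Year + independent Normal(0, sigma^2) errors
fit <- lm(Pop ~ Year, data = DF)

# Externally studentized residuals; under the model each one follows
# a t distribution with (residual df - 1) degrees of freedom
r        <- rstudent(fit)
df_resid <- df.residual(fit) - 1

# Two-sided p-value per year, Bonferroni-adjusted for testing all six years
p_raw <- 2 * pt(abs(r), df = df_resid, lower.tail = FALSE)
p_adj <- p.adjust(p_raw, method = "bonferroni")

print(data.frame(Year = DF$Year, Pop = DF$Pop,
                 rstudent = round(r, 2), p_adj = round(p_adj, 3)))
```

A different model (say, growth on a log scale, or autocorrelated errors) would change both the code and the answer, which is rather the point.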

I think that's often the thing non-statisticians don't get. They go "here's my numbers, is this significant?" - which usually means typing them into SPSS and getting a p-value out.

[have flagged this Q for transfer to stats.stackexchange.com where it belongs]

Spacedman