Thanks to joran for helping me to group data in my previous question where I wanted to make a data frame in R smaller so that I can do time-series analysis on the data.

Now I would like to extract further data from the data frame. It has 6 columns. Columns 1 to 5 each hold discrete names/values: a district, gender, year, month and age group. The sixth column is the number of deaths for that specific combination. An extract looks like this:

             District  Gender Year Month    AgeGroup TotalDeaths
             Northern    Male 2006    11        01-4           0
             Northern    Male 2006    11       05-14           1
             Northern    Male 2006    11         15+          83
             Northern    Male 2006    12           0           3
             Northern    Male 2006    12        01-4           0
             Northern    Male 2006    12       05-14           0
             Northern    Male 2006    12         15+         106
             Southern  Female 2003     1           0           6
             Southern  Female 2003     1        01-4           0
             Southern  Female 2003     1       05-14           3
             Southern  Female 2003     1         15+         136
             Southern  Female 2003     2           0           6
             Southern  Female 2003     2        01-4           0
             Southern  Female 2003     2       05-14           1
             Southern  Female 2003     2         15+         111
             Southern  Female 2003     3           0           2
             Southern  Female 2003     3        01-4           0
             Southern  Female 2003     3       05-14           1
             Southern  Female 2003     3         15+         141
             Southern  Female 2003     4           0           4

I am new to time series, and I think I need to do the following to analyse the data: extract smaller 'time-series' data objects, each one a single longitudinal series for a unique combination of the grouping variables. For example, from the data frame above, I want to extract a smaller data object like this for each combination of District, Gender and AgeGroup:

             District  Gender Year Month    AgeGroup TotalDeaths
             Northern    Male 2003     1        01-4           0
             Northern    Male 2003     2        01-4           1
             Northern    Male 2003     3        01-4           0
             Northern    Male 2003     4        01-4           3
             Northern    Male 2003     5        01-4           4
             Northern    Male 2003     6        01-4           6
             Northern    Male 2003     7        01-4           5
             Northern    Male 2003     8        01-4           0
             Northern    Male 2003     9        01-4           1
             Northern    Male 2003    10        01-4           2
             Northern    Male 2003    11        01-4           0
             Northern    Male 2003    12        01-4           1
             Northern    Male 2004     1        01-4           1
             Northern    Male 2004     2        01-4           0

and so on, up to:

             Northern    Male 2006    11        01-4           0
             Northern    Male 2006    12        01-4           0

I tried creating pivot tables from this data in Excel and then extracting each series from them, but failed. After that I discovered reshape in R, but either I don't know how to write the code or reshape is not the right tool for this.

I am not even certain that this is the correct way to analyse this cross-sectional time-series data, i.e. whether another format is actually required to analyse it with functions such as read.ts(), ts() and arima().

My eventual aim is to use this data with the amelia2 package and its functions to impute the TotalDeaths values that are missing for certain months in 2007 and 2008.

Any help on how to do this, and any suggestions on how to tackle the problem, would be greatly appreciated.

OSlOlSO
  • @OSIOISO. What time series analysis are you planning to run? Take a look at the `plm` package. I would think it would be easier to run the analysis if you kept everything in a single data set. If you provide more details on your analysis, some of us might be able to help. – Ramnath Jul 10 '11 at 12:47
  • @Ramnath, Perhaps I used 'time series analysis' incorrectly. I basically want to use the data from 2003-2009 (where some months in 2007 and 2008 have missing data) to impute these missing months in 2007 & 2008. So far I've been unable to use any R functions even to look at the seasonal and long-term trend of the TotalDeaths. Thanks for pointing out plm - I should rather say the data is panel data. My problem now is how to use this 'single data', and get it read into R, for any time series analysis. Hope this clarifies. – OSlOlSO Jul 10 '11 at 15:28
  • Normally (from all the other time series questions on Stackoverflow) the time-series data contains just a sequence of date and numbers, such as in [this answer](http://stackoverflow.com/questions/6010362/r-attach-dates-to-time-series/6010417#6010417) - not in my data frame. – OSlOlSO Jul 10 '11 at 15:33
  • @OSIOISO. I still don't understand the end goal of your question. Imputation is very tricky, more so in time series. Is imputing the missing values your final goal, or are you planning to use the imputed data to conduct some other analysis? I would suggest that you clarify this in your question. Maybe `stats.stackexchange` would be a better place to post this, if there is a significant statistical slant to what you are trying to do. – Ramnath Jul 10 '11 at 15:34
  • @OSIOISO. what you have is panel data as you have correctly pointed out. you can think of the extra variables you have in your data frame as explanatory variables that might explain some of the systematic variation in the time-series you are trying to study. – Ramnath Jul 10 '11 at 15:35
  • @Ramnath. My end goal would be to impute the missing values and then conduct an analysis to the feasibility of these imputations/estimations. Thanks for pointing out `stats.stackexchange` as I'm a newbie to the stacknetwork - I will definitely explore that avenue for help. Nonetheless, you've helped better explain and understand the type of data. Thanks Ramnath. – OSlOlSO Jul 10 '11 at 16:18

1 Answer

For the narrow question of how to best extract:

subset(dfrm, subset=(District=="Northern" &  Gender=="Male" &  AgeGroup=="01-4"))

subset also has a select argument to narrow down the columns. I suspect a search on the term "extract" you were using would have only pulled up hits for the ?Extract page which surprisingly has no link to subset. (I trimmed a trailing space from an earlier version of the AgeGroup specification.)
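To make the `select` argument concrete, and to cover the "one object per District, Gender and AgeGroup" part of the question, here is a minimal sketch. The toy `dfrm` below is made up for illustration (it is not the asker's data), and the `ts()` step assumes every month is present for the chosen series, which the question says is not true for 2007 and 2008.

```r
# Toy stand-in for the asker's data frame; all values here are invented.
dfrm <- expand.grid(
  Month    = 1:12,
  Year     = 2003:2004,
  AgeGroup = c("0", "01-4"),
  Gender   = c("Male", "Female"),
  District = c("Northern", "Southern"),
  stringsAsFactors = FALSE
)
dfrm$TotalDeaths <- seq_len(nrow(dfrm)) %% 7   # made-up counts

# One series: filter the rows and keep only the columns needed.
one <- subset(dfrm,
              District == "Northern" & Gender == "Male" & AgeGroup == "01-4",
              select = c(Year, Month, TotalDeaths))
one <- one[order(one$Year, one$Month), ]

# Monthly ts object. ts() assumes regularly spaced observations, so this is
# only valid if no month is absent from `one`.
one_ts <- ts(one$TotalDeaths,
             start = c(one$Year[1], one$Month[1]),
             frequency = 12)

# All series at once: a named list with one data frame per combination.
pieces <- split(dfrm,
                list(dfrm$District, dfrm$Gender, dfrm$AgeGroup),
                drop = TRUE)
names(pieces)
```

Each element of `pieces` can then be handled the same way as `one`, for example with `lapply()`. Whether separate `ts` objects are the right input for the later imputation step is a separate question that this sketch does not settle.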

IRTFM
  • Thanks for the help @DWin. I actually have previously tried to use `subset` - but discontinued trying it as I kept on getting an error. For some reason, I'm getting the same error as previously: `[1] District Gender Year Month AgeGroup TotalDeaths <0 rows> (or 0-length row.names)` I tried other combinations such as using 'Eastern' as district and 'Female', but it kept on giving the above output. Do you perhaps know why it is not working? – OSlOlSO Jul 10 '11 at 16:21
  • The trailing spaces in the AgeGroup specification above may not match the spelling of your variable. See if trimming them helps. – IRTFM Jul 10 '11 at 16:28
  • Ugh - I'm still having a problem. I've successfully used `subset`, but when I include `District` in the subset formula, it gives that error `<0 rows> (or 0-length row.names)`. _Viz._ this works: `head(subset(data0306t, Year=="2004" & Month=="8" & Age.Group =="0"))` but not this `head(subset(data0306t, District=="Eastern" & Age.Group =="0"))` – OSlOlSO Jul 10 '11 at 17:44
  • That's not an error, just a 0 row data.frame. Note your spelling of the Age.Group column is different than what you posted ... So you need to post something like `str()` or `dput(head(data0306t))` . – IRTFM Jul 10 '11 at 19:20
  • Still not working. I'll try alternative things. I'm making sure everything, spelling and specific dataframes(because I have different ones with the same data) are correct. But thanks for the help @DWin – OSlOlSO Jul 11 '11 at 11:31
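
The thread never establishes why the `District` filter returns zero rows. One possible cause, in line with the trailing-space remark above, is whitespace padding in the stored values. The sketch below reproduces that symptom on a hypothetical three-row `data0306t` (the values are invented) and shows one way to check for it and clean it up. `trimws()` requires R 3.2.0 or later.

```r
# Hypothetical data reproducing the symptom: values padded with spaces,
# as can happen with some fixed-width or spreadsheet exports.
data0306t <- data.frame(
  District    = c("Eastern ", "Eastern ", " Northern"),
  Age.Group   = c("0", "01-4", "0"),
  TotalDeaths = c(3, 0, 5),
  stringsAsFactors = FALSE
)

# The comparison is exact, so the padded value does not match.
before <- nrow(subset(data0306t, District == "Eastern" & Age.Group == "0"))

# Inspect the stored values; dput() makes stray whitespace visible.
dput(unique(data0306t$District))

# Strip leading/trailing whitespace from every character or factor column.
is_text <- vapply(data0306t,
                  function(x) is.character(x) || is.factor(x),
                  logical(1))
data0306t[is_text] <- lapply(data0306t[is_text],
                             function(x) trimws(as.character(x)))

# The same filter after cleaning.
after <- nrow(subset(data0306t, District == "Eastern" & Age.Group == "0"))
```

If `dput()` shows no padding in the real data, the cause lies elsewhere (for example a differently spelled level), and the `str()` or `dput(head(...))` output requested above is still the quickest way to find it.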