I would like to use the subset function in R to extract smaller groups of panel study time series data.

My data is a data frame with six columns: district (8 districts), gender, age interval (4 groups), year, month, and a count column.

Example:

  District Gender Year Month AgeGroupNew TotalDeaths
1 Eastern  Female 2003     1           0           4
2 Eastern  Female 2003     1        01-4           1
3 Eastern  Female 2003     1       05-14           1
4 Eastern  Female 2003     1         15+          91
5 Eastern  Female 2003     2           0           4
6 Eastern  Female 2003     2        01-4           1

I would like to extract a smaller subset for each district, gender and age interval, to get something like this:

     District  Gender Year Month AgeGroupNew TotalDeaths
     Northern    Male 2003     1        01-4           0
     Northern    Male 2003     2        01-4           1
     Northern    Male 2003     3        01-4           0
     Northern    Male 2003     4        01-4           3
     Northern    Male 2003     5        01-4           4
     Northern    Male 2003     6        01-4           6
     Northern    Male 2003     7        01-4           5
     Northern    Male 2003     8        01-4           0
     Northern    Male 2003     9        01-4           1
     Northern    Male 2003    10        01-4           2
     Northern    Male 2003    11        01-4           0
     Northern    Male 2003    12        01-4           1
     Northern    Male 2004     1        01-4           1
     Northern    Male 2004     2        01-4           0

and so on, up to

     Northern    Male 2006    11        01-4           0
     Northern    Male 2006    12        01-4           0

So far I have been trying to use this, thanks to DWin pointing it out in a previous question.

subset(datNew, subset=(District=="Eastern" &  Gender=="Female" &  AgeGroupNew=="01-4"))
[1] District    Gender      Year        Month       AgeGroupNew TotalDeaths
<0 rows> (or 0-length row.names)

But R keeps returning the empty result shown above, which it shouldn't.

I have tried other combinations with success, but it seems using 'District' in the subset causes this <0 rows> (or 0-length row.names).

This works:

> head(subset(datNew, Year=="2004" & Month=="8" & AgeGroupNew =="0"))
         District Gender Year Month AgeGroupNew TotalDeaths
77       Eastern  Female 2004     8           0          10
269      Eastern    Male 2004     8           0           6
461  Khayelitsha  Female 2004     8           0          13
653  Khayelitsha    Male 2004     8           0          15
845  Klipfontein  Female 2004     8           0           7
1037 Klipfontein    Male 2004     8           0           6

but not

> head(subset(datNew, District=="Eastern" & Gender=="Female" & AgeGroupNew =="0"))
[1] District    Gender      Year        Month       AgeGroupNew TotalDeaths
<0 rows> (or 0-length row.names)

Any reason why District is causing this? It's absolutely wrong that there are 0 rows with that combination of the subset - there's enough data to my knowledge.

I've tried experimenting - and from other posts, this is a baby step closer to what I want to achieve, but still not working:

> head(subset(datNew,datNew[[1]] %in% District[1] & Gender=="Female" & AgeGroupNew=="0"))
   District Gender Year Month AgeGroupNew TotalDeaths
1  Eastern  Female 2003     1           0           4
5  Eastern  Female 2003     2           0           4
9  Eastern  Female 2003     3           0           5
13 Eastern  Female 2003     4           0          12
17 Eastern  Female 2003     5           0           7
21 Eastern  Female 2003     6           0          13

With this I am unable to choose any of the other districts, such as "Southern", "Khayelitsha", etc., no matter how I change datNew[[1 or 2 or 3]] and District[[1 or 2 or 3]]. I also don't really know what %in% does above.
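
For reference, here is a minimal sketch of what %in% does and why changing the index does not select a different district. The data frame toy and its values are hypothetical stand-ins for datNew. Inside subset(), District refers to the whole column, so District[1] is simply the value stored in row 1, not the first district name; because that value comes from the column itself, it matches whatever is actually stored there.

```r
# Hypothetical toy data standing in for datNew
toy <- data.frame(
  District    = factor(c("Eastern", "Eastern", "Northern", "Northern")),
  Gender      = factor(c("Female", "Male", "Female", "Male")),
  TotalDeaths = c(4, 6, 0, 3)
)

# x %in% y returns one TRUE/FALSE per element of x: is it found anywhere in y?
in_demo <- c("a", "b", "c") %in% c("a", "c")

# District[1] is the value in row 1, so this keeps every row whose
# district equals the district of row 1
same_as_row1 <- subset(toy, toy[[1]] %in% District[1])

# Row 2 holds the same district as row 1, so District[2] keeps the same rows
same_as_row2 <- subset(toy, toy[[1]] %in% District[2])

# To pick a different district by position, index the factor levels instead
second_district <- subset(toy, District %in% levels(District)[2])
```

Indexing levels(District) rather than District is one way to step through the distinct district names, assuming the column is a factor.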

I am so stuck. Any help, please.

OSlOlSO
  • I am guessing that your District data contains spaces at the end of each string. Have a look at the right alignment of your samples above. Gender is aligned with the "r", but District is aligned one to the left. Inspect your data for empty space. – Andrie Jul 11 '11 at 14:33
  • Weird! Perhaps you could consider switching the districts to a number, instead? – Christian Bøhlke Jul 11 '11 at 14:49

1 Answer

Prediction: Give us the results of str(datNew$District[1]) and all will be revealed. I predict there is a non-printing character that will show up, perhaps a trailing space (or two).
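
As a rough illustration of that kind of inspection, the sketch below uses a hypothetical data frame toy whose District values carry a trailing space. Printing the data hides the space, but str(), dput() and nchar() on the factor levels expose it.

```r
# Hypothetical toy data: every District value has one trailing space
toy <- data.frame(District = factor(c("Eastern ", "Northern ", "Eastern ")))

# str() and dput() quote the levels, so stray whitespace becomes visible
str(toy$District[1])
dput(levels(toy$District))

# "Eastern" has 7 characters, so a length of 8 signals an extra character
level_lengths <- nchar(levels(toy$District))

# Comparing against the name without the space matches nothing
n_match_plain <- sum(toy$District == "Eastern")
n_match_space <- sum(toy$District == "Eastern ")
```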

So with the results of str(...) the correct code would be:

subset(datNew, District=="Eastern " & Gender=="Female" & AgeGroupNew =="0")
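
An alternative that is not part of the answer above is to strip the whitespace from the factor levels once, so the district names can then be typed without the trailing space. This is only a sketch on hypothetical toy data, and it assumes R 3.2.0 or later, where trimws() is available in base R.

```r
# Hypothetical toy data with trailing spaces in District
toy <- data.frame(
  District = factor(c("Eastern ", "Northern ", "Eastern ")),
  Gender   = factor(c("Female", "Male", "Female"))
)

# Remove leading and trailing whitespace from the factor levels in place
levels(toy$District) <- trimws(levels(toy$District))

# The comparison now works with the name as it appears when printed
res <- subset(toy, District == "Eastern" & Gender == "Female")
```

If the data came from read.table() or read.csv(), passing strip.white = TRUE when reading may avoid the problem at the source, provided the fields are unquoted.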
IRTFM
  • These are the results: `> str(datNew$District[1]) Factor w/ 8 levels "Eastern ","Khayelitsha ",..: 1` – OSlOlSO Jul 11 '11 at 16:39
  • So I was right. You are misspelling the "Eastern " factor level. – IRTFM Jul 11 '11 at 16:46
  • Okay - I see the trailing space :) So the `subset` function works now. Thanks @DWin @Andrie @ChristianBøhlke for the help. I probably wasted 8 hours because of a stupid trailing space. – OSlOlSO Jul 11 '11 at 16:48
  • Unrelated question - how does one "solve" this question? Do I mark yours as the correct answer, or should I add an "Answer Your Question" with the correct code with the trailing spaces included? (Just want to keep the Stackoverflow system clean and working.) – OSlOlSO Jul 11 '11 at 16:54
  • I'll post it. Besides using str with every puzzle, you can also use traceback() with every error message (which this was not) and sometimes figure out where you should be looking more closely. – IRTFM Jul 11 '11 at 18:25