
I am new to time series data analysis in R, and I am having trouble translating a bit of Stata code into R for a replication project.

The Stata code from the original analysis, along with its stated intent, is the following:

#### Delete extra yearc observations with different wartypes #####

drop if yearc==yearc[_n+1] & wartype!="CIVIL"
drop if yearc==yearc[_n-1] & wartype!="CIVIL"

So, translated, I keep the rows in which the country is having a civil war and delete the rows in which there is an interstate war during the same years.

In R, I have named the data object (i.e., the data set) mywar.

I am assuming I somehow do a conditional ifelse statement, or something similar, such as:

invisible(mywar$yearc <- ifelse(mywar$yearc==n-1 | mywar$yearc==n+1 | mywar$wartype!=civil, NA, 
mywar$yearc))  # I am assuming I cannot condition ifelse statements like this; but, this is how I imagine it
mywar <- mywar[!is.na(mywar$yearc),]

EDIT: So perhaps an example

> b <- c(1970, 1970, 1970, 1971, 1982, 1999, 1999, 2000, 2001, 2002)
> c <- c("inter", "civil", "intra", "civil", "civil", "inter", "civil", "civil", "civil", "civil")
> df <- data.frame(b,c)
> df$j <- ifelse(df$b==n-1 & df$b==n+1 & df$c!="civil", NA, df$b)
> df
      b     c    j
1  1970 inter 1970
2  1970 civil 1970
3  1970 intra 1970
4  1971 civil 1971
5  1982 civil 1982
6  1999 inter 1999
7  1999 civil 1999
8  2000 civil 2000
9  2001 civil 2001
10 2002 civil 2002

What I was trying to do was create NAs for rows 1, 3, and 6, since they are duplicate years in my logistic regression on the onset of civil war (I am not interested in inter and intra wars, however defined), so that I can delete these rows from my data set. Instead, I just recreated column b. (Note that this made-up data is missing the country ids; assume that these ten entries represent the same country, for instance Somalia.) So, I am interested in how to delete these types of rows in a data set with 28,000 rows.

Frank
Joshua

3 Answers


dplyr is also a good way to do this; you just need to "keep" instead of "drop":

library(dplyr)
filter(mywar, (yearc != lead(yearc, 1) & yearc != lag(yearc, 1)) | wartype == "CIVIL")
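
One edge case worth checking: lead() and lag() return NA for the last and first row respectively, and filter() drops rows whose condition evaluates to NA, so a non-civil row at either end of the data can be removed even if its year is unique. The base R sketch below handles those ends explicitly. It is a single-pass approximation of the two Stata drop lines, not an exact replication, and it assumes the rows are already sorted by year; the sample values are made up for illustration.

```r
# Made-up sample data; column names follow the Stata code in the question.
mywar <- data.frame(
  yearc   = c(1970, 1970, 1970, 1971, 1982, 1999, 1999, 2000),
  wartype = c("INTER", "CIVIL", "INTRA", "CIVIL", "CIVIL", "INTER", "CIVIL", "INTER"),
  stringsAsFactors = FALSE
)

n <- nrow(mywar)

# Year of the following and preceding row (NA where there is no neighbour).
next_year <- c(mywar$yearc[-1], NA)
prev_year <- c(NA, mywar$yearc[-n])

# Treat a missing neighbour as "not the same year" instead of NA.
same_as_next <- !is.na(next_year) & mywar$yearc == next_year
same_as_prev <- !is.na(prev_year) & mywar$yearc == prev_year

# Drop non-civil rows that share a year with an adjacent row.
drop_row <- (same_as_next | same_as_prev) & mywar$wartype != "CIVIL"
kept <- mywar[!drop_row, ]
```

Here the final row (2000, "INTER") is kept because its year is not shared with a neighbour, whereas the NA produced by lead() would cause filter() to drop it.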
Matthew
  • Nice answer. It's customary to put a `library(dplyr)` line in when using a package. You could also do something like `!(yearc%in%c(lead(yearc,1),lag(yearc,1)))` for the first clause, I guess. – Frank Jun 05 '15 at 00:15

You're focusing on Stata's if qualifier, but it sounds like you simply want to subset the data frame, hence your use of the drop command in Stata. I also learned Stata before R, and because I relied so heavily on the if qualifier in Stata, I immediately reached for ifelse in R. I later realized that the more relevant technique in R is subsetting. There is a subset() function, but most people prefer subsetting with brackets (see code below).

In your original question you ask how to do two things:

  1. how to delete observations (i.e. rows) that are coded "inter" or "intra" in column c, and
  2. how to mark them as missing

Sample Data

b <- c(1970, 1970, 1970, 1971, 1982, 1999, 1999, 2000, 2001, 2002)
c <- c("inter", "civil", "intra", "civil", "civil", "inter", "civil", "civil", "civil", "civil")
df <- data.frame(b,c)
df
      b     c
1  1970 inter
2  1970 civil
3  1970 intra
4  1971 civil
5  1982 civil
6  1999 inter
7  1999 civil
8  2000 civil
9  2001 civil
10 2002 civil

1. Dropping Observations

If you want to delete observations that are not "civil" in column c, you can subset the data frame to keep only those cases that are "civil":

df2 <- df[df$c=="civil",] 
df2
      b     c
2  1970 civil
4  1971 civil
5  1982 civil
7  1999 civil
8  2000 civil
9  2001 civil
10 2002 civil

The above code creates a new data frame, df2, that is a subset of df, but you can also completely overwrite the original data frame:

df <- df[df$c=="civil",] 

Or, you can generate a new one and then remove the old one, if you don't like your workspace cluttered with lots of data frames:

df2 <- df[df$c=="civil",]
rm(df)

2. Marking Observations as Missing

If you want to mark observations that are not "civil" in column c, you can do that by overwriting them with NA:

df$c[df$c != "civil"] <- NA
df
      b     c
1  1970  <NA>
2  1970 civil
3  1970  <NA>
4  1971 civil
5  1982 civil
6  1999  <NA>
7  1999 civil
8  2000 civil
9  2001 civil
10 2002 civil

You could then use listwise deletion (see the na.omit() command) to remove the cases from whatever analyses you're doing.
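
As a minimal sketch of that listwise deletion step (the three-row data frame here is made up for illustration), na.omit() returns only the rows with no missing values:

```r
# Made-up three-row example in the same shape as the sample data.
df_small <- data.frame(
  b = c(1970, 1970, 1971),
  c = c("inter", "civil", "civil"),
  stringsAsFactors = FALSE
)

# Mark non-civil rows as missing, as in step 2.
df_small$c[df_small$c != "civil"] <- NA

# Listwise deletion: keep only rows without any NA.
df_complete <- na.omit(df_small)
```

Many modelling functions such as glm() apply this kind of deletion by default through their na.action argument, so an explicit call is mainly useful when you want to inspect the reduced data yourself.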

Side Note

Your original Stata code seeks to subset when column b is a duplicate and column c is "inter" or "intra". However, given how your sample data were presented, this seemed to be a redundant concern, which is why my solution above only looks at column c. If you want to match your Stata code more closely, you can do that with

df <- df[order(df$b, df$c), ]                 # sort by year, then by war type
df$duplicate <- duplicated(df$b)              # TRUE for repeat occurrences of a year
df2 <- df[df$c == "civil" & !df$duplicate, ]  # keep first-occurrence civil rows

which

  1. orders the data chronologically by year and then alphabetically by war type
  2. creates a new variable that specifies whether column b is a duplicate year
  3. subsets the data frame to remove undesirable cases.
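
The question mentions that the real data also contain country ids. A minimal sketch of one way to extend the idea to that case, assuming a hypothetical country column (the name and the values below are made up), is to count rows per country-year and drop non-civil rows only where a country-year appears more than once:

```r
# Made-up data; `country` is a hypothetical column name.
mywar <- data.frame(
  country = c("A", "A", "A", "B", "B"),
  yearc   = c(1970, 1970, 1971, 1970, 1970),
  wartype = c("INTER", "CIVIL", "CIVIL", "CIVIL", "INTRA"),
  stringsAsFactors = FALSE
)

# Number of rows sharing each country-year combination.
n_in_year <- ave(seq_along(mywar$yearc), mywar$country, mywar$yearc, FUN = length)

# Keep civil-war rows, plus any row whose country-year is unique.
keep <- mywar$wartype == "CIVIL" | n_in_year == 1
mywar2 <- mywar[keep, ]
```

Unlike the adjacent-row comparison in the Stata code, this does not depend on the sort order of the rows, so the results may differ in cases where duplicate years are not next to each other.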
coip

Try changing your | operator to &. Here is some made up data:

R> b <- c(rep(1:4, each=3))
R> c <- 1:length(b)
R> df <- data.frame(c,b)
R> df$j <- ifelse(df$b != 2 & df$b != 3 & df$b != 1, NA, df$b)
R> df
    c b  j
1   1 1  1
2   2 1  1
3   3 1  1
4   4 2  2
5   5 2  2
6   6 2  2
7   7 3  3
8   8 3  3
9   9 3  3
10 10 4 NA
11 11 4 NA
12 12 4 NA

The last line of your code, mywar <- mywar[!is.na(mywar$yearc),], should work fine as well.

Stedy
  • Thanks for the reply. The ampersand works in that I don't get an error, but the code wasn't fully translated (or it doesn't work the way it works in Stata). `drop if yearc==yearc[_n+1] & wartype!="CIVIL"` somehow identifies the entries in which the same year was coded civil war and inter war for a particular country and drops the row for the inter war entry. The R code that I use doesn't drop anything. I am not sure now what I have to set `mywar$yearc== ` to in order to get the same dropped rows as the Stata code. I think what I have is about a 90% solution. Again, thanks for the input, it helps a lot. – Joshua Nov 04 '14 at 04:43
  • You have probably already seen this, but when I learned R after learning Stata I found this site immensely helpful: http://www.ats.ucla.edu/stat/r/faq/ – Stedy Nov 04 '14 at 04:51