I'm working on a data frame (account) with two columns: city, which holds the IP location an account posted from, and register, which holds the location where the account was first registered. I'm using grepl() to subset the rows whose posting location and registration location are both in the state of New York (NY). Below is part of the data, along with the code I used to subset it:

account <- data.frame(city = c("Beijing, China", "New York, NY", "Hoboken, NJ", "Los Angeles, CA", "New York, NY", "Bloomington, IN"),
register = c("New York, NY", "New York, NY", "Wilwaukee, WI", "Rochester, NY", "New York, NY", "Tokyo, Japan"))

sub_data <- subset(account, grepl("NY", city) == "NY" & grepl("NY", register) == "NY")

sub_data
[1] city     register
<0 rows> (or 0-length row.names)

My code didn't work: it returned 0 rows, although at least two rows should have met my selection criteria. What went wrong in my code? I have referenced this previous thread before posting this question.

Chris T.
  • Not able to reproduce the issue with the example. I get 4 rows with that code – akrun May 06 '19 at 16:05
  • Try: `subset(account, grepl("NY", city) & grepl("NY", register))`. Currently, your code will return cases that are both NY and both not NY. – Ritchie Sacramento May 06 '19 at 16:07
  • @akrun It indeed does not produce the issue as described by the OP, however `grepl("NY", city) == grepl("NY", register)` will also return the rows both *not* having NY (in this case rows 3 & 6) whereas `grepl("NY", city) & grepl("NY", register)` seems what the OP wants – CodeNoob May 06 '19 at 16:08
  • It's been edited, sorry for the confusion and I'm trying out the solutions you suggested. – Chris T. May 06 '19 at 16:10
  • @ChrisT. `grepl` returns either `TRUE` or `FALSE`, as described in `?grepl`: "grepl returns a logical vector (match or not for each element of x)". So it makes no sense to compare this to a character (i.e. NY) as you did in `grepl("NY", city) == "NY"`. Instead, you want to check whether **both** are true, as suggested by @H1 – CodeNoob May 06 '19 at 16:12
  • @CodeNoob many thanks, your recommended method works better for large datasets like the one I have. – Chris T. May 06 '19 at 16:16
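
To make the difference between `==` and `&` from the comments above concrete, here is a minimal sketch using the sample data from the question. The names `city_ny`, `register_ny`, `same_flag` and `both_ny` are only illustrative and do not appear in the original post.

```r
# Sample data copied from the question
account <- data.frame(
  city = c("Beijing, China", "New York, NY", "Hoboken, NJ",
           "Los Angeles, CA", "New York, NY", "Bloomington, IN"),
  register = c("New York, NY", "New York, NY", "Wilwaukee, WI",
               "Rochester, NY", "New York, NY", "Tokyo, Japan")
)

# One logical flag per row for each column (illustrative names)
city_ny <- grepl("NY", account$city)
register_ny <- grepl("NY", account$register)

# `==` is TRUE wherever the two flags agree, including FALSE == FALSE,
# so rows where neither column mentions NY are selected as well
same_flag <- which(city_ny == register_ny)

# `&` is TRUE only where both flags are TRUE
both_ny <- which(city_ny & register_ny)

same_flag  # rows 2, 3, 5, 6
both_ny    # rows 2, 5
```

With this data, `==` keeps four rows (which matches the first comment), while `&` keeps only the two rows where both locations are in NY.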

1 Answer

The function grepl already returns a logical vector, so just use the following:

sub_data <- subset(account, 
                   grepl("NY", city) & grepl("NY", register)
                   )

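With something like grepl("NY", city) == "NY", you are asking R whether each value in FALSE TRUE FALSE FALSE TRUE FALSE is equal to "NY". R coerces the logical values to the strings "FALSE" and "TRUE" for the comparison, and neither of those equals "NY", so the condition is FALSE for every row and subset() returns nothing.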
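
As a quick check of that explanation, the following sketch rebuilds the city vector from the question and looks at the comparison step by step. The variable names `flags` and `cmp` are my own and are only there for illustration.

```r
# City values copied from the question
city <- c("Beijing, China", "New York, NY", "Hoboken, NJ",
          "Los Angeles, CA", "New York, NY", "Bloomington, IN")

# grepl() returns one logical value per element
flags <- grepl("NY", city)
flags                 # FALSE  TRUE FALSE FALSE  TRUE FALSE

# Comparing a logical vector with a string compares the values as
# character strings; this is what they look like after conversion
as.character(flags)   # "FALSE" "TRUE" "FALSE" "FALSE" "TRUE" "FALSE"

# None of those strings is "NY", so every element of the comparison is FALSE
cmp <- flags == "NY"
any(cmp)              # FALSE
```

Because the condition is never TRUE, subset() has no rows to keep, which is exactly the empty result shown in the question.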