
After searching for a while I haven't found an elegant solution to this (usually pedantic answers like "just vectorize it" which may not apply all the time), so I thought I'd ask.

The simple problem is this: I need to loop over 2 control variables. (this is what's usually asked, and answered curtly)

The real (specific) problem I have, which may not apply to everyone looking for an answer to this type of question, is this: I have a data frame. Let's say it's payroll data.

ID,FIRST_NAME,LAST_NAME,PAYDATE,AMT
912367,Jim,Smith,1/1/2000,5000
1102467,LAURA,JAMES,1/1/2000,5000
812367,DAVID,johnson,1/1/2000,5000
555555,ian,Smith,1/1/2000,5000
912367,Jim,SMITH,1/8/2000,4000
...

And yes, the names are dirty like that. Say Unnamed Boss comes around and says, do some stuff with this and other data... and gives you a list of names. Of course, they're properly formatted:

Smith,Jim R
Fields,Samantha
Smith,Kelly
Lensdotter,Patricia

I chose to break them (easy in a csv) to read them in as something similar to

fnames <- c("Jim", "Samantha", "Kelly", "Patricia")

and the associated last names in lnames (i.e. 2 variables). Then I read in the data frame and did some nested tests and greps (to ignore case). I searched for easier ways and found how to "python zip" the lists, etc., but I was wondering if there was an easier way?

my code is very similar to:

EIDs <- vector(mode = "integer")
for (i in seq_along(lnames)) {
  l <- lnames[i]
  f <- fnames[i]
  # rows whose last name matches, ignoring case
  paycut1 <- payroll[grepl(l, payroll$LAST_NAME, ignore.case = TRUE), ]
  # of those, rows whose first name also matches
  paycut2 <- paycut1[grepl(f, paycut1$FIRST_NAME, ignore.case = TRUE), ]
  if (nrow(paycut2) > 0) {
    print(paste0(l, ", ", f, " Has EID: ", paycut2[1, 1]))
    EIDs <- c(EIDs, paycut2[1, 1])
  } else {
    print(paste0(l, ", ", f, " NOT in Payroll Data"))
  }
}

so I can grab the IDs associated with the names out of the file (so I don't have to deal with names!). Any suggestions? (I don't want to have to use the for (i in range): construct, which is kind of inelegant, as opposed to a more C/Python-like for i,j: construct.)

(Sorry for the explanation at the beginning, but I think that searching for a question like this deserves an answer, and not everyone can frame a question right, so answers like "just vectorize it", which may not apply in their situation, dissuade them from continuing to ask.)

P.S. If I'm going about it the completely wrong way, I'm not averse to other points of view. I come from a C background, so I'm used to loops and non-vectorized code. I just couldn't see how to vectorize this. Criticism, though only helpful criticism, is welcomed.

Rufus Shinra
  • Why not do `payroll$fullname <- tolower(paste(payroll$LAST_NAME,payroll$FIRST_NAME,sep=", "))` ; `names <- data.frame(fullname = tolower(paste(lnames,fnames,sep=", ")))` and `merge(payroll,names,all.x=FALSE,all.y=TRUE)` – scoa Nov 11 '15 at 18:11
  • @scoa, why not post that as an answer? OP: your code doesn't actually run -- the variable `n` doesn't exist, `EMPcut2` doesn't exist ... can you clean it up so it runs? – Ben Bolker Nov 11 '15 at 18:17
  • Sorry, the EMPcut was left over from the original cut-and-pasted code, as was the n. I forgot that one; it should run now. – Rufus Shinra Nov 11 '15 at 19:00

1 Answer


Just vectorize it!

More seriously, your code doesn't really look like R code - you really don't want to nest loops if you can help it.

Here's how I would approach this.

First we clean up the names:

payroll$FIRST_NAME <- toupper(payroll$FIRST_NAME)
payroll$LAST_NAME <- toupper(payroll$LAST_NAME)
names$V2 <- toupper(sub(" .*", "", names$V2))
names$V1 <- toupper(names$V1)

Then we can get those that match using an inner_join:

library(dplyr)
inner_join(names, payroll, by = c(V2 = "FIRST_NAME", V1 = "LAST_NAME"))

     V1  V2     ID  PAYDATE  AMT
1 SMITH JIM 912367 1/1/2000 5000
2 SMITH JIM 912367 1/8/2000 4000
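
To get at the IDs themselves, the join result can be stored and its ID column pulled out. The following is only a sketch under a few assumptions: the sample data is rebuilt inline (with PAYDATE and AMT left out and the middle initial already stripped) so the snippet runs on its own, and `matched` and `EIDs` are just illustrative names.

```r
library(dplyr)

# Sample data rebuilt inline so the snippet is self-contained
# (PAYDATE and AMT are omitted for brevity; names are already upper-cased)
payroll <- data.frame(
  ID = c(912367, 1102467, 812367, 555555, 912367),
  FIRST_NAME = toupper(c("Jim", "LAURA", "DAVID", "ian", "Jim")),
  LAST_NAME = toupper(c("Smith", "JAMES", "johnson", "Smith", "SMITH")),
  stringsAsFactors = FALSE
)
names <- data.frame(
  V1 = toupper(c("Smith", "Fields", "Smith", "Lensdotter")),
  V2 = toupper(c("Jim", "Samantha", "Kelly", "Patricia")),
  stringsAsFactors = FALSE
)

# Keep the join result (`matched` is a hypothetical name) ...
matched <- inner_join(names, payroll,
                      by = c(V2 = "FIRST_NAME", V1 = "LAST_NAME"))

# ... then pull out each ID once, even if it appears on several pay dates
EIDs <- unique(matched$ID)
EIDs
```

The resulting vector can then be used to subset other data frames, for example with `other_df[other_df$ID %in% EIDs, ]`, where `other_df` stands in for whatever table holds the same IDs.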

And those that do not match, using an anti_join:

anti_join(names, payroll, by = c(V2 = "FIRST_NAME", V1 = "LAST_NAME"))
          V1       V2
1      SMITH    KELLY
2 LENSDOTTER PATRICIA
3     FIELDS SAMANTHA
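
If dplyr is not available, roughly the same two results can be sketched in base R with `merge()` and `%in%`. This is an alternative to the joins above, not what was run for this answer; the data is rebuilt inline, and `matched`, `unmatched`, `key_req` and `key_pay` are illustrative names.

```r
# Sample data rebuilt inline, already cleaned to upper case
payroll <- data.frame(
  ID = c(912367, 1102467, 812367, 555555, 912367),
  FIRST_NAME = toupper(c("Jim", "LAURA", "DAVID", "ian", "Jim")),
  LAST_NAME = toupper(c("Smith", "JAMES", "johnson", "Smith", "SMITH")),
  stringsAsFactors = FALSE
)
names <- data.frame(
  V1 = toupper(c("Smith", "Fields", "Smith", "Lensdotter")),
  V2 = toupper(c("Jim", "Samantha", "Kelly", "Patricia")),
  stringsAsFactors = FALSE
)

# Inner join: merge() keeps only rows present in both tables by default
matched <- merge(names, payroll,
                 by.x = c("V1", "V2"),
                 by.y = c("LAST_NAME", "FIRST_NAME"))

# Anti join: build one key per row and keep requested names with no payroll key
# (a tab separator is assumed not to occur inside any name)
key_req <- paste(names$V1, names$V2, sep = "\t")
key_pay <- paste(payroll$LAST_NAME, payroll$FIRST_NAME, sep = "\t")
unmatched <- names[!key_req %in% key_pay, ]
```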

Here's how I got the data in:

payroll <- read.table(text = "ID,FIRST_NAME,LAST_NAME,PAYDATE,AMT
912367,Jim,Smith,1/1/2000,5000
1102467,LAURA,JAMES,1/1/2000,5000
812367,DAVID,johnson,1/1/2000,5000
555555,ian,Smith,1/1/2000,5000
912367,Jim,SMITH,1/8/2000,4000", header=TRUE, sep = ",")


names <- read.table(text="Smith,Jim R
Fields,Samantha
Smith,Kelly
Lensdotter,Patricia", header=FALSE, sep = ",")
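
For the literal question of walking two vectors in lockstep (the "python zip" idea), base R's `mapply()` is probably the closest fit. The sketch below assumes the `lnames` and `fnames` vectors from the question; `lookup_id` is a hypothetical helper, and it compares names exactly (ignoring case) rather than with `grepl`.

```r
# Name vectors as described in the question, plus the sample payroll data
lnames <- c("Smith", "Fields", "Smith", "Lensdotter")
fnames <- c("Jim", "Samantha", "Kelly", "Patricia")
payroll <- data.frame(
  ID = c(912367, 1102467, 812367, 555555, 912367),
  FIRST_NAME = c("Jim", "LAURA", "DAVID", "ian", "Jim"),
  LAST_NAME = c("Smith", "JAMES", "johnson", "Smith", "SMITH"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: first ID whose last AND first name match, ignoring case;
# returns NA when the person is not in the payroll data
lookup_id <- function(l, f) {
  hit <- toupper(payroll$LAST_NAME) == toupper(l) &
    toupper(payroll$FIRST_NAME) == toupper(f)
  if (any(hit)) payroll$ID[hit][1] else NA_real_
}

# mapply() passes lnames[i] and fnames[i] together on each call,
# so no explicit index variable is needed
EIDs <- mapply(lookup_id, lnames, fnames, USE.NAMES = FALSE)
```

Names with no match come back as `NA`, so `EIDs[!is.na(EIDs)]` leaves only the IDs that were found.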
jeremycg
  • There are no nested loops, just nested test statements. I see how you're getting the data in, but I can't see the IDs in the output. Was there more to this? I need the IDs to perform actions on other data frames. – Rufus Shinra Nov 11 '15 at 19:02
  • You should be getting out data frames - put a `mydat <- ` at the start of the line, then you can access IDs using `mydat$ID` – jeremycg Nov 11 '15 at 19:30
  • Thanks! just what I was looking for. – Rufus Shinra Nov 12 '15 at 00:09