need to flatten list to use intersect in R

Question

I have full-name data, and I used strsplit() to split each name into its elements.

# Dataframe with a `names` column (complete names)
df <- data.frame(
    names =
          c("Adam, R, Goldberg, MALS, MBA", 
          "Adam, R, Goldberg, MEd", 
          "Adam, S, Metsch, MBA", 
          "Alan, Haas, MSW", 
          "Alexandra, Dumas, Rhodes, MA", 
          "Alexandra, Ruttenberg, PhD, MBA"),
    stringsAsFactors=FALSE)

# Add a column with the split names (it is actually a list)
df$splitnames <- strsplit(df$names, ', ')

I also have a vector of degrees:

degrees<-c("EdS","DEd","MEd","JD","MS","MA","PhD","MSPH","MSW","MSSA","MBA",
           "MALS","Esq","MSEd","MFA","MPA","EdM","BSEd")

I would like to get, for each name, the intersection of its elements with the degrees vector.

I'm not sure how to flatten the name list so I can compare the two vectors using intersect. When I tried unlist(df$splitnames, recursive=FALSE), it returned each element separately. Any help is appreciated.

`lapply(df$splitname, intersection, degrees)`? – mnel, Feb 20 '13 at 04:29
@agstudy, yes. untested (and thus typo) – mnel, Feb 20 '13 at 04:34

Oscar de León · Accepted Answer · 2013-02-20T05:47:16.077

3

Try

df$intersect <- lapply(X=df$splitnames, FUN=intersect, y=degrees)

That gives you a list with one element per row: the intersection of that row's split name with degrees (e.g. the first element is intersect(df$splitnames[[1]], degrees)). If you want it as a character vector instead:

sapply(X=df$intersect, FUN=paste, collapse=', ')

I assume you need it as a vector, since the complete names presumably came from one (for instance, a data frame column), whereas strsplit returns a list.
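As a minimal end-to-end sketch of the two steps above: the block rebuilds the `df` and `degrees` objects from the question so that it runs on its own, and the column name `degrees_found` is made up here for illustration.

```r
# Rebuild the question's data so this block is self-contained
df <- data.frame(
  names = c("Adam, R, Goldberg, MALS, MBA",
            "Adam, R, Goldberg, MEd",
            "Adam, S, Metsch, MBA",
            "Alan, Haas, MSW",
            "Alexandra, Dumas, Rhodes, MA",
            "Alexandra, Ruttenberg, PhD, MBA"),
  stringsAsFactors = FALSE)
degrees <- c("EdS", "DEd", "MEd", "JD", "MS", "MA", "PhD", "MSPH", "MSW",
             "MSSA", "MBA", "MALS", "Esq", "MSEd", "MFA", "MPA", "EdM", "BSEd")

# strsplit() returns a list with one character vector per row
df$splitnames <- strsplit(df$names, ", ")

# Intersect each row's pieces with the degree vector; the result is still a list
df$intersect <- lapply(df$splitnames, intersect, y = degrees)

# Collapse each row's degrees into a single string
# (degrees_found is a hypothetical column name)
df$degrees_found <- sapply(df$intersect, paste, collapse = ", ")

print(df[, c("names", "degrees_found")])
```

With this data, the first row should come out as "MALS, MBA" and the last as "PhD, MBA", so each name keeps its own degrees.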

Does that work? If not, please try to clarify your intention.

Good luck!

edited Feb 20 '13 at 05:47

answered Feb 20 '13 at 04:31


The `unlist(df$intersect)` would not have the suggested consequences, since each name can contain one or more degrees. I fixed it in the post. – Oscar de León Feb 20 '13 at 04:58
Thanks a lot! I was thinking about it the wrong way. The added hint about making it a vector helps too. – user1495088 Feb 20 '13 at 15:30

score 0 · Answer 2 · answered Feb 20 '13 at 04:42


To flatten the list, you can use unlist:

hh <- unlist(df$splitnames)
intersect(hh, degrees)

For example:

ll <- list(c("Adam", "R", "Goldberg", "MALS", "MBA"),
           c("Adam", "R", "Goldberg", "MEd"))
hh <- unlist(ll)

intersect(hh, degrees)
[1] "MALS" "MBA"  "MEd"

which here is equivalent to:

hh[hh %in% degrees]
[1] "MALS" "MBA"  "MEd"

To get the differences, you can use

setdiff(hh, degrees)
[1] "Adam"     "R"        "Goldberg"
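Note that flattening discards which name each piece came from. If that association matters, one possible sketch (not part of the original answer) is to record the originating list position before unlisting and regroup afterwards. The variable names `row_id`, `is_degree` and `by_row` are made up for illustration, and `degrees` is shortened here.

```r
# Two split names, as in the example above
ll <- list(c("Adam", "R", "Goldberg", "MALS", "MBA"),
           c("Adam", "R", "Goldberg", "MEd"))
degrees <- c("MEd", "MBA", "MALS")  # shortened vector, for illustration only

hh <- unlist(ll)                            # flat vector; boundaries are gone
row_id <- rep(seq_along(ll), lengths(ll))   # list position of each piece

is_degree <- hh %in% degrees

# Regroup the matching pieces by the list element they came from;
# the explicit factor levels keep entries for names without any degree
by_row <- split(hh[is_degree],
                factor(row_id[is_degree], levels = seq_along(ll)))
print(by_row)
```

This uses `lengths()`, which needs R 3.2.0 or later. The `lapply` approach avoids the bookkeeping entirely.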

...


agstudy


Since the problem appears to be determining the degree from each name, answers should be assigned to the same records (*i.e.* degrees associated with originating name), so unlist really is not an option because any name could have more than one degree (in fact, it happens in the example data). – Oscar de León Feb 20 '13 at 08:00
