Python Pandas Dataframe check column of lists and return ID from another Dataframe

Question

I have a pandas DataFrame df1 with a default integer index and a column of lists. It looks like this:

index   IDList
0       [1,3,5,7]
1       [2,4,5,8]
2       [6,8,9]
3       [1,2]

I have another pandas DataFrame df2, which has NewID as its index and also a column of lists. It looks like this:

NewID   IDList
1       [3]
2       [4,5]
3       [1,7]
4       [2]
5       [9,3]
6       [8]
7       [6]

For each row of df1, I need to check every item in df1.IDList against the lists in df2.IDList and, wherever an item appears, collect the corresponding df2.NewID into a list.
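Stated in plain Python, this rule might be sketched as follows. It is a minimal illustration, not a vectorized solution: the `lookup` dictionary is a name introduced here, and both frames are rebuilt from the sample data in this question.

```python
from collections import defaultdict

import pandas as pd

# Sample data from the question; df2 uses NewID as its index.
df1 = pd.DataFrame({"IDList": [[1, 3, 5, 7], [2, 4, 5, 8], [6, 8, 9], [1, 2]]})
df2 = pd.DataFrame(
    {"IDList": [[3], [4, 5], [1, 7], [2], [9, 3], [8], [6]]},
    index=pd.Index(range(1, 8), name="NewID"),
)

# Reverse lookup (hypothetical helper): ID -> every NewID whose list contains it.
lookup = defaultdict(list)
for new_id, ids in df2["IDList"].items():
    for i in ids:
        lookup[i].append(int(new_id))

# For each df1 list, concatenate the NewIDs of its items; unknown IDs add nothing.
df1["NewID"] = df1["IDList"].map(
    lambda ids: [n for i in ids for n in lookup.get(i, [])]
)

print(df1["NewID"].tolist())
# [[3, 1, 5, 2, 3], [4, 2, 2, 6], [7, 6, 5], [3, 4]]
```

The matches come out in the order of the items in each df1 list, so ID 3 contributes both NewID 1 and NewID 5 right after each other.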

So the returned df1 DataFrame would look like:

index   IDList      NewID
0       [1,3,5,7]   [3,1,2,3,5]
1       [2,4,5,8]   [4,2,2,6]
2       [6,8,9]     [7,6,5]
3       [1,2]       [3,4]

EDIT: Note that an ID can occur in the IDList of multiple df2 rows (for example, ID 3 from df1.IDList shows up in df2 rows 1 AND 5).

I was thinking of some kind of np.where statement that incorporates 'any' and a list comprehension, but I am uncertain how to apply it so that each IDList in df1 is checked against the whole of df2.IDList. Maybe some kind of .stack() or .melt()? This would be easy in a spreadsheet with a vlookup of df2...

Help appreciated...

Psidom · Accepted Answer · 2017-04-15T19:04:40.470

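import numpy as np
import pandas as pd

# Expand df2 so each ID in IDList gets its own entry, then map ID -> NewID.
# Assumes NewID is a regular column; run df2 = df2.reset_index() first if it is the index.
flat_ids = pd.DataFrame({
    "NewID": np.repeat(df2.NewID.values, df2.IDList.str.len().tolist()),
    "IDList": [x for l in df2.IDList for x in l]
}).set_index("IDList").NewID

# Look up all IDs of each df1 list at once with loc (duplicate IDs return every match)
df1['NewID'] = df1['IDList'].map(lambda x: flat_ids.loc[x].tolist())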
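For reference, a self-contained sketch of the same approach on the sample data from the question. It assumes NewID is moved from the index into a column with `reset_index()`, and it calls NumPy directly (`np.repeat`), since the `pd.np` alias is not available in recent pandas releases.

```python
import numpy as np
import pandas as pd

# Sample data from the question; NewID starts out as the index of df2.
df1 = pd.DataFrame({"IDList": [[1, 3, 5, 7], [2, 4, 5, 8], [6, 8, 9], [1, 2]]})
df2 = pd.DataFrame(
    {"IDList": [[3], [4, 5], [1, 7], [2], [9, 3], [8], [6]]},
    index=pd.Index(range(1, 8), name="NewID"),
).reset_index()  # assumption: turn NewID into a regular column

# Repeat each NewID once per item in its list and flatten the lists alongside,
# giving a Series indexed by ID whose values are the matching NewIDs.
flat_ids = pd.DataFrame({
    "NewID": np.repeat(df2["NewID"].to_numpy(), df2["IDList"].str.len().to_numpy()),
    "IDList": [x for ids in df2["IDList"] for x in ids],
}).set_index("IDList")["NewID"]

# loc with a list returns every match for each ID, in the order of the list.
df1["NewID"] = df1["IDList"].map(lambda ids: flat_ids.loc[ids].tolist())

print(df1["NewID"].tolist())
# [[3, 1, 5, 2, 3], [4, 2, 2, 6], [7, 6, 5], [3, 4]]
```

In the sample data every ID in df1 also occurs somewhere in df2. If that is not guaranteed, note that `.loc` with a list containing a missing label raises a KeyError in recent pandas versions, so the lists would need to be filtered against `flat_ids.index` first.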

edited Apr 15 '17 at 19:04

answered Apr 15 '17 at 18:21

Shoot, there may be duplicates in the IDList column of df2. I'll edit. – clg4 Apr 15 '17 at 18:45
OK, I got it wrong. This should now also work if there are duplicates in the IDList column. – Psidom Apr 15 '17 at 19:05
Getting: TypeError: repeat() takes 2 positional arguments but 3 were given – clg4 Apr 15 '17 at 19:33
Maybe some parentheses are unbalanced? Or which versions of Python and pandas are you using? On Python 2.7.9 and pandas 0.19.2 it seems to work fine. – Psidom Apr 15 '17 at 19:48
Python 3.4.5, pandas 0.17.1 – clg4 Apr 15 '17 at 19:49
You might consider upgrading your pandas version, that one is pretty old. Not necessarily the problem, though. – Psidom Apr 15 '17 at 20:02
It was the pandas version. Thanks so much, great job on a challenging question. This works perfectly. – clg4 Apr 19 '17 at 02:41