How do you check if all the values in a column in a dataframe exist in another column in another dataframe using Vaex?

Question

I have a dataframe with 160,000 rows and I need to know if these values exist in another column in another different dataframe that has over 7 million rows using Vaex.

I have tried doing this in pandas but it takes way too long to run.

Once I run this code I would like a list or a column that says either "True" or "False" about whether the value exists.

Please provide enough code so others can better understand or reproduce the problem. — Community, Oct 17 '22 at 20:13

score 0 · Answer 1 · answered Oct 21 '22 at 21:10

There are few tricks you can do.

Some ideas:

you can try inner join, and then get the list of unique values, which appear in both dataframes. Then you can use the isin method in the smaller dataframe and that list to get your answer.

Dunno if this will work out of the box, but it would be something like:

df_join = df_small.join(df_big, on='key', allow_duplicates=True)
common_samples = df_join[key].tolist()

df_small['is_in_df_big'] = df_small.key.isin(common_samples) 

# If it is something you gonna reuse a lot, but be worth doing
df_small = df_small.materialize('is_in_df_big') # to put it in memory otherwise it will be lazily recomputed each time you need it.

Similar idea: instead of doing join do something like:

unique_samples = df_small.key.unique()
common_samples = df_big[df_big.key.isin(unique_samples)].key.unqiue()
df_small['is_in_df_big'] = df_small.key.isin(common_samples)

I dunno which one would be faster. I hope this at least will lead to some inspiration if not to the full solution.

I will give this a try, but I hope the same syntax is used in Vaex. That is the main problem I am having. — Lorenzo Correa, Oct 25 '22 at 16:21
The examples above assume vaex dataframes etc... (I could have made a typo or mistake of course, but I wrote that example with vaex in mind) — Joco, Oct 25 '22 at 18:17

How do you check if all the values in a column in a dataframe exist in another column in another dataframe using Vaex?

1 Answers1