0

I have a dataframe with 160,000 rows and I need to know if these values exist in another column in another different dataframe that has over 7 million rows using Vaex.

I have tried doing this in pandas but it takes way too long to run.

Once I run this code I would like a list or a column that says either "True" or "False" about whether the value exists.

1 Answers1

0

There are few tricks you can do.

Some ideas:

  • you can try inner join, and then get the list of unique values, which appear in both dataframes. Then you can use the isin method in the smaller dataframe and that list to get your answer.

Dunno if this will work out of the box, but it would be something like:

df_join = df_small.join(df_big, on='key', allow_duplicates=True)
common_samples = df_join[key].tolist()

df_small['is_in_df_big'] = df_small.key.isin(common_samples) 

# If it is something you gonna reuse a lot, but be worth doing
df_small = df_small.materialize('is_in_df_big') # to put it in memory otherwise it will be lazily recomputed each time you need it. 

Similar idea: instead of doing join do something like:

unique_samples = df_small.key.unique()
common_samples = df_big[df_big.key.isin(unique_samples)].key.unqiue()
df_small['is_in_df_big'] = df_small.key.isin(common_samples) 

I dunno which one would be faster. I hope this at least will lead to some inspiration if not to the full solution.

Joco
  • 803
  • 4
  • 7
  • I will give this a try, but I hope the same syntax is used in Vaex. That is the main problem I am having. – Lorenzo Correa Oct 25 '22 at 16:21
  • The examples above assume vaex dataframes etc... (I could have made a typo or mistake of course, but I wrote that example with vaex in mind) – Joco Oct 25 '22 at 18:17