I have a Koalas DataFrame with approximately 6 million rows. For every row, I need to extract the row's values and look them up in a list (the list has about 30k elements). If the values are found, the result is True, otherwise False, and the results are collected into a boolean array.
I know one simple way to do this is to iterate over every row with the iterrows() method, but that is time-consuming. I am looking for a recommendation that makes the process faster.
For example, the sample data frame is:
       species    population
panda  bear       1864
polar  bear       22000
koala  marsupial  80000
I also have a list containing combinations of values from these columns. For every row, I take its values as a tuple, e.g. ("bear", 1864). If that tuple is found in the test list, I append True to an output list, otherwise False.
test_list = [("bear", 189), ("bear", 1864), ("marsupial", 9), ..... ]
The length of test_list is approximately 30k.
The output for the sample would be:
output = [True, False, False]
Every row of the sample data frame is checked. The first row has the values ("bear", 1864), which are in the list, so the first element of the output list is True. The second row has the values ("bear", 22000), which are not in the list, so False is appended to the output list, and so on.
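To make the setup reproducible, here is a minimal sketch of the slow row-by-row approach described above. It is only an illustration: plain pandas stands in for Koalas (an assumption, so that it runs without a Spark session), the index labels are taken from the sample, and test_list is shortened to the three tuples shown.

```python
import pandas as pd

# Assumption: plain pandas stands in for Koalas so the sketch runs without Spark.
df = pd.DataFrame(
    {
        "species": ["bear", "bear", "marsupial"],
        "population": [1864, 22000, 80000],
    },
    index=["panda", "polar", "koala"],
)

# Shortened stand-in for the real list of about 30k tuples.
test_list = [("bear", 189), ("bear", 1864), ("marsupial", 9)]

# Slow baseline: visit every row and search the list for its (species, population) pair.
output = []
for _, row in df.iterrows():
    output.append((row["species"], row["population"]) in test_list)

print(output)  # [True, False, False]
```

With 6 million rows and a 30k-element list, this loop does a linear search of the list for every row, which is the part I would like to avoid.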