I have a Koalas data frame with approximately 6 million rows. For every row, I need to extract the row's values and look them up in a list (the list has about 30K elements). If the values are found, the result is True, otherwise False, and the results are collected into a boolean array as output.

I know one simple way to do that is to iterate over every row using the iterrows() method, but that is time-consuming. I am looking for a recommendation that makes the process faster.

For example, a sample data frame is:

        species     population
panda     bear          1864
polar     bear          22000
koala     marsupial     80000

Now I have a list that holds combinations of values from my columns. For every row, I take its values, e.g. (bear, 1864); if that tuple is found in the test list, I append True to an output list, otherwise False.

test_list = [("bear", 189), ("bear", 1864), ("marsupial", 9), ..... ]

The length of test_list is approximately 30K.

A sample output would be:

output = [True, False, False]

Every row of the sample data frame is checked. The first row has the values (bear, 1864), which are in the list, so the output list has True as its first element. The second row has the values (bear, 22000), which are not present in the list, so False is appended to the output list, and so on.

1 Answer

I think you are looking for the apply function. See the pandas documentation, whose API Koalas follows: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html

[IN] import databricks.koalas as ks

     # (species, population) pairs to look up
     test_list = [("bear", 189), ("bear", 1864), ("marsupial", 9)]
     df = ks.DataFrame([["bear", 1864], ["bear", 22000], ["marsupial", 80000]],
                       columns=["species", "population"],
                       index=["panda", "polar", "koala"])
     # Test each row's (species, population) tuple for membership in test_list
     df.apply(lambda row: (row["species"], row["population"]) in test_list, axis=1).to_list()

[OUT]
    [True, False, False]
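
A possible refinement, offered only as a sketch: `in test_list` scans the list for every row, so with about 30K entries and 6 million rows the membership test itself can dominate the run time. Converting the list to a set once makes each lookup a hash lookup on average. The plain-Python example below shows just that idea; `rows` and `lookup` are hypothetical names standing in for the data frame's rows and the converted list, and nothing here has been benchmarked on Koalas.

```python
# Hypothetical sample data mirroring the question: one (species, population) tuple per row.
rows = [("bear", 1864), ("bear", 22000), ("marsupial", 80000)]
test_list = [("bear", 189), ("bear", 1864), ("marsupial", 9)]

# Build the set once; tuples of strings and ints are hashable,
# so each membership test is an average-case constant-time hash lookup.
lookup = set(test_list)

# One boolean per row: True if the row's tuple appears in the lookup set.
output = [row in lookup for row in rows]
print(output)  # [True, False, False]
```

Under the same assumption, the set could be used in place of test_list inside the apply lambda, leaving the rest of the call unchanged.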
Josh Herzberg