3

Is there an efficient way to select specific rows in a Dask DataFrame? I would like to get only those rows whose index is in a given set of values (using the isin function is not efficient enough for me).

Are there any more efficient alternatives to ddf.loc[ddf.index.isin(list_of_index_values)] and ddf.loc[~ddf.index.isin(list_of_index_values)]?

mlech
  • 31
  • 3

1 Answer

1

You can use the query method. You haven't provided a runnable example, but the format would be something like this:

list_of_index_values = [6, 3]
ddf.query('column in @list_of_index_values')
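A minimal runnable sketch of the same pattern in plain pandas (the column name a and the sample values are made up for illustration; the @variable syntax inside the query string works the same way on a Dask DataFrame):

```python
import pandas as pd

# small illustrative frame; 'a' stands in for whatever column you filter on
df = pd.DataFrame({'a': [1, 3, 5, 6, 9]})

list_of_index_values = [6, 3]

# '@name' inside the query string references the local Python variable
subset = df.query('a in @list_of_index_values')
print(subset['a'].tolist())  # → [3, 6]
```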

EDIT: Just for fun, I benchmarked this in pandas, but I wouldn't expect much variance in Dask.

I have no clue what's stored in your index, so I assumed integers.

from random import randint
import pandas as pd
from datetime import datetime as dt

# build a huge random dataset
lst = [randint(0, 100000) for _ in range(100000000)]

# build a huge random list of lookup values
index = [randint(0, 100000) for _ in range(1000000)]

df = pd.DataFrame(lst, columns=['values'])

isin = dt.now()
df[df['values'].isin(index)]
print(f'total execution time for isin {dt.now()-isin}')

query = dt.now()
df.query('values in @index')
print(f'total execution time for query {dt.now()-query}')

# total execution time for isin 0:01:22.914507
# total execution time for query 0:01:13.794499

If your index is sequential, however, a plain comparison is far faster:

time = dt.now()
df[df['values']>100000]
print(dt.now()-time)
# 0:00:00.128209

It's not even close. You can even build out a range:

time = dt.now()
df[(df['values']>100000) | (df['values'] < 500)]
print(dt.now()-time)
# 0:00:00.650321

Obviously the third method isn't always an option, but it is something to keep in mind if speed is a priority and you just need the index between two values or some such.
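For completeness, the range filter above can also be written with between, which is equivalent to chaining two comparisons with & (a small pandas sketch with made-up bounds, not the benchmark data):

```python
import pandas as pd

df = pd.DataFrame({'values': range(10)})

# the same contiguous range written two ways
m1 = df[(df['values'] > 2) & (df['values'] < 7)]
m2 = df[df['values'].between(3, 6)]  # between() is inclusive on both ends

print(m1['values'].tolist())  # → [3, 4, 5, 6]
print(m2['values'].tolist())  # → [3, 4, 5, 6]
```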

fthomson
  • 773
  • 3
  • 9