I have a huge DataFrame (~4 million rows) and I need to search it for a row which has specific columns values for about a million time. Based on the conditions governing my problem, there is only one true answer (one row) for each query. So as soon as the search finds the first result, there's no need to continue the search. But as we know df.loc[df['column']==value]
has to read all the data every time! even if the first row satisfies the search conditions, other 4 million rows has to be read and evaluated! which creates a huge overhead for the search. Is there a way to get the first row satisfying the search conditions without reading and evaluating the rest of the rows?
Asked
Active
Viewed 213 times
0

Nick ODell
- 15,465
- 3
- 32
- 66

MHDMYZ
- 1
- 2
-
set that column into index: `df.set_index('column', inplace=True)`. Then you can do `df.loc[value]`. – Quang Hoang Dec 01 '20 at 21:13
1 Answers
0
firstly you have to set that column as index (as you said you have no duplicated value). then change your data frame to dictionary, and then search your value.
In [1]: import numpy as np, pandas as pd
...:
...: np.random.seed(4)
...: h = 100
...: small_df = pd.DataFrame(np.random.randint(1,1000000,h).reshape(h//4,4))
...: small_df = small_df.set_index(3)
...: small_df.index = small_df.index.astype(str)
...: small_df = small_df.loc[small_df.index.drop_duplicates()]
...: small_df = small_df.T.to_dict()
...:
...:
...: np.random.seed(4)
...: h = h*100000
...: big_df = pd.DataFrame(np.random.randint(1,1000000000,h).reshape(h//4,4))
...: big_df = big_df.set_index(3)
...: big_df.index = big_df.index.astype(str)
...: big_df = big_df.T.to_dict()
/home/amir/.local/bin/ipython3:17: UserWarning: DataFrame columns are not unique, some columns will be omitted.
len(small_df)
In [2]: len(small_df)
Out[2]: 25
In [3]: len(big_df)
Out[3]: 2496856
In [6]: %time small_df['890932']
CPU times: user 4 µs, sys: 0 ns, total: 4 µs
Wall time: 7.15 µs
Out[6]: {0: 962341, 1: 751580, 2: 181308}
In [7]: %time big_df ['115865608']
CPU times: user 5 µs, sys: 0 ns, total: 5 µs
Wall time: 7.39 µs
Out[7]: {0: 448609773, 1: 372731489, 2: 452798904}

Amir saleem
- 1,404
- 1
- 8
- 11