Dask: isin with further use of index to another dask dataframe

Question

The rows in the row.txt.gz and matrix.txt.gz files are in identical order. My goal is to select some rows from the dask dataframe read from 'row.txt.gz' and then extract the corresponding rows from matrix.txt.gz using exactly the same index.

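import dask.dataframe as dd
import pandas as pd

# ROWS: one partition per file, because gzip files cannot be split (blocksize=None)
rows = dd.read_csv('*.row.txt.gz', sep='\t', compression='gzip', blocksize=None)
# MTX: read the same way, so its rows line up with 'rows'
mx = dd.read_csv('*.matrix.txt.gz', sep='\t', compression='gzip', blocksize=None)
# query: a small pandas dataframe holding the identifiers to look up
query_row_file = 'query.txt.gz'
query = pd.read_table(query_row_file, dtype=object, delimiter='\t')
# extract the matching rows from 'rows'
rows_queried = rows[rows['inchikey'].isin(query.METID)]
# Use the index from rows_queried for 'mx'. How?
mx_queried = mx[rows_queried.index]  # this is the line that fails
mx_queried = mx_queried.compute()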

I got the following error, so I must be missing something in my logic. I have just started using dask, so any help would be much appreciated!

pandas/core/indexing.py", line 1269, in _convert_to_indexer
    .format(mask=objarr[mask]))
KeyError: "Int64Index([0, 16, 60, 88, 104, 131, 132, 149, 163, 179, 188, 204, 233, 261,\n 262, 293],\n dtype='int64') not in index"
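A likely reading of this traceback is that `mx[rows_queried.index]` is treated as a column selection, so pandas looks for columns named 0, 16, 60 and so on and fails. The minimal sketch below reproduces that in plain pandas and shows two row-wise alternatives. The tiny frames and the column names `s1` and `s2` are hypothetical stand-ins for one row file and its matrix file; only `inchikey` and `METID` come from the question.

```python
import pandas as pd

# Hypothetical stand-ins for one row file and its matrix file (same row order).
rows = pd.DataFrame({"inchikey": ["A", "B", "C", "D"]})
mx = pd.DataFrame({"s1": [1.0, 2.0, 3.0, 4.0], "s2": [5.0, 6.0, 7.0, 8.0]})
query = pd.DataFrame({"METID": ["B", "D"]})

# Boolean mask over the rows; it lines up with mx because the order is identical.
mask = rows["inchikey"].isin(query["METID"])
rows_queried = rows[mask]

# Indexing with [] and an integer index is a column lookup, hence the KeyError.
try:
    mx[rows_queried.index]
    raised = False
except KeyError:
    raised = True

# Row selection instead: by label with .loc, or directly with the boolean mask.
mx_by_label = mx.loc[rows_queried.index]
mx_by_mask = mx[mask]
```

In dask, the mask form (`mx[mask]`) may carry over when both dataframes have the same number of partitions with matching row counts, which the identical file order and `blocksize=None` suggest. That is an assumption to check against your dask version rather than a guarantee.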

