The rows in the row.txt.gz and matrix.txt.gz files are in the same order. I want to select some rows from a dask dataframe read from 'row.txt.gz', and then extract the corresponding rows from matrix.txt.gz using exactly the same index.
import dask.dataframe as dd
import pandas as pd
# ROWS
rows = dd.read_csv('*.row.txt.gz', sep='\t', compression='gzip', blocksize=None)
# MTX
mx = dd.read_csv('*.matrix.txt.gz', sep='\t', compression='gzip', blocksize=None)
# query
query_row_file = 'query.txt.gz'
query = pd.read_table(query_row_file, dtype=object, delimiter='\t')
# extract the data from rows
rows_queried = rows[rows['inchikey'].isin(query.METID)]
# Use index from rows_queried for 'mx'. How?
mx_queried = mx[rows_queried.index]
mx_queried = mx_queried.compute()
I got the following error, so I must be missing something in my logic. I have just started using dask, so any help would be much appreciated!
pandas/core/indexing.py", line 1269, in _convert_to_indexer
    .format(mask=objarr[mask]))
KeyError: "Int64Index([0, 16, 60, 88, 104, 131, 132, 149, 163, 179, 188, 204, 233, 261,\n 262, 293],\n dtype='int64') not in index"
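To narrow down where the KeyError might come from, here is a minimal sketch in plain pandas for a single row/matrix file pair. The small frames, the column names `s1` and `s2`, and the helper `select_matrix_rows` are hypothetical stand-ins, not the real data. It suggests that `mx[rows_queried.index]` treats the index values as column labels, while `.loc` or a boolean mask selects rows. Because dask's `read_csv` restarts the index at 0 for each file, a positional mask applied to each file pair is probably safer than relying on index labels across files.

```python
import pandas as pd

# Hypothetical stand-ins for one row file and its matrix file (same row order).
rows = pd.DataFrame({"inchikey": ["A", "B", "C", "D"]})
mx = pd.DataFrame({"s1": [1.0, 2.0, 3.0, 4.0], "s2": [5.0, 6.0, 7.0, 8.0]})
query = pd.DataFrame({"METID": ["B", "D"]})

rows_queried = rows[rows["inchikey"].isin(query["METID"])]

# Indexing with [] and an Index looks the values up as COLUMN labels,
# so integer row labels are not found and a KeyError is raised.
try:
    mx[rows_queried.index]
    raised = False
except KeyError:
    raised = True

# Row selection by label works with .loc instead.
mx_queried = mx.loc[rows_queried.index]


def select_matrix_rows(rows_part, mx_part, wanted):
    """Select matrix rows by position, using a mask built from the row file.

    Assumes rows_part and mx_part have the same length and row order.
    """
    mask = rows_part["inchikey"].isin(wanted).to_numpy()
    return mx_part[mask]


mx_by_mask = select_matrix_rows(rows, mx, query["METID"])
```

With `blocksize=None` each file becomes one partition, so a helper like this could in principle be applied to matching partitions of `rows` and `mx` (for example through dask's `map_partitions`). That wiring is an assumption here and would depend on both globs listing the files in the same order.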