
I have been trying to access data from The Cancer Genome Atlas (TCGA) hosted on AWS S3.

This code loads all the .tsv files in TCGA using dask:

from dask.distributed import Client
import numpy as np
import dask.dataframe as dd

client = Client()  # start a local cluster; its dashboard UI shows dask's task status
dfs = dd.read_csv('s3://tcga-2-open/*/*.tsv', sep='\t')
np.max(dfs.chromosome.values)

However, its output is empty.


The content of dfs.chromosome is:

Dask Series Structure:
npartitions=21651
    object
       ...
     ...  
       ...
       ...
Name: chromosome, dtype: object
Dask Name: getitem, 43302 tasks
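
Note that this repr only describes the structure: a dask series is lazy, so `np.max(dfs.chromosome.values)` builds a task graph rather than returning a value, and nothing concrete appears until the graph is computed. A minimal sketch of forcing the computation (assuming anonymous access to this public bucket, mirroring the `--no-sign-request` flag used below):

import dask.dataframe as dd

# anon=True mirrors `aws s3 ls ... --no-sign-request`
# (an assumption; configured credentials should work too)
dfs = dd.read_csv('s3://tcga-2-open/*/*.tsv', sep='\t',
                  storage_options={'anon': True})

# .max() is lazy; .compute() actually runs the task graph
# and returns a concrete value
print(dfs.chromosome.max().compute())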

Am I using dask properly?


Just to clarify, there are many .tsv files in the TCGA dataset; for example, here is one of them:

$ aws s3 ls s3://tcga-2-open/0000093b-2b25-4781-9c21-7401eeb3ef88/ --no-sign-request
2020-04-29 21:27:41 3483218   TCGA-READ.72029f21-a40c-42f9-80ea-4c3d9d971279.gene_level_copy_number.tsv

In fact there are 21,651 .tsv files:

import s3fs

# anon=True should also work here, since the bucket is public
s3 = s3fs.S3FileSystem(anon=False)
l = s3.glob('s3://tcga-2-open/*/*.tsv')
print(len(l))  # 21651
0x90

1 Answer

I see files named like

's3://tcga-2-open/0040de75-8b0b-4954-9050-58063996b02e/LEONE_p_TCGA_103_243_257_N_GenomeWideSNP_6_C07_1300632.nocnv_grch38.seg.txt'

i.e., with a ".txt" ending, not ".tsv". The time dask spent was probably just listing the contents of the one million data directories (this should be improved by the latest, as-yet-unreleased version of s3fs).
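
If those segment files are of interest, the same pattern with the txt ending should pick them up (a sketch, assuming the .seg.txt files are tab-separated, as the name suggests):

import dask.dataframe as dd

# glob the .seg.txt files instead of .tsv
# (assumes tab-separated content, per the file name)
segs = dd.read_csv('s3://tcga-2-open/*/*.seg.txt', sep='\t',
                   storage_options={'anon': True})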

mdurant
  • not quite true: `aws s3 ls s3://tcga-2-open/0000093b-2b25-4781-9c21-7401eeb3ef88/ --no-sign-request` will output `2020-04-29 21:27:41 3483218 TCGA-READ.72029f21-a40c-42f9-80ea-4c3d9d971279.gene_level_copy_number.tsv` – 0x90 Nov 26 '20 at 03:04
  • Perhaps you were hit by the "1000 keys" bug; see what `s3.glob('s3://tcga-2-open/*/*.tsv')` gets you. – mdurant Nov 26 '20 at 14:20
  • How do you define `s3`? When I do `s3 = boto3.resource('s3')` it doesn't work. – 0x90 Nov 26 '20 at 14:29
  • `s3 = s3fs.S3FileSystem()` – mdurant Nov 26 '20 at 14:31
  • It has 21,651 such .tsv files. – 0x90 Nov 26 '20 at 15:38
  • Is [this](https://stackoverflow.com/questions/54314563/how-to-get-more-than-1000-objects-from-s3-by-using-list-objects-v2/54314628) the way to resolve it? Any suggestion on how to do it the right way with dask? – 0x90 Nov 29 '20 at 15:20
  • Use the latest s3fs. You *could* use boto to get the list of files and pass it explicitly to Dask, if you want, but s3fs should be able to do it for you in one step (see the sketch after this thread). – mdurant Nov 30 '20 at 13:59
  • If you get the chance to share a simple excerpt that shows this in one step, it would be useful. – 0x90 Nov 30 '20 at 14:03
  • As you did, but with s3fs from master (`pip install git+https://github.com/dask/s3fs`, perhaps with `--upgrade`). – mdurant Nov 30 '20 at 14:34
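
A simple excerpt of the boto-based fallback mentioned above might look like the following (a sketch, not tested against this bucket; `list_tsv_keys` is a hypothetical helper, and the paginator is what gets past the 1000-keys-per-response limit of `list_objects_v2`):

import boto3
import dask.dataframe as dd
from botocore import UNSIGNED
from botocore.config import Config

def list_tsv_keys(bucket='tcga-2-open'):
    # hypothetical helper: page through the listing, since
    # list_objects_v2 returns at most 1000 keys per response
    s3 = boto3.client('s3', config=Config(signature_version=UNSIGNED))
    paginator = s3.get_paginator('list_objects_v2')
    keys = []
    for page in paginator.paginate(Bucket=bucket):
        for obj in page.get('Contents', []):
            if obj['Key'].endswith('.tsv'):
                keys.append(obj['Key'])
    return keys

keys = list_tsv_keys()
print(len(keys))  # expected to report 21,651 for this bucket

# pass the explicit list of paths to dask instead of a glob
dfs = dd.read_csv(['s3://tcga-2-open/' + k for k in keys],
                  sep='\t', storage_options={'anon': True})

With a recent enough s3fs, though, the glob in the question should return the complete listing in one step, and this fallback becomes unnecessary.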