I am trying to access data from The Cancer Genome Atlas (TCGA) hosted on AWS S3. The following code loads all the .tsv files in TCGA using dask:
from dask.distributed import Client
import numpy as np
import dask.dataframe as dd
client = Client()  # start a client so I can see the dask dashboard UI
dfs = dd.read_csv('s3://tcga-2-open/*/*.tsv', sep='\t')
np.max(dfs.chromosome.values)
However, its output is empty. The content of dfs.chromosome is:
Dask Series Structure:
npartitions=21651
object
...
...
...
...
Name: chromosome, dtype: object
Dask Name: getitem, 43302 tasks
Am I using dask properly?
Just to clarify, there are many .tsv files in the TCGA dataset; for example, here is one of them:
$ aws s3 ls s3://tcga-2-open/0000093b-2b25-4781-9c21-7401eeb3ef88/ --no-sign-request
2020-04-29 21:27:41 3483218 TCGA-READ.72029f21-a40c-42f9-80ea-4c3d9d971279.gene_level_copy_number.tsv
In fact there are 21,651 .tsv files:
import s3fs
s3 = s3fs.S3FileSystem(anon=True)  # public bucket, matching --no-sign-request above
paths = s3.glob('s3://tcga-2-open/*/*.tsv')
print(len(paths))