I'm trying to parallelize reading the contents of 16 gzip files with this script:
    import gzip
    import glob
    from dask import delayed
    from dask.distributed import Client, LocalCluster

    @delayed
    def get_gzip_delayed(gzip_file):
        with gzip.open(gzip_file) as f:
            reads = f.readlines()
        reads = [read.decode("utf-8") for read in reads]
        return reads

    if __name__ == "__main__":
        cluster = LocalCluster()
        client = Client(cluster)

        read_files = glob.glob("*.txt.gz")
        all_files = []
        for file in read_files:
            reads = get_gzip_delayed(file)
            all_files.extend(reads)

        with open("all_reads.txt", "w") as f:
            w = delayed(all_files.writelines)(f)
            w.compute()
However, I get the following error:
> TypeError: Delayed objects of unspecified length are not iterable
How do I parallelize a for loop that accumulates results with extend/append and then writes them to a file? Every Dask example I find ends with some final function applied to the products of the for loop.
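For reference, here is a minimal sketch of the kind of restructuring I imagine (assuming the goal is just to concatenate all decoded lines into one output file): keep one delayed task per file, compute them all together so the reads run in parallel, and only write the results serially once they come back as plain Python lists. I'm not sure this is the idiomatic Dask pattern:

    import gzip
    import glob

    import dask
    from dask import delayed
    from dask.distributed import Client, LocalCluster

    @delayed
    def get_gzip_delayed(gzip_file):
        # Read one gzip file and decode every line to str.
        with gzip.open(gzip_file) as f:
            return [line.decode("utf-8") for line in f.readlines()]

    if __name__ == "__main__":
        cluster = LocalCluster()
        client = Client(cluster)

        # One delayed task per file; dask.compute runs them in parallel
        # and returns a tuple of ordinary Python lists.
        tasks = [get_gzip_delayed(path) for path in glob.glob("*.txt.gz")]
        results = dask.compute(*tasks)

        # Writing happens serially after all reads have finished.
        with open("all_reads.txt", "w") as f:
            for reads in results:
                f.writelines(reads)

Is something along these lines the right direction, or is there a way to keep the extend/append and the file write inside the parallelized part?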